[ https://issues.apache.org/jira/browse/HDFS-10729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kihwal Lee updated HDFS-10729:
------------------------------
    Status: Patch Available  (was: Open)

> NameNode crashes when loading edits because max directory items is exceeded
> ---------------------------------------------------------------------------
>
>                 Key: HDFS-10729
>                 URL: https://issues.apache.org/jira/browse/HDFS-10729
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.7.2
>            Reporter: Wei-Chiu Chuang
>            Assignee: Wei-Chiu Chuang
>            Priority: Critical
>         Attachments: HDFS-10729.001.patch
>
>
> We encountered a bug where the Standby NameNode crashes due to an NPE when loading edits.
> {noformat}
> 2016-08-05 15:06:00,983 ERROR org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception on operation AddOp [length=0, inodeId=789272719, path=[path], replication=3, mtime=1470379597935, atime=1470379597935, blockSize=134217728, blocks=[], permissions=<user>:supergroup:rw-r--r--, aclEntries=null, clientName=DFSClient_NONMAPREDUCE_1495395702_1, clientMachine=10.210.119.136, overwrite=true, RpcClientId=a1512eeb-65e4-43dc-8aa8-d7a1af37ed30, RpcCallId=417, storagePolicyId=0, opCode=OP_ADD, txid=4212503758]
> java.lang.NullPointerException
>         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.getFileEncryptionInfo(FSDirectory.java:2914)
>         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.createFileStatus(FSDirectory.java:2469)
>         at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:375)
>         at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:230)
>         at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:139)
>         at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:829)
>         at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:810)
>         at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:232)
>         at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:331)
>         at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(EditLogTailer.java:284)
>         at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:301)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:360)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1651)
>         at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:410)
>         at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:297)
> {noformat}
> The NameNode crashes and cannot be restarted.
> After some research, we turned on the debug log for org.apache.hadoop.hdfs.StateChange, restarted the NN, and saw the following exception, which induced the NPE:
> {noformat}
> 16/08/07 18:51:15 DEBUG hdfs.StateChange: DIR* FSDirectory.unprotectedAddFile: exception when add [path] to the file system
> org.apache.hadoop.hdfs.protocol.FSLimitException$MaxDirectoryItemsExceededException: The directory item limit of [path] is exceeded: limit=1048576 items=1049332
>         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.verifyMaxDirItems(FSDirectory.java:2060)
>         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:2112)
>         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addLastINode(FSDirectory.java:2081)
>         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addINode(FSDirectory.java:1900)
>         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedAddFile(FSDirectory.java:368)
>         at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:365)
>         at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:230)
>         at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:139)
>         at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:829)
>         at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:810)
>         at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:232)
>         at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$1.run(EditLogTailer.java:188)
>         at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$1.run(EditLogTailer.java:182)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:415)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
>         at org.apache.hadoop.security.SecurityUtil.doAsUser(SecurityUtil.java:445)
>         at org.apache.hadoop.security.SecurityUtil.doAsLoginUser(SecurityUtil.java:426)
>         at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.catchupDuringFailover(EditLogTailer.java:182)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:1205)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:1762)
>         at org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.enterState(ActiveState.java:61)
>         at org.apache.hadoop.hdfs.server.namenode.ha.HAState.setStateInternal(HAState.java:64)
>         at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.setState(StandbyState.java:49)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToActive(NameNode.java:1635)
>         at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.transitionToActive(NameNodeRpcServer.java:1351)
>         at org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:107)
>         at org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:4460)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1060)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:415)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)
> {noformat}
> The exception is thrown, caught, and logged at debug level in {{FSDirectory#unprotectedAddFile}}. Afterwards, a null is returned, which is then used and subsequently causes the NPE (see the sketch below). HDFS-7567 reported a similar bug, but it was deemed that the NPE could only be thrown if the edits were corrupt. However, here we see an example where the NPE can be thrown without corrupt edits.
> It looks like the maximum number of items per directory was exceeded. This is similar to HDFS-6102 and HDFS-7482, but here it happens when loading edits, rather than when loading the fsimage.
> A possible workaround is to increase the value of {{dfs.namenode.fs-limits.max-directory-items}} to 6400000, but I am not sure if it would cause any side effects.
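> To make the failure mode concrete, below is a minimal, self-contained Java sketch of the pattern described above. The class and method names are simplified stand-ins for illustration only (this is not the actual Hadoop source): a helper catches the limit exception, logs it only at debug level, and returns null, which the caller then dereferences.
> {code:java}
> // Hypothetical stand-ins for the FSDirectory/FSEditLogLoader interaction;
> // names and signatures are simplified and do not match the real classes.
> public class SwallowedExceptionDemo {
>
>     static class MaxDirectoryItemsExceededException extends Exception {
>         MaxDirectoryItemsExceededException(String msg) { super(msg); }
>     }
>
>     // Stand-in for FSDirectory#addINode: the target directory is "full".
>     static Object addINode(String path) throws MaxDirectoryItemsExceededException {
>         throw new MaxDirectoryItemsExceededException(
>                 "The directory item limit of " + path + " is exceeded: limit=1048576");
>     }
>
>     // Stand-in for FSDirectory#unprotectedAddFile: the exception is caught,
>     // logged only at debug level, and null is returned to the caller.
>     static Object unprotectedAddFile(String path) {
>         try {
>             return addINode(path);
>         } catch (MaxDirectoryItemsExceededException e) {
>             // Invisible unless debug logging (hdfs.StateChange) is enabled.
>             System.err.println("DEBUG: exception when add " + path
>                     + " to the file system: " + e.getMessage());
>         }
>         return null; // the caller does not check for null
>     }
>
>     public static void main(String[] args) {
>         Object newNode = unprotectedAddFile("/some/dir/child");
>         // Stand-in for createFileStatus()/getFileEncryptionInfo() using the
>         // missing inode: this dereference throws the NPE that surfaces in the
>         // NameNode log, far away from the real root cause.
>         System.out.println(newNode.toString());
>     }
> }
> {code}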
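> For reference, the workaround would be applied in hdfs-site.xml on both NameNodes and, I believe, requires a NameNode restart to take effect. As far as I can tell, 6400000 is the largest value the NameNode accepts for this limit:
> {code:xml}
> <!-- hdfs-site.xml: raise the per-directory child limit (default 1048576). -->
> <property>
>   <name>dfs.namenode.fs-limits.max-directory-items</name>
>   <value>6400000</value>
> </property>
> {code}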