Xiaoqiao He created HDFS-15746:
----------------------------------

             Summary: Standby NameNode crash when replay editlog
                 Key: HDFS-15746
                 URL: https://issues.apache.org/jira/browse/HDFS-15746
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: namenode
            Reporter: Xiaoqiao He
            Assignee: Xiaoqiao He


Standby NameNode meet NPE and crash when replay editlog, After dig log and 
source code, Not found the root cause. But some information may be useful for 
this case.
a. before SBN crash, ANN do one lease recovery.
{code:java}
2020-12-23 12:37:45,946 WARN org.apache.hadoop.hdfs.StateChange: DIR* 
NameSystem.internalReleaseLease: $PATH has not been closed. Lease recovery is 
in progress. RecoveryId = 21696709510 for block blk_*_21658833701
{code}
b. then one Datanode Volumn failed which manage one replica of 
blk_*_21658833701 after lease recovery.
c. after half one hour, SBN crash because NPE as following.
{code:java}
2020-12-23 13:13:36,703 ERROR 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception 
on operation CloseOp [length=0, inodeId=0, path=$PATH, replication=3, 
mtime=1608698268201, atime=1608343529481, blockSize=268435456, 
blocks=[blk_$i_$j], permissions=user:group:rw-r--r--, aclEntries=null, 
clientName=, clientMachine=, overwrite=false, storagePolicyId=0, 
opCode=OP_CLOSE, txid=$txid]
java.lang.NullPointerException
        at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockInfo.setGenerationStampAndVerifyReplicas(BlockInfo.java:455)
        at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockInfo.commitBlock(BlockInfo.java:476)
        at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.forceCompleteBlock(BlockManager.java:1248)
        at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.updateBlocks(FSEditLogLoader.java:1065)
        at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:442)
        at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:244)
        at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:152)
        at 
org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:843)
        at 
org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:824)
        at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:232)
        at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:331)
        at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(EditLogTailer.java:284)
        at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:301)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:360)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1706)
        at 
org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:428)
        at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:297)
2020-12-23 13:13:36,703 ERROR org.apache.hadoop.ipc.Server: Error in Reader
java.nio.channels.ClosedChannelException
        at 
java.nio.channels.spi.AbstractSelectableChannel.register(AbstractSelectableChannel.java:197)
        at 
org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:1053)
        at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:1034)
2020-12-23 13:13:36,703 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap 
updated: 10.16.39.26:50010 is added to blk_22374572883_21672067156 size 58762255
2020-12-23 13:13:36,704 FATAL 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Unknown error 
encountered while tailing edits. Shutting down standby NN.
java.io.IOException: java.lang.NullPointerException
        at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:254)
        at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:152)
        at 
org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:843)
        at 
org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:824)
        at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:232)
        at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:331)
        at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(EditLogTailer.java:284)
        at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:301)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:360)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1706)
        at 
org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:428)
        at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:297)
Caused by: java.lang.NullPointerException
        at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockInfo.setGenerationStampAndVerifyReplicas(BlockInfo.java:455)
        at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockInfo.commitBlock(BlockInfo.java:476)
        at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.forceCompleteBlock(BlockManager.java:1248)
        at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.updateBlocks(FSEditLogLoader.java:1065)
        at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:442)
        at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:244)
        ... 12 more
{code}

Not very clear about the relation between [lease recovery/volumn failed/sbn 
crash], but I think we should catch null when remove stale Replicas to avoid 
this fatal. 
Our production version is 2.*, and IMO this issue also exist at trunk.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org

Reply via email to