Xiao Chen created HDFS-12369:
--------------------------------

             Summary: Edit log corruption due to hard lease recovery of not-closed file
                 Key: HDFS-12369
                 URL: https://issues.apache.org/jira/browse/HDFS-12369
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: namenode
            Reporter: Xiao Chen
            Assignee: Xiao Chen

HDFS-6257 and HDFS-7707 worked hard to prevent corruption from combinations of client operations.

Recently, we have observed a NameNode (NN) failing to start with the following exception:

{noformat}
2017-08-17 14:32:18,418 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: Failed to start namenode.
java.io.FileNotFoundException: File does not exist: /home/Events/CancellationSurvey_MySQL/2015/12/31/.part-00000.9nlJ3M
        at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:66)
        at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:56)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:429)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:232)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:141)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:897)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:750)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:318)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1125)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:789)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:614)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:676)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:844)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:823)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1547)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1615)
{noformat}

Quoting a careful analysis of the edits:

{quote}
In the edits logged about 1 hour later, we see this failing OP_CLOSE.
The sequence in the edits shows the file going through:

  OPEN
  ADD_BLOCK
  CLOSE
  ADD_BLOCK # perhaps this was an append
  DELETE
  (about 1 hour later) CLOSE

It is interesting that there was no CLOSE logged before the delete.
{quote}

Grepping for that file name shows that the close was triggered by the lease reaching its hard limit.

{noformat}
2017-08-16 15:05:45,927 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering [Lease. Holder: DFSClient_NONMAPREDUCE_-1997177597_28, pending creates: 75], src=/home/Events/CancellationSurvey_MySQL/2015/12/31/.part-00000.9nlJ3M
2017-08-16 15:05:45,927 WARN org.apache.hadoop.hdfs.StateChange: BLOCK* internalReleaseLease: All existing blocks are COMPLETE, lease removed, file /home/Events/CancellationSurvey_MySQL/2015/12/31/.part-00000.9nlJ3M closed.
{noformat}
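
To illustrate why the quoted operation sequence cannot be replayed, here is a minimal sketch of edit replay against a toy namespace. It is not the real FSEditLogLoader: the class EditReplaySketch, the replay method, and the example path are hypothetical, and the rule that ADD_BLOCK and CLOSE require the file to exist is a simplifying assumption modelled on the INodeFile.valueOf failure in the stack trace.

```java
import java.io.FileNotFoundException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of edit log replay for a single file; NOT the real FSEditLogLoader. */
final class EditReplaySketch {

    /**
     * Replays the given ops, in order, against an initially empty toy namespace.
     * Assumption: ADD_BLOCK and CLOSE must find the file, as the real loader's
     * INodeFile.valueOf lookup does in the stack trace above.
     */
    static void replay(String path, List<String> ops) throws FileNotFoundException {
        Set<String> namespace = new HashSet<>();
        for (String op : ops) {
            switch (op) {
                case "OPEN":
                    namespace.add(path);      // file now exists
                    break;
                case "DELETE":
                    namespace.remove(path);   // file is gone from the namespace
                    break;
                case "ADD_BLOCK":
                case "CLOSE":
                    if (!namespace.contains(path)) {
                        throw new FileNotFoundException("File does not exist: " + path);
                    }
                    break;
                default:
                    throw new IllegalArgumentException("Unknown op: " + op);
            }
        }
    }

    public static void main(String[] args) {
        // The sequence quoted from the edits: the final CLOSE follows the DELETE.
        List<String> ops = Arrays.asList(
                "OPEN", "ADD_BLOCK", "CLOSE", "ADD_BLOCK", "DELETE", "CLOSE");
        try {
            replay("/example/.part-00000", ops);  // hypothetical path
            System.out.println("replay succeeded");
        } catch (FileNotFoundException e) {
            System.out.println("replay failed: " + e.getMessage());
        }
    }
}
```

In this toy model the trailing CLOSE is the op that fails, which matches the observation that the CLOSE written by hard-limit lease recovery landed in the edits after the DELETE.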