[ https://issues.apache.org/jira/browse/HDFS-4596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13603342#comment-13603342 ]
Hudson commented on HDFS-4596:
------------------------------

Integrated in Hadoop-Hdfs-trunk #1345 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1345/])
HDFS-4596. Shutting down namenode during checkpointing can lead to md5sum error. Contributed by Andrew Wang. (Revision 1456630)

     Result = FAILURE
atm : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1456630
Files :
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/CheckpointFaultInjector.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSImage.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestCheckpoint.java


> Shutting down namenode during checkpointing can lead to md5sum error
> --------------------------------------------------------------------
>
>                 Key: HDFS-4596
>                 URL: https://issues.apache.org/jira/browse/HDFS-4596
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 3.0.0, 2.0.4-alpha
>            Reporter: Andrew Wang
>            Assignee: Andrew Wang
>             Fix For: 2.0.5-beta
>
>         Attachments: hdfs-4596-1.patch
>
>
> This is a really rare error that can hit if a NN shutdown happens during the
> checkpointing process.
> Checkpointing and restarting nominally looks like this:
> # FSImage is written to a tmp file and then renamed
> # MD5 file is written to a tmp file and then renamed
> # NN is killed and restarted
> # NN scans storage directories and picks up the renamed image file
> # NN validates that the image file matches its md5 file
> If the NN is killed before step 2 completes, this is what happens:
> # FSImage is written to a tmp file and then renamed
> # NN is killed and restarted (no MD5 file!)
> # NN scans storage directories and picks up the renamed image file
> # Since there's no matching MD5 file, NN errors out with a checksum error
> I think we can fix this by inverting the order of writing the image then md5,
> or inverting the order of reading the image then md5.
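
For readers following the ordering problem described in the issue, here is a minimal, self-contained sketch of the first suggested fix (write the md5 file before the image becomes visible). It is not the committed patch and does not use any HDFS classes; the class name CheckpointOrderSketch, the methods saveImage and isValid, and the ".tmp"/".md5" suffixes are all hypothetical names chosen for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical illustration only; not code from the HDFS-4596 patch. */
public class CheckpointOrderSketch {

    /** Returns the hex-encoded MD5 digest of the given bytes. */
    static String md5Hex(byte[] data) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("MD5").digest(data);
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    /** Writes data to a sibling tmp file, then renames it to dst. */
    static void writeThenRename(Path dst, byte[] data) throws IOException {
        Path tmp = dst.resolveSibling(dst.getFileName() + ".tmp");
        Files.write(tmp, data);
        Files.move(tmp, dst, StandardCopyOption.ATOMIC_MOVE);
    }

    /**
     * Saves an image with the inverted order: md5 file first, image second.
     * A crash between the two steps leaves an orphan md5 file, but no image
     * file for a later directory scan to pick up.
     */
    static void saveImage(Path image, byte[] data)
            throws IOException, NoSuchAlgorithmException {
        Path md5 = image.resolveSibling(image.getFileName() + ".md5");
        writeThenRename(md5, md5Hex(data).getBytes(StandardCharsets.UTF_8));
        writeThenRename(image, data);
    }

    /** Startup check: the image is valid only if its md5 file exists and matches. */
    static boolean isValid(Path image)
            throws IOException, NoSuchAlgorithmException {
        Path md5 = image.resolveSibling(image.getFileName() + ".md5");
        if (!Files.exists(image) || !Files.exists(md5)) {
            return false;
        }
        String expected =
                new String(Files.readAllBytes(md5), StandardCharsets.UTF_8).trim();
        return expected.equals(md5Hex(Files.readAllBytes(image)));
    }
}
```

With this ordering, any image file that a restart can see already has its md5 file on disk, so the "image present, md5 missing" state from the failure scenario cannot arise from a kill between the two renames. The cost is that an interrupted save can leave a stray md5 file behind, which a real implementation would need to clean up or ignore.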