[ https://issues.apache.org/jira/browse/HDFS-4596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Aaron T. Myers updated HDFS-4596:
---------------------------------

       Resolution: Fixed
    Fix Version/s: 2.0.5-beta
     Hadoop Flags: Reviewed
           Status: Resolved  (was: Patch Available)

I've just committed this to trunk and branch-2.

Thanks a lot for the contribution, Andrew.

> Shutting down namenode during checkpointing can lead to md5sum error
> --------------------------------------------------------------------
>
>                 Key: HDFS-4596
>                 URL: https://issues.apache.org/jira/browse/HDFS-4596
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 3.0.0, 2.0.4-alpha
>            Reporter: Andrew Wang
>            Assignee: Andrew Wang
>             Fix For: 2.0.5-beta
>
>         Attachments: hdfs-4596-1.patch
>
>
> This is a really rare error that can hit if a NN shutdown happens during the
> checkpointing process.
> Checkpointing and restarting nominally looks like this:
> # FSImage is written to a tmp file and then renamed
> # MD5 file is written to a tmp file and then renamed
> # NN is killed and restarted
> # NN scans storage directories and picks up the renamed image file
> # NN validates that the image file matches its md5 file
> If the NN is killed before step 2 completes, this is what happens:
> # FSImage is written to a tmp file and then renamed
> # NN is killed and restarted (no MD5 file!)
> # NN scans storage directories and picks up the renamed image file
> # Since there's no matching MD5 file, NN errors out with a checksum error
> I think we can fix this by inverting the order of writing the image then md5,
> or inverting the order of reading the image then md5.
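
The first fix suggested in the description (write the MD5 before the image becomes visible) can be sketched as follows. This is a minimal illustration of the ordering only, not the committed patch: the class and method names (CheckpointOrderSketch, saveCheckpoint, isLoadable, md5PathFor, atomicWrite), the ".md5" and ".tmp" suffixes, and the bare-hex MD5 file format are all assumptions made for the example, and it assumes every checkpoint is written under a fresh image file name.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch: names and file layout are illustrative, not HDFS code.
class CheckpointOrderSketch {

    // Sidecar path for an image, e.g. "fsimage_42" -> "fsimage_42.md5" (assumed suffix).
    static Path md5PathFor(Path image) {
        return image.resolveSibling(image.getFileName() + ".md5");
    }

    static String md5Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(data);
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Write to a tmp file, then rename into place so readers never see a partial file.
    // A production version would also fsync before the rename; omitted here.
    static void atomicWrite(Path target, byte[] data) throws IOException {
        Path tmp = target.resolveSibling(target.getFileName() + ".tmp");
        Files.write(tmp, data);
        Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
    }

    // Inverted write order: MD5 first, image second.
    static void saveCheckpoint(Path image, byte[] imageBytes) throws IOException {
        // Step 1: publish the MD5. A crash after this leaves an orphan MD5 file
        // but no image under its final name, so a restart will not pick it up.
        atomicWrite(md5PathFor(image), md5Hex(imageBytes).getBytes(StandardCharsets.UTF_8));
        // Step 2: publish the image. Once it is visible, its MD5 already exists.
        atomicWrite(image, imageBytes);
    }

    // Restart-side check: an image is usable only if its MD5 file exists and matches.
    static boolean isLoadable(Path image) throws IOException {
        Path md5 = md5PathFor(image);
        if (!Files.exists(image) || !Files.exists(md5)) {
            return false;
        }
        String expected = new String(Files.readAllBytes(md5), StandardCharsets.UTF_8).trim();
        return expected.equals(md5Hex(Files.readAllBytes(image)));
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("checkpoint-sketch");
        Path image = dir.resolve("fsimage_0000000000000000042");
        saveCheckpoint(image, "example namespace".getBytes(StandardCharsets.UTF_8));
        System.out.println("loadable: " + isLoadable(image));
    }
}
```

With this ordering, a kill between the two steps leaves only a stray MD5 file, which a restart can ignore or clean up. The original ordering leaves a visible image with no MD5, which is the checksum error described above.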