[ https://issues.apache.org/jira/browse/HDFS-909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829262#action_12829262 ]
Todd Lipcon commented on HDFS-909:
----------------------------------

Hi Konstantin. Thanks for the in-depth exploration of that race. Let me ask a question - sorry if I'm being dense here :)

In FSImage.saveFSImage(boolean) (FSImage.java:1243) we create a new "current" edit log file before rolling the FSImage. This seems to be related to the race, and I don't quite understand the design here. It seems to me that the correct operation of saveFSImage should be (sketched below):

1. Dump the image into IMAGE_NEW.
2. Create a new empty edit log in EDITS_NEW (while leaving EDITS alone).
3. Call purgeEditLogs, which rolls the new image/edits pair in.

Analyzing the failure cases:
- During step 1, if the NN restarts, it can recover from the preexisting IMAGE/EDITS as if the saveFSImage never started.
- Between steps 1 and 2, it can recover from either the preexisting IMAGE/EDITS or from IMAGE_NEW. IMAGE_NEW is consistent and up to date because saveFSImage is synchronized, and thus it represents the exact state of the namesystem.
- The same goes for step 2 - the empty edit log is correct against the new image. If we decided to recover from current instead of NEW at this point, that would also be correct.
- We must assume that step 3 performs the roll in a way that allows recovery - this is what allows rolling to work safely in general.

In the implementation as I understand it, step 2 is done differently: it calls createEditLogFile on EDITS, not just EDITS_NEW. This truncates the edits, which means we can only recover from the NEW image at this point. That makes for tricky recovery logic, since we have an old image with a truncated edit log, plus a new image (which we have no great way of knowing is complete).

Additionally, I don't think there's any guarantee that steps 1 and 2 happen in that order - the order within saveFSImage is determined by the order of the storage directories in the directory iterator. Edits directories _usually_ come last, but in the case of failed/recovered storage directories I don't think that's always true. If steps 1 and 2 are inverted, saveFSImage could truncate the current edits log before saving the new image, and thus lose all of those edits if the NN failed in between.

Am I misunderstanding something? The simple fix you suggested above is to add a waitForSyncToFinish call in createEditLogFile (sketched further below), but I'm worried there may be more subtle issues, as described above. Thanks!
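To make the ordering concrete, here's a rough sketch of steps 1-3. The helper methods (dumpNamespace, createEmptyEditLog, rollNewPairIn) are placeholders I made up, not the real FSImage/FSEditLog API:

{code:java}
import java.io.File;
import java.io.IOException;

// Sketch only: the private helpers below stand in for the real
// FSImage/FSEditLog machinery and are not actual Hadoop methods.
class SaveImageOrderingSketch {
  private final File imageNew = new File("current/fsimage.ckpt"); // IMAGE_NEW
  private final File editsNew = new File("current/edits.new");    // EDITS_NEW

  // synchronized, so the dumped image is an exact snapshot of the namesystem
  synchronized void saveFSImage() throws IOException {
    // Step 1: dump the namespace into IMAGE_NEW. A crash here leaves the
    // preexisting IMAGE/EDITS untouched, as if the save never started.
    dumpNamespace(imageNew);

    // Step 2: create an empty EDITS_NEW, leaving the current EDITS alone.
    // Either (IMAGE, EDITS) or (IMAGE_NEW, EDITS_NEW) is now a consistent
    // recovery point; crucially, EDITS is never truncated early.
    createEmptyEditLog(editsNew);

    // Step 3: roll the new pair in - this is what purgeEditLogs is trusted
    // to do in a recoverable way.
    rollNewPairIn();
  }

  private void dumpNamespace(File image) throws IOException { /* placeholder */ }
  private void createEmptyEditLog(File edits) throws IOException { /* placeholder */ }
  private void rollNewPairIn() throws IOException { /* placeholder */ }
}
{code}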
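And for reference, the simple fix would look roughly like this - the method and field names are modeled on FSEditLog but simplified, so treat it as a sketch rather than a patch:

{code:java}
import java.io.File;
import java.io.IOException;

// Sketch of the "wait before truncating" fix; simplified from FSEditLog.
class CreateEditLogFixSketch {
  private boolean isSyncRunning = false; // set by logSync() around the flush

  // As in FSEditLog.waitForSyncToFinish(): block until no sync is in flight.
  private synchronized void waitForSyncToFinish() {
    while (isSyncRunning) {
      try {
        wait(1000);
      } catch (InterruptedException ie) {
        // loop and re-check
      }
    }
  }

  void logSync() throws IOException {
    synchronized (this) { isSyncRunning = true; }
    try {
      // ... setReadyToFlush() + flush() happen here in the real code ...
    } finally {
      synchronized (this) { isSyncRunning = false; notifyAll(); }
    }
  }

  synchronized void createEditLogFile(File name) throws IOException {
    // Don't truncate/recreate the file while another thread is mid-flush,
    // otherwise its flushAndSync() races with the truncation.
    waitForSyncToFinish();
    // ... open `name` for write, truncate, write the version header ...
  }
}
{code}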
> Race condition between rollEditLog or rollFSImage and FSEditLog.write
> operations corrupts edits log
> ----------------------------------------------------------------------
>
>                 Key: HDFS-909
>                 URL: https://issues.apache.org/jira/browse/HDFS-909
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 0.20.1, 0.20.2, 0.21.0, 0.22.0
>         Environment: CentOS
>            Reporter: Cosmin Lehene
>            Assignee: Todd Lipcon
>            Priority: Blocker
>             Fix For: 0.21.0, 0.22.0
>
>         Attachments: hdfs-909-unittest.txt, hdfs-909.txt, hdfs-909.txt, hdfs-909.txt
>
>
> Closing the edits log file can race with a write to the edits log file, resulting in the OP_INVALID end-of-file marker first being overwritten by the concurrent threads (in setReadyToFlush) and then removed twice from the buffer, losing a good byte from the edits log.
> Example:
> {code}
> FSNameSystem.rollEditLog() -> FSEditLog.divertFileStreams() ->
>     FSEditLog.closeStream() -> EditLogOutputStream.setReadyToFlush()
> FSNameSystem.rollEditLog() -> FSEditLog.divertFileStreams() ->
>     FSEditLog.closeStream() -> EditLogOutputStream.flush() ->
>     EditLogFileOutputStream.flushAndSync()
> OR
> FSNameSystem.rollFSImage() -> FSImage.rollFSImage() ->
>     FSEditLog.purgeEditLog() -> FSEditLog.revertFileStreams() ->
>     FSEditLog.closeStream() -> EditLogOutputStream.setReadyToFlush()
> FSNameSystem.rollFSImage() -> FSImage.rollFSImage() ->
>     FSEditLog.purgeEditLog() -> FSEditLog.revertFileStreams() ->
>     FSEditLog.closeStream() -> EditLogOutputStream.flush() ->
>     EditLogFileOutputStream.flushAndSync()
> VERSUS
> FSNameSystem.completeFile() -> FSEditLog.logSync() ->
>     EditLogOutputStream.setReadyToFlush()
> FSNameSystem.completeFile() -> FSEditLog.logSync() ->
>     EditLogOutputStream.flush() -> EditLogFileOutputStream.flushAndSync()
> OR
> any FSEditLog.write
> {code}
> Access to the edits flush operations is synchronized only at the FSEditLog.logSync() level. At a lower level, access to EditLogOutputStream.setReadyToFlush(), flush() and flushAndSync() is NOT synchronized; these can be called from concurrent threads as in the example above.
> So if a rollEditLog or rollFSImage happens at the same time as a write operation, the two can race in EditLogFileOutputStream.setReadyToFlush (see the model sketched below): the marker write overwrites the last byte (normally FSEditLog.OP_INVALID, the "end-of-file marker"), and then each thread removes it once in flushAndSync() - twice in total. Hence a valid byte goes missing from the edits log, which leads to a silent SecondaryNameNode failure and a full HDFS failure upon cluster restart.
> We got to this point after investigating a corrupted edits file that made HDFS unable to start with:
> {code:title=namenode.log}
> java.io.IOException: Incorrect data format. logVersion is -20 but writables.length is 768.
>         at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadEditRecords(FSEditLog.java:450)
> {code}
> EDIT: moved the logs to a comment to make this readable
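To make the quoted interleaving concrete, here is a deterministic, single-threaded model of the race. The buffer handling and the one-byte skip-back are modeled loosely on EditLogFileOutputStream; the lost update on the buffer position is an assumption about how the concurrent marker writes collide, not the verbatim code:

{code:java}
// NOT the real EditLogFileOutputStream: this replays, in one thread, the
// interleaving in which two unsynchronized callers hit setReadyToFlush()
// and then flushAndSync().
public class EditLogRaceModel {
  static final byte OP_INVALID = -1; // end-of-file marker
  static final byte OP_ADD = 0;      // stands in for any valid opcode

  public static void main(String[] args) {
    byte[] buf = new byte[8]; // the shared edits buffer
    int count = 0;            // unsynchronized write position into buf

    buf[count++] = OP_ADD;    // one real edit sits in the buffer

    // Both the roll path (closeStream) and logSync enter setReadyToFlush()
    // concurrently. With no lock, each reads the same position before
    // either increments it:
    int n = count;
    buf[n] = OP_INVALID;      // thread 1 appends the marker
    buf[n] = OP_INVALID;      // thread 2 overwrites the SAME slot
    count = n + 1;            // lost update: one marker, written twice

    // Each thread's flushAndSync() writes the buffer to the file channel
    // and then skips back one byte to drop the marker - so the skip-back
    // happens twice while only one marker was ever written:
    int filePos = count;      // after writing buf[0..count) to the channel
    filePos -= 1;             // thread 1: fc.position(fc.position() - 1)
    filePos -= 1;             // thread 2: removes a VALID byte, not a marker

    // filePos now points at OP_ADD: the next edit overwrites it, and replay
    // later fails with "Incorrect data format" as in the namenode.log above.
    System.out.println("bytes written=" + count + ", file position=" + filePos);
  }
}
{code}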