[ https://issues.apache.org/jira/browse/HDFS-909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830313#action_12830313 ]
Todd Lipcon commented on HDFS-909:
----------------------------------

I think we're talking about different things. I'm discussing the saveFSImage that is called by FSNamesystem.saveNamespace, where you found the race above. FSImage.saveFSImage has this code:

{noformat}
1240       if (dirType.isOfType(NameNodeDirType.IMAGE))
1241         saveFSImage(getImageFile(sd, NameNodeFile.IMAGE_NEW));
1242       if (dirType.isOfType(NameNodeDirType.EDITS)) {
1243         editLog.createEditLogFile(getImageFile(sd, NameNodeFile.EDITS));
1244         File editsNew = getImageFile(sd, NameNodeFile.EDITS_NEW);
1245         if (editsNew.exists())
1246           editLog.createEditLogFile(editsNew);
1247       }
{noformat}

On line 1243 we truncate EDITS. Then, if EDITS_NEW exists, we truncate it on line 1246. All of this happens while the NN is in safe mode, so there shouldn't be any new edits coming in in the first place.

I'm contending that lines 1243 and 1245 should both be deleted. We should always create the image as IMAGE_NEW (line 1241). Touching EDITS seems incorrect - what if the order of storage dirs is EDITS then IMAGE, so we run line 1243, kill our current edit log, and then crash before saving the current image?

(This is possibly orthogonal to the issue you raised - even though there are no edits, there can be an ongoing sync.
I think the NN should call editLog.waitForSyncToFinish before entering safemode to avoid this issue.)

> Race condition between rollEditLog or rollFSImage ant FSEditsLog.write
> operations corrupts edits log
> -----------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-909
>                 URL: https://issues.apache.org/jira/browse/HDFS-909
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 0.20.1, 0.20.2, 0.21.0, 0.22.0
>         Environment: CentOS
>            Reporter: Cosmin Lehene
>            Assignee: Todd Lipcon
>            Priority: Blocker
>             Fix For: 0.21.0, 0.22.0
>
>         Attachments: hdfs-909-unittest.txt, hdfs-909.txt, hdfs-909.txt, hdfs-909.txt
>
>
> Closing the edits log file can race with a write to the edits log file,
> resulting in the OP_INVALID end-of-file marker first being overwritten by the
> concurrent threads (in setReadyToFlush) and then removed twice from the
> buffer, losing a good byte from the edits log.
> Example:
> {code}
> FSNameSystem.rollEditLog() -> FSEditLog.divertFileStreams() ->
> FSEditLog.closeStream() -> EditLogOutputStream.setReadyToFlush()
> FSNameSystem.rollEditLog() -> FSEditLog.divertFileStreams() ->
> FSEditLog.closeStream() -> EditLogOutputStream.flush() ->
> EditLogFileOutputStream.flushAndSync()
> OR
> FSNameSystem.rollFSImage() -> FSImage.rollFSImage() ->
> FSEditLog.purgeEditLog() -> FSEditLog.revertFileStreams() ->
> FSEditLog.closeStream() -> EditLogOutputStream.setReadyToFlush()
> FSNameSystem.rollFSImage() -> FSImage.rollFSImage() ->
> FSEditLog.purgeEditLog() -> FSEditLog.revertFileStreams() ->
> FSEditLog.closeStream() -> EditLogOutputStream.flush() ->
> EditLogFileOutputStream.flushAndSync()
> VERSUS
> FSNameSystem.completeFile -> FSEditLog.logSync() ->
> EditLogOutputStream.setReadyToFlush()
> FSNameSystem.completeFile -> FSEditLog.logSync() ->
> EditLogOutputStream.flush() -> EditLogFileOutputStream.flushAndSync()
> OR
> Any FSEditLog.write
> {code}
> Access to the edits flush operations is synchronized only at the
> FSEditLog.logSync() method level. At the lower level, however, access to
> EditLogOutputStream setReadyToFlush(), flush(), and flushAndSync() is NOT
> synchronized, so these can be called from concurrent threads as in the
> example above.
> So if a rollEditLog or rollFSImage happens at the same time as a write
> operation, the two can race on EditLogFileOutputStream.setReadyToFlush, which
> will overwrite the last byte (normally FSEditLog.OP_INVALID, the
> "end-of-file marker") and then remove it twice (once from each thread) in
> flushAndSync()! Hence a valid byte will be missing from the edits log,
> leading to a silent SecondaryNameNode failure and a full HDFS failure upon
> cluster restart.
> We got to this point after investigating a corrupted edits file that made
> HDFS unable to start with:
> {code:title=namenode.log}
> java.io.IOException: Incorrect data format. logVersion is -20 but
> writables.length is 768.
>         at
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadEditRecords(FSEditLog.java:450
> {code}
> EDIT: moved the logs to a comment to make this readable

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
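The double-remove described above can be illustrated in isolation. This is NOT Hadoop code - it is a minimal, self-contained sketch with invented names (RaceSketch, write, setReadyToFlush are simplified stand-ins for the EditLogOutputStream methods), showing how two unsynchronized callers that each strip the trailing OP_INVALID marker end up discarding a valid edit byte:

```java
import java.io.ByteArrayOutputStream;

// Illustrative sketch only: models the unsynchronized setReadyToFlush()
// double-remove described in the issue, not the real EditLogOutputStream.
class RaceSketch {
    static final byte OP_INVALID = -1;               // end-of-file marker
    static ByteArrayOutputStream buf = new ByteArrayOutputStream();

    // Stand-in for logging an edit op into the in-memory buffer.
    static void write(byte op) { buf.write(op); }

    // Stand-in for setReadyToFlush(): strips what it assumes is the
    // trailing OP_INVALID marker. Because it is not synchronized, two
    // interleaved callers each strip one byte.
    static void setReadyToFlush() {
        byte[] b = buf.toByteArray();
        buf.reset();
        buf.write(b, 0, b.length - 1);               // drop the last byte
    }

    public static void main(String[] args) {
        write((byte) 9);                             // a valid edit op
        write((byte) 9);                             // another valid op
        write(OP_INVALID);                           // marker terminates the log
        // A roll thread and a logSync thread both reach setReadyToFlush():
        setReadyToFlush();                           // removes the marker (correct)
        setReadyToFlush();                           // removes a VALID edit byte (corruption)
        System.out.println("bytes left: " + buf.size()); // prints 1, should be 2
    }
}
```

With synchronization (or the waitForSyncToFinish barrier suggested in the comment), the second caller could never observe the buffer after the marker was already stripped.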