[ 
https://issues.apache.org/jira/browse/HDFS-909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829262#action_12829262
 ] 

Todd Lipcon commented on HDFS-909:
----------------------------------

Hi Konstantin. Thanks for the in depth exploration of that race.

Let me ask a question - sorry if I'm being dense here :)

In FSImage.saveFSImage(boolean) (FSImage.java:1243) we create a new "current" 
edit log file before rolling the FSImage.
This seems to be related to the race, and I don't quite understand the design 
here. It seems to me that the correct operation
of saveFSImage should be:

1. Dump the image in IMAGE_NEW
2. Create a new empty edit log in EDITS_NEW (while leaving EDITS alone)
3. Call purgeEditLogs which rolls the new image/edits pair in.

Analyzing the failure cases:
- During step 1, if the NN restarts, it can recover from the preexisting 
IMAGE/EDITS as if the saveFSImage never started
- Between step 1 and 2, it can recover from either the preexisting IMAGE/EDITS 
or the IMAGE_NEW.
IMAGE_NEW is consistent and up to date because saveFSImage is synchronized, and 
thus it represents the exact state
of the namesystem.
- The same goes for step 2 - the empty edit log is correct against the new log. 
If we decide to recover from current instead
of NEW at this point it would also be correct.
- We must assume that step 3 performs rolling in a way that allows recovery - 
this is what allows rolling to work in general
safely.

In the implementation as I understand it, step 2 is implemented differently 
such that it calls createEditLogFile on EDITS, not
just EDITS_NEW. This truncates the edits, which means we can only recover from 
the NEW image at this point. This makes
for tricky recovery logic, since we have an old image with truncated edit log, 
plus a new image (which we don't have any great way
of knowing is complete). Additionally, I don't think there's a guarantee that 
steps 1 and 2 happen in that order - the order of
saveFSImage is determined by the order of the storage directories in the 
directory iterator. It seems like edits _usually_
come last, but in the case of failed/recovered storage directories, I don't 
think it's always true. If the order of step 1 and 2 are
inverted, it could truncate the current edits log before saving the new image, 
and thus end up losing all those edits if the
NN failed in between.

Am I misunderstanding something? The simple fix to the issue you suggested 
above is to add a waitForSyncToFinish in
createEditLogFile, but I'm worried there may be more subtle issues as described 
above.

Thanks!

> Race condition between rollEditLog or rollFSImage ant FSEditsLog.write 
> operations  corrupts edits log
> -----------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-909
>                 URL: https://issues.apache.org/jira/browse/HDFS-909
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 0.20.1, 0.20.2, 0.21.0, 0.22.0
>         Environment: CentOS
>            Reporter: Cosmin Lehene
>            Assignee: Todd Lipcon
>            Priority: Blocker
>             Fix For: 0.21.0, 0.22.0
>
>         Attachments: hdfs-909-unittest.txt, hdfs-909.txt, hdfs-909.txt, 
> hdfs-909.txt
>
>
> closing the edits log file can race with write to edits log file operation 
> resulting in OP_INVALID end-of-file marker being initially overwritten by the 
> concurrent (in setReadyToFlush) threads and then removed twice from the 
> buffer, losing a good byte from edits log.
> Example:
> {code}
> FSNameSystem.rollEditLog() -> FSEditLog.divertFileStreams() -> 
> FSEditLog.closeStream() -> EditLogOutputStream.setReadyToFlush()
> FSNameSystem.rollEditLog() -> FSEditLog.divertFileStreams() -> 
> FSEditLog.closeStream() -> EditLogOutputStream.flush() -> 
> EditLogFileOutputStream.flushAndSync()
> OR
> FSNameSystem.rollFSImage() -> FSIMage.rollFSImage() -> 
> FSEditLog.purgeEditLog() -> FSEditLog.revertFileStreams() -> 
> FSEditLog.closeStream() ->EditLogOutputStream.setReadyToFlush() 
> FSNameSystem.rollFSImage() -> FSIMage.rollFSImage() -> 
> FSEditLog.purgeEditLog() -> FSEditLog.revertFileStreams() -> 
> FSEditLog.closeStream() ->EditLogOutputStream.flush() -> 
> EditLogFileOutputStream.flushAndSync()
> VERSUS
> FSNameSystem.completeFile -> FSEditLog.logSync() -> 
> EditLogOutputStream.setReadyToFlush()
> FSNameSystem.completeFile -> FSEditLog.logSync() -> 
> EditLogOutputStream.flush() -> EditLogFileOutputStream.flushAndSync()
> OR 
> Any FSEditLog.write
> {code}
> Access on the edits flush operations is synchronized only in the 
> FSEdits.logSync() method level. However at a lower level access to 
> EditsLogOutputStream setReadyToFlush(), flush() or flushAndSync() is NOT 
> synchronized. These can be called from concurrent threads like in the example 
> above
> So if a rollEditLog or rollFSIMage is happening at the same time with a write 
> operation it can race for EditLogFileOutputStream.setReadyToFlush that will 
> overwrite the the last byte (normally the FSEditsLog.OP_INVALID which is the 
> "end-of-file marker") and then remove it twice (from each thread) in 
> flushAndSync()! Hence there will be a valid byte missing from the edits log 
> that leads to a SecondaryNameNode silent failure and a full HDFS failure upon 
> cluster restart. 
> We got to this point after investigating a corrupted edits file that made 
> HDFS unable to start with 
> {code:title=namenode.log}
> java.io.IOException: Incorrect data format. logVersion is -20 but 
> writables.length is 768. 
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadEditRecords(FSEditLog.java:450
> {code}
> EDIT: moved the logs to a comment to make this readable

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to