[ https://issues.apache.org/jira/browse/HDFS-955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832706#action_12832706 ]

Todd Lipcon commented on HDFS-955:
----------------------------------

I worked through this a bit last night. Here are some options for a solution.

h3. 1. Add "undo log" file to storage directory

In this solution, we add a new file called "undolog" in each storage directory. 
Whenever we're in the midst of a transition, we write some bit of data in this 
file that explains what the proper rollback procedure is. Thus, for the 
checkpoint from the checkpoint node, we'd write a file that says "if IMAGE_NEW 
is complete, use IMAGE_NEW + EDITS_NEW. Otherwise use IMAGE + EDITS + 
EDITS_NEW". For the saveNamespace operation, we'd write "If IMAGE_NEW is 
complete, use IMAGE_NEW. Otherwise use IMAGE + EDITS + EDITS_NEW".

This has the advantage of making the recovery choices explicit during all state 
transitions - we're forced to think carefully after each step of the operation 
in order to maintain the undo instructions.

On the downside, it adds complexity.
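To make the idea concrete, here's a minimal sketch of what the undolog records and the recovery-time parse might look like. Everything here is illustrative - the class, the record format, and the field names are hypothetical, not an existing HDFS API:

```java
// Illustrative sketch of option 1. The record format and all names here are
// hypothetical, not an existing HDFS API.
class UndoLog {
  // Record written before a checkpoint upload begins: "if IMAGE_NEW is
  // complete, use IMAGE_NEW + EDITS_NEW; otherwise IMAGE + EDITS + EDITS_NEW".
  static String checkpointRecord() {
    return "IF_COMPLETE=IMAGE_NEW:USE=IMAGE_NEW,EDITS_NEW"
         + ";ELSE:USE=IMAGE,EDITS,EDITS_NEW";
  }

  // Record written before saveNamespace rewrites the image in place.
  static String saveNamespaceRecord() {
    return "IF_COMPLETE=IMAGE_NEW:USE=IMAGE_NEW"
         + ";ELSE:USE=IMAGE,EDITS,EDITS_NEW";
  }

  // On startup, recovery reads the record and picks the file set to load,
  // depending on whether IMAGE_NEW was written out completely.
  static String[] recover(String record, boolean imageNewComplete) {
    String[] halves = record.split(";ELSE:USE=");
    String chosen = imageNewComplete
        ? halves[0].substring(halves[0].indexOf(":USE=") + 5)
        : halves[1];
    return chosen.split(",");
  }
}
```

The point of the record is that each transition's author, not the recovery code, decides the rollback rule, which is what forces the careful thinking mentioned above.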

h3. 2. Don't allow -saveNamespace when the logs are in ROLLED state

I don't like this one at all, but it would allow us to always use the IMAGE_NEW 
+ EDITS_NEW recovery.

h3. 3. Redesign rolling to not reuse filenames

This is a much bigger change, but I think it would also help simplify a lot of 
the code. The proposal here is to manage edit logs in a way that's similar to 
what MySQL does. Specifically, instead of IMAGE and IMAGE_NEW plus EDITS and 
EDITS_NEW, we simply have a monotonically increasing identifier on each log 
file. So the state of the system starts with image_0 and edits_0. Logs may be 
rolled at any point; each roll closes edits_N and opens edits_(N+1). In normal 
operation we'd see:

image_0
edits_0 <- writing here

[roll edits]
image_0
edits_0
edits_1 <- writing here

[checkpoint node fetches image_0 and edits_0, and uploads image_1]

image_0 <- this is now "stale" and can be garbage collected later
image_1 <- this contains image_0 + edits_0
edits_0 <- this is also stale
edits_1 <- still being written

This design has many pluses in my view:
# Files never change names, and thus race conditions like HDFS-909 are less 
likely, so long as the current number is synchronized.
# Recovery is much simpler - you can always recover from image_n + edits_n 
through edits_max, so long as image_n is complete. Any incomplete or corrupt 
images can always be safely ignored so long as there is an earlier one, plus 
all the edit logs going back to that point.
# The fstime checking logic is simplified - an image made from image_N plus 
edits_N through edits_(M-1) is always going to be called image_M. Any image_M 
from any storage directory should be identical regardless of any ongoing rolls.
# Edit logs and images can both be kept for some time window, simplifying 
backup and recovery a bit while also providing an easy mechanism for 
point-in-time recovery (PITR) of the namespace. Although PITR is less than 
useful if data blocks are gone, this mechanism would make it impossible for a 
bug like HDFS-909 or HDFS-955 to lose edits, since files are never truncated 
or removed until after they're "stale".
# We no longer have to be careful about the NN's "rolled" vs "upload_done" vs 
"start" state - the logs are looked at as constantly rolling, and it's always 
clear where to apply a checkpoint image.
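The recovery rule in point 2 can be sketched as a pure file-selection function - take the newest image_N that is complete, then replay edits_N through edits_max. This is a hypothetical helper to illustrate the rule, not existing NameNode code:

```java
import java.util.*;

// Hypothetical sketch of option 3's recovery rule: pick the newest complete
// image_N, then replay edits_N through edits_max. Not existing NameNode code.
class LogSelection {
  static List<String> filesToRecover(SortedSet<Integer> imageIds,
                                     Set<Integer> completeImageIds,
                                     int maxEditsId) {
    int newest = -1;
    for (int id : imageIds) {
      // Incomplete or corrupt images are safely skipped: an older complete
      // image plus all later edit logs yields the same namespace.
      if (completeImageIds.contains(id)) newest = id;
    }
    List<String> files = new ArrayList<>();
    files.add("image_" + newest);
    for (int e = newest; e <= maxEditsId; e++) files.add("edits_" + e);
    return files;
  }
}
```

E.g. in the post-checkpoint state above, if image_1 is complete we load image_1 + edits_1; if the upload of image_1 was interrupted, we fall back to image_0 + edits_0 + edits_1.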

The downside, of course, is that it's a very big change, definitely not a 
candidate for backport, and could take a while.

h3. 4. Distinguish IMAGE_NEW_CKPT vs IMAGE_NEW_SAVED

Rather than having a single IMAGE_NEW filename like we do now, we could split 
it into IMAGE_NEW_CKPT and IMAGE_NEW_SAVED. The recovery mechanism for these 
would differ in that, if there is a completed IMAGE_NEW_CKPT, then it will 
recover IMAGE_NEW_CKPT + EDITS_NEW. If there is a completed IMAGE_NEW_SAVED, 
then it can truncate both EDITS and EDITS_NEW during recovery, since a saved 
namespace encompasses both.
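As a sketch of how the two recovery paths would differ - a hypothetical helper with illustrative names, not a proposed implementation:

```java
import java.util.*;

// Hypothetical sketch of option 4's recovery decision; names are illustrative.
class CkptVsSavedRecovery {
  // Returns the edit logs that must be replayed on top of the chosen image.
  static List<String> editsToReplay(boolean ckptComplete, boolean savedComplete) {
    if (savedComplete) {
      // A saved namespace already encompasses EDITS and EDITS_NEW, so both
      // can be truncated during recovery.
      return Collections.emptyList();
    }
    if (ckptComplete) {
      // A checkpoint image covers EDITS only; EDITS_NEW must still be replayed.
      return Arrays.asList("EDITS_NEW");
    }
    // No complete new image: fall back to IMAGE + EDITS + EDITS_NEW.
    return Arrays.asList("EDITS", "EDITS_NEW");
  }
}
```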


Unfortunately, not one of these is a simple fix. If you have any proposals that 
are both simple and correct, I'd be very interested to hear them.

One thing I'd also like to consider more is the interaction of these processes 
with filesystem journaling. I'm not sure if ext3's data=ordered
journaling mode (probably the most common deployment configuration) guarantees 
quite enough ordering between different files that all of the above will work 
correctly in the event of host failures. I need to learn more about that and 
report back.

> FSImage.saveFSImage can lose edits
> ----------------------------------
>
>                 Key: HDFS-955
>                 URL: https://issues.apache.org/jira/browse/HDFS-955
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 0.20.1, 0.21.0, 0.22.0
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>            Priority: Blocker
>         Attachments: hdfs-955-unittest.txt, PurgeEditsBeforeImageSave.patch
>
>
> This is a continuation of a discussion from HDFS-909. The FSImage.saveFSImage 
> function (implementing dfsadmin -saveNamespace) can corrupt the NN storage 
> such that all current edits are lost.
