[ https://issues.apache.org/jira/browse/HDFS-955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832706#action_12832706 ]
Todd Lipcon commented on HDFS-955:
----------------------------------

I worked through this a bit last night. Here are some options for a solution.

h3. 1. Add "undo log" file to storage directory

In this solution, we add a new file called "undolog" in each storage directory. Whenever we're in the midst of a transition, we write some bit of data into this file that explains the proper rollback procedure. Thus, for the checkpoint from the checkpoint node, we'd write a file that says "if IMAGE_NEW is complete, use IMAGE_NEW + EDITS_NEW; otherwise use IMAGE + EDITS + EDITS_NEW". For the saveNamespace operation, we'd write "if IMAGE_NEW is complete, use IMAGE_NEW; otherwise use IMAGE + EDITS + EDITS_NEW".

This has the advantage of making the recovery choices explicit during all state transitions - we're forced to think carefully after each step of the operation in order to maintain the undo instructions. On the downside, it adds complexity.

h3. 2. Don't allow -saveNamespace when the logs are in ROLLED state

I don't like this one at all, but it would allow us to always use the IMAGE_NEW + EDITS_NEW recovery.

h3. 3. Redesign rolling to not reuse filenames

This is a much bigger change, but I think it would also help simplify a lot of the code. The proposal here is to manage edit logs in a way that's similar to what MySQL does. Specifically, instead of IMAGE and IMAGE_NEW plus EDITS and EDITS_NEW, we simply put a monotonically increasing identifier on each log file. So, the state of the system starts with image_0 and edits_0. Logs may be rolled at any point, which increments the edits index.
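As a rough illustration of how a storage directory could be interpreted under this naming scheme, here is a hypothetical helper (class and method names are invented, not actual NameNode code) that picks the newest complete image_N and the edits files to replay on top of it:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/**
 * Sketch only: given the file names present in a storage directory
 * under the monotonic naming scheme (image_N, edits_N), choose a
 * recovery plan: the highest-numbered complete image, followed by
 * every edits file with an index >= that image's index.
 */
public class MonotonicLogRecovery {

    /** Parse the numeric suffix of a name like "image_3" or "edits_7"; -1 if no match. */
    static int indexOf(String name, String prefix) {
        if (!name.startsWith(prefix + "_")) return -1;
        try {
            return Integer.parseInt(name.substring(prefix.length() + 1));
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    static List<String> recoveryPlan(List<String> files, List<String> incomplete) {
        int bestImage = -1;
        for (String f : files) {
            int idx = indexOf(f, "image");
            // An incomplete or corrupt image is safely ignored, as long as
            // an earlier image plus the intervening edit logs survive.
            if (idx > bestImage && !incomplete.contains(f)) bestImage = idx;
        }
        List<Integer> editIndexes = new ArrayList<>();
        for (String f : files) {
            int idx = indexOf(f, "edits");
            if (idx >= bestImage) editIndexes.add(idx);
        }
        Collections.sort(editIndexes);
        List<String> plan = new ArrayList<>();
        plan.add("image_" + bestImage);
        for (int idx : editIndexes) plan.add("edits_" + idx);
        return plan;
    }
}
```

For example, given image_0, image_1, edits_0, edits_1 where image_1 is incomplete, the plan falls back to image_0 + edits_0 + edits_1; once image_1 is complete, it becomes image_1 + edits_1.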
So in normal operation we'd see:

image_0
edits_0   <- writing here

[roll edits]

image_0
edits_0
edits_1   <- writing here

[checkpoint node fetches image_0 and edits_0, and uploads image_1]

image_0   <- this is now "stale" and can be garbage collected later
image_1   <- this contains image_0 + edits_0
edits_0   <- this is also stale
edits_1   <- still being written

This design has many plusses in my view:
# Files never change names, so race conditions like HDFS-909 are less likely, so long as the current number is synchronized.
# Recovery is much simpler - you can always recover from image_N plus edits_N through edits_max, so long as image_N is complete. Any incomplete or corrupt image can always be safely ignored so long as there is an earlier one, plus all the edit logs going back to that point.
# The fstime checking logic is simplified - an image made from image_N plus edits_N through edits_(M-1) is always going to be called image_M. Any image_M from any storage directory should be identical, regardless of any ongoing rolls.
# Edit logs and images can both be kept for some time window, simplifying backup and recovery a bit while also providing an easy mechanism for point-in-time recovery of the namespace. Although PITR is of limited use if the data blocks are gone, this mechanism would make it impossible for a bug like HDFS-909 or HDFS-955 to lose edits, since files are never truncated or removed until after they're "stale".
# We no longer have to be careful about the NN's "rolled" vs. "upload_done" vs. "start" state - the logs are treated as constantly rolling, and it's always clear where to apply a checkpoint image.

The downside, of course, is that this is a very big change - definitely not a candidate for backport - and could take a while.

h3. 4. Distinguish IMAGE_NEW_CKPT vs. IMAGE_NEW_SAVED

Rather than having a single IMAGE_NEW filename like we do now, we could split it into IMAGE_NEW_CKPT and IMAGE_NEW_SAVED.
The recovery mechanism for these would differ: if there is a completed IMAGE_NEW_CKPT, recovery uses IMAGE_NEW_CKPT + EDITS_NEW. If there is a completed IMAGE_NEW_SAVED, recovery can truncate both EDITS and EDITS_NEW, since a saved namespace encompasses both.

Unfortunately, none of these is a simple fix. If you have any proposals that are both simple and correct, I'd be very interested to hear them.

One thing I'd also like to consider further is the interaction of these processes with filesystem journaling. I'm not sure that ext3's data=ordered journaling mode (probably the most common deployment configuration) guarantees enough ordering between different files for all of the above to work correctly in the event of host failures. I need to learn more about that and report back.

> FSImage.saveFSImage can lose edits
> ----------------------------------
>
>                 Key: HDFS-955
>                 URL: https://issues.apache.org/jira/browse/HDFS-955
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 0.20.1, 0.21.0, 0.22.0
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>            Priority: Blocker
>         Attachments: hdfs-955-unittest.txt, PurgeEditsBeforeImageSave.patch
>
> This is a continuation of a discussion from HDFS-909. The FSImage.saveFSImage function (implementing dfsadmin -saveNamespace) can corrupt the NN storage such that all current edits are lost.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.