[ https://issues.apache.org/jira/browse/HDFS-2305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Aaron T. Myers updated HDFS-2305:
---------------------------------

    Attachment: hdfs-2305.1.patch

> Running multiple 2NNs can result in corrupt file system
> -------------------------------------------------------
>
>                 Key: HDFS-2305
>                 URL: https://issues.apache.org/jira/browse/HDFS-2305
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 0.20.2
>            Reporter: Aaron T. Myers
>            Assignee: Aaron T. Myers
>         Attachments: hdfs-2305-test.patch, hdfs-2305.0.patch, hdfs-2305.1.patch
>
>
> Here's the scenario:
> * You run the NN and 2NN (2NN A) on the same machine.
> * You don't have the address of the 2NN configured, so it's defaulting to 127.0.0.1.
> * There's another 2NN (2NN B) running on a second machine.
> * When a 2NN is done checkpointing, it says "hey NN, I have an updated fsimage for you. You can download it from this URL, which includes my IP address, which is x"
> And here are the steps that occur to cause this issue:
> # Some edits happen.
> # 2NN A (on the NN machine) does a checkpoint. All is dandy.
> # Some more edits happen.
> # 2NN B (on a different machine) does a checkpoint. It tells the NN "grab the newly-merged fsimage file from 127.0.0.1"
> # NN happily grabs the fsimage from 2NN A (the 2NN on the NN machine), which is stale.
> # NN renames the edits.new file to edits. At this point the in-memory FS state is fine, but the on-disk state is missing edits.
> # The next time a 2NN (any 2NN) tries to do a checkpoint, it gets an up-to-date edits file with an outdated fsimage, and tries to apply those edits to that fsimage.
> # Kaboom.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
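The step list above can be replayed with a toy model. This is a minimal sketch, not Hadoop code: the classes `NameNode` and `SecondaryNameNode`, the `replay` helper, and the `resolve` lookup are all hypothetical. It assumes the fsimage and edits files can be modeled as lists of `mkdir` paths, and that, from the NN's point of view, the advertised address 127.0.0.1 always reaches 2NN A, because 2NN A is the one running on the NN machine.

```python
"""Toy model of the HDFS-2305 checkpoint race (hypothetical, not Hadoop code)."""


class NameNode:
    def __init__(self):
        self.in_memory = []    # ops applied to the live namespace
        self.fsimage = []      # ops captured by the on-disk fsimage
        self.edits = []        # ops in the on-disk "edits" file
        self.edits_new = None  # ops in "edits.new" while a checkpoint is in flight

    def apply(self, op):
        self.in_memory.append(op)
        target = self.edits if self.edits_new is None else self.edits_new
        target.append(op)

    def roll_edit_log(self):
        # Start logging to edits.new so the 2NN can download a frozen edits file.
        self.edits_new = []

    def roll_fsimage(self, image):
        # Install the fetched image, then rename edits.new -> edits.
        self.fsimage = list(image)
        self.edits = self.edits_new
        self.edits_new = None


class SecondaryNameNode:
    def __init__(self, name, advertised_addr):
        self.name = name
        self.advertised_addr = advertised_addr
        self.merged = []  # last image this 2NN produced and now serves

    def checkpoint(self, nn, resolve):
        nn.roll_edit_log()
        self.merged = nn.fsimage + nn.edits  # merge downloaded image + edits
        # The NN pulls the new image from whichever 2NN the advertised address
        # reaches on the NN's side, which is not necessarily this 2NN.
        source = resolve(self.advertised_addr)
        nn.roll_fsimage(source.merged)


def replay(ops):
    """Replay mkdir ops; a missing parent means the image and edits disagree."""
    dirs = {"/"}
    for op in ops:
        parent = op.rsplit("/", 1)[0] or "/"
        if parent not in dirs:
            raise RuntimeError(f"cannot apply {op!r}: parent {parent!r} missing")
        dirs.add(op)
    return dirs


def run_scenario():
    nn = NameNode()
    a = SecondaryNameNode("2NN A", "127.0.0.1")  # same machine as the NN
    b = SecondaryNameNode("2NN B", "127.0.0.1")  # other machine, address unconfigured
    resolve = {"127.0.0.1": a}.get  # on the NN machine, 127.0.0.1 is 2NN A

    nn.apply("/a")             # 1. some edits happen
    a.checkpoint(nn, resolve)  # 2. 2NN A checkpoints; all is dandy
    nn.apply("/a/b")           # 3. some more edits happen
    b.checkpoint(nn, resolve)  # 4-6. NN fetches A's stale image, renames edits.new
    nn.apply("/a/b/c")         # later edits land in the new edits file
    return nn


if __name__ == "__main__":
    nn = run_scenario()
    replay(nn.in_memory)  # the in-memory namespace is still consistent
    try:
        replay(nn.fsimage + nn.edits)  # what the next checkpoint would merge
    except RuntimeError as err:
        print("Kaboom:", err)
```

Running it prints `Kaboom: cannot apply '/a/b/c': parent '/a/b' missing`: the `/a/b` edit was merged only into 2NN B's image, which the NN never fetched, so the on-disk fsimage plus edits no longer replays. In this model, giving 2NN B an advertised address that resolves to 2NN B itself makes the NN install B's merged image and the replay succeeds.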