[ https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13224698#comment-13224698 ]
Eli Collins commented on HDFS-3004:
-----------------------------------

Overall approach looks good.

- Regarding edit logs in the namenode directory that "seem" to have a higher txid than the current txid: isn't the idea that we have an option to actually truncate the last edit from the log? I.e., in this patch you're asking whether the user would like to truncate, but not actually truncating.
- Is the move of the re-check of maxSeenTxid cleanup, or is it actually necessary now? I agree the re-check doesn't look necessary, though now we bail before adding found images if we can't find the maxSeenTxId in the SD images; not sure that's OK.
- logTruncateMessage should probably be WARN instead of ERROR, since we're doing it intentionally (i.e., this code path isn't an error case), but we want it to have a high log level so we always see it.
- The arg checking loop can just test for one additional argument rather than looping, since we only support 1 argument.
- Looks like loadEditRecords used to throw EditLogInputException in cases where it now throws IOE. Also, let's pull the recovery code out to a separate method vs implementing it inline in the catch block. It may even make sense to have a separate loadEditRecordsWithRecovery method.
- Needs some more test cases, e.g. w/ and w/o yes to all, and that if you restart the cluster after the recovery the fs state matches the intended state (i.e., if the last edit created a file, check that the file is not present but the rest of the state is in order).
- Easier if RecoveryContext#ask used var args?
- New files need the apache license header.
- Testing? Aside from running the tests, it would be good to try from a tarball install and start the NN with recovery, and check the various options.

Style nits:

- I'd rename "yesToAll" to something like "recoverYesToAll" so it's clear that it's recovery related.
- Method declarations should have an empty line between them.
- I would rename the EditLogInputStream var "l" to "editIn" to be consistent with the rest of the file, and the long "e" to something more descriptive like "txId".
- Both brackets go on the same line in else and catch clauses (e.g. "} else {", "} catch (..) {").
- The "can't understand" and "e.getMessage()" lines need indentation.
- Use postfix increment to be consistent (e.g. txId++ vs ++txId) when it doesn't matter.
- The opening bracket for a method goes on the same line as the throws clause (e.g. "throws IOE {").

> Implement Recovery Mode
> -----------------------
>
>                 Key: HDFS-3004
>                 URL: https://issues.apache.org/jira/browse/HDFS-3004
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: tools
>            Reporter: Colin Patrick McCabe
>            Assignee: Colin Patrick McCabe
>         Attachments: HDFS-3004.006.patch,
> HDFS-3004__namenode_recovery_tool.txt
>
>
> When the NameNode metadata is corrupt for some reason, we want to be able to
> fix it. Obviously, we would prefer never to get in this case. In a perfect
> world, we never would. However, bad data on disk can happen from time to
> time, because of hardware errors or misconfigurations. In the past we have
> had to correct it manually, which is time-consuming and which can result in
> downtime.
> Recovery mode is initialized by the system administrator. When the NameNode
> starts up in Recovery Mode, it will try to load the FSImage file, apply all
> the edits from the edits log, and then write out a new image. Then it will
> shut down.
> Unlike in the normal startup process, the recovery mode startup process will
> be interactive. When the NameNode finds something that is inconsistent, it
> will prompt the operator as to what it should do. The operator can also
> choose to take the first option for all prompts by starting up with the '-f'
> flag, or typing 'a' at one of the prompts.
> I have reused as much code as possible from the NameNode in this tool.
> Hopefully, the effort that was spent developing this will also make the
> NameNode editLog and image processing even more robust than it already is.
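
For illustration only, here is a rough sketch of how an interactive prompt like the one in the description could take its choices as var args, as suggested in the review. This is not the code from the patch: the class name RecoveryPrompt, the ask signature, the constructor flag, and the choice strings are all hypothetical. The only behavior taken from the description is that the first option is the default and that typing 'a' selects it for all later prompts.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Locale;

/** Hypothetical sketch of an interactive recovery prompt (not the patch's RecoveryContext). */
class RecoveryPrompt {
  private final BufferedReader in;

  // True when started with a '-f'-style flag or after the operator answers 'a'.
  private boolean recoverYesToAll;

  RecoveryPrompt(BufferedReader in, boolean recoverYesToAll) {
    this.in = in;
    this.recoverYesToAll = recoverYesToAll;
  }

  /**
   * Asks the operator to pick one of the given choices and returns the pick.
   * Var args let callers list any number of alternatives after the default.
   */
  String ask(String prompt, String firstChoice, String... otherChoices) {
    if (recoverYesToAll) {
      return firstChoice;
    }
    StringBuilder options = new StringBuilder(firstChoice);
    for (String choice : otherChoices) {
      options.append('/').append(choice);
    }
    while (true) {
      System.out.println(prompt + " (" + options
          + ", or 'a' to always take the first option)");
      String line;
      try {
        line = in.readLine();
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
      if (line == null) {
        throw new IllegalStateException("input closed before a choice was made");
      }
      String answer = line.trim().toLowerCase(Locale.ROOT);
      if (answer.equals("a")) {
        recoverYesToAll = true;
        return firstChoice;
      }
      if (answer.equals(firstChoice)) {
        return firstChoice;
      }
      for (String choice : otherChoices) {
        if (answer.equals(choice)) {
          return choice;
        }
      }
      System.out.println("can't understand '" + line.trim() + "'");
    }
  }

  public static void main(String[] args) {
    // Scripted answers stand in for an operator typing at the console.
    BufferedReader scripted =
        new BufferedReader(new StringReader("bogus\nstop\na\n"));
    RecoveryPrompt prompt = new RecoveryPrompt(scripted, false);
    System.out.println(prompt.ask("Edit log ends early.", "continue", "stop"));
    System.out.println(prompt.ask("Unreadable opcode.", "skip", "quit"));
    System.out.println(prompt.ask("Trailing edits found.", "truncate", "quit"));
  }
}
```

With var args, a call site with two options and one with five look the same, and the reader is injected so tests can script the "w/ and w/o yes to all" cases without a console.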