[ https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13226378#comment-13226378 ]
Eli Collins commented on HDFS-3004:
-----------------------------------

Your comments above make sense, thanks for the explanation.

Comments on latest patch:
- HDFS-2709 (hash 110b6d0) introduced EditLogInputException and used to have places where it was caught explicitly. Now that they just catch IOE, and given that we no longer throw it either, you can remove the class entirely
- In logTruncateMessage we should log something like "stopping edit log load at position X" instead of saying we're truncating it, because we're not actually truncating the log (from the user's perspective)
- Isn't "always select the first choice" effectively "always skip"? Better to call it that, as users might think it means use the previously selected option for all future choices (eg if I chose "skip" then chose "try to fix" then "always choose 1st" I might not have meant to "always skip").
- The conditional on "answer" is probably more readable as a switch. It wasn't clear that the else clause was always "a" and that that's why we call recovery.setAlwaysChooseFirst()
- What is the "TODO: attempt to resynchronize stream here" for?
- Should use "s".equals(answer) instead of answer == "s" etc, since if for some reason RecoveryContext doesn't return the exact object it was passed in the future this would break
- Should RC#ask log at info instead of error for the prompt and the "automatically choosing" log messages?
- RC#ask javadoc needs to be updated to match the method. Also, "his choice" -> "their choice" =P
- RecoveryContext could use a high-level javadoc with a sentence or two, since the name is pretty generic and the use is very specific
- Can s/LOG.error/LOG.fatal/ in NN.java for the recovery-failed case
- NN#printUsage has two IMPORT lines
- ++i still used in a couple of files
- brackets on their own line still need fixing, eg "} else if {"
- Why does TestRecoverTruncatedEditLog make the same dir 21 times? Maybe you mean to append "i" to the path? The test should corrupt an operation that mutates the namespace (vs the last op, which I believe is an op to finalize the log segment) so you can test that that edit is not present when you reload (eg corrupt the edit to mkdir /foo then assert /foo does not exist in the namespace)

> Implement Recovery Mode
> -----------------------
>
>                 Key: HDFS-3004
>                 URL: https://issues.apache.org/jira/browse/HDFS-3004
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: tools
>            Reporter: Colin Patrick McCabe
>            Assignee: Colin Patrick McCabe
>         Attachments: HDFS-3004.008.patch, HDFS-3004__namenode_recovery_tool.txt
>
>
> When the NameNode metadata is corrupt for some reason, we want to be able to
> fix it. Obviously, we would prefer never to get in this case. In a perfect
> world, we never would. However, bad data on disk can happen from time to
> time, because of hardware errors or misconfigurations. In the past we have
> had to correct it manually, which is time-consuming and which can result in
> downtime.
> Recovery mode is initialized by the system administrator. When the NameNode
> starts up in Recovery Mode, it will try to load the FSImage file, apply all
> the edits from the edits log, and then write out a new image. Then it will
> shut down.
> Unlike in the normal startup process, the recovery mode startup process will
> be interactive. When the NameNode finds something that is inconsistent, it
> will prompt the operator as to what it should do. The operator can also
> choose to take the first option for all prompts by starting up with the '-f'
> flag, or typing 'a' at one of the prompts.
> I have reused as much code as possible from the NameNode in this tool.
> Hopefully, the effort that was spent developing this will also make the
> NameNode editLog and image processing even more robust than it already is.

--
This message is automatically generated by JIRA.
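On the "s".equals(answer) point in the review comment: a minimal standalone sketch of why answer == "s" is fragile. The class and method names here are hypothetical and are not taken from the patch; only the standard behaviour of java.lang.String is being illustrated.

```java
// Hypothetical illustration; none of these names come from the HDFS-3004 patch.
public class AnswerComparison {
    // Reference comparison: true only when both operands are the same object.
    static boolean isSkipByIdentity(String answer) {
        return answer == "s";
    }

    // Value comparison: true for any String whose contents are "s".
    // Putting the literal first also makes a null answer safe.
    static boolean isSkipByEquals(String answer) {
        return "s".equals(answer);
    }

    public static void main(String[] args) {
        String literal = "s";          // interned literal, same object as "s" above
        String copy = new String("s"); // equal contents, but a distinct object

        System.out.println(isSkipByIdentity(literal)); // true
        System.out.println(isSkipByIdentity(copy));    // false
        System.out.println(isSkipByEquals(copy));      // true
        System.out.println(isSkipByEquals(null));      // false
    }
}
```

The identity check only works while the caller gets back the very same String object it passed in, which is the assumption the review comment warns about.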