[ https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13228186#comment-13228186 ]
Eli Collins commented on HDFS-3004:
-----------------------------------

bq. The first choice isn't always skip-- sometimes it's "truncate."

Why would a user choose "always choose 1st"? The user doesn't know whether future errors are skippable or not, so when they select "always choose first" on a skippable prompt they don't know that they're signing up for a future truncate. Seems like we need to make the order consistent if we're going to give people a "Yes to all" option.

- Per above, what is the "TODO: attempt to resynchronize stream here" for?
- Should the catch of Throwable catch IOException like it used to? We're not trying to catch new types of exceptions in the non-recovery case, right?
- Do we need to sanity check dfs.namenode.num.checkpoints.retained in recovery mode? I.e., since we do roll the log, is there any way that we could load an image/log, truncate it in recovery mode, then not retain the old log?
- TestRecoverTruncatedEditLog still doesn't check that we actually truncated the log; e.g., even if we didn't truncate the log, the test would still pass because the directory would still be there.
- What testing have you done? Would be good to try this on a tarball build with various corrupt and non-corrupt images/logs.

> Implement Recovery Mode
> -----------------------
>
>                 Key: HDFS-3004
>                 URL: https://issues.apache.org/jira/browse/HDFS-3004
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: tools
>            Reporter: Colin Patrick McCabe
>            Assignee: Colin Patrick McCabe
>         Attachments: HDFS-3004.010.patch,
> HDFS-3004__namenode_recovery_tool.txt
>
>
> When the NameNode metadata is corrupt for some reason, we want to be able to
> fix it. Obviously, we would prefer never to get into this situation. In a
> perfect world, we never would. However, bad data on disk can happen from time
> to time, because of hardware errors or misconfigurations. In the past we have
> had to correct it manually, which is time-consuming and which can result in
> downtime.
> Recovery mode is initiated by the system administrator. When the NameNode
> starts up in Recovery Mode, it will try to load the FSImage file, apply all
> the edits from the edits log, and then write out a new image. Then it will
> shut down.
> Unlike in the normal startup process, the recovery mode startup process will
> be interactive. When the NameNode finds something that is inconsistent, it
> will prompt the operator as to what it should do. The operator can also
> choose to take the first option for all prompts by starting up with the '-f'
> flag, or typing 'a' at one of the prompts.
> I have reused as much code as possible from the NameNode in this tool.
> Hopefully, the effort that was spent developing this will also make the
> NameNode editLog and image processing even more robust than it already is.
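
To make the "always choose first" concern concrete, here is a minimal sketch of one way a "yes to all" answer could be tied to a named action rather than to a position in the list. It is not taken from the HDFS-3004 patch: the class RecoveryPromptSketch, the Action enum, and the resolve() method are all hypothetical names invented for illustration. The idea is that a sticky answer is reused only when that same action is offered again, so picking "skip for all" on a skippable error can never silently turn into a truncate later.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch only; not the prompt code from the HDFS-3004 patch. */
public class RecoveryPromptSketch {
    /** Illustrative actions an operator might be offered at a prompt. */
    public enum Action { SKIP, TRUNCATE, QUIT }

    /** The action the operator asked to apply to all later prompts, or null. */
    private Action sticky = null;

    /**
     * Resolve one prompt.
     *
     * @param offered    actions that are valid for the current error
     * @param answer     the operator's explicit pick (may be null if a sticky
     *                   action is expected to cover this prompt)
     * @param applyToAll whether the operator wants this pick remembered
     * @return the action to take
     */
    public Action resolve(List<Action> offered, Action answer, boolean applyToAll) {
        // Reuse the remembered action only if this prompt actually offers it.
        if (sticky != null && offered.contains(sticky)) {
            return sticky;
        }
        // Otherwise the operator must answer explicitly.
        if (answer == null || !offered.contains(answer)) {
            throw new IllegalArgumentException("operator must choose one of " + offered);
        }
        if (applyToAll) {
            sticky = answer;
        }
        return answer;
    }

    public static void main(String[] args) {
        RecoveryPromptSketch prompt = new RecoveryPromptSketch();
        List<Action> skippable = Arrays.asList(Action.SKIP, Action.QUIT);
        List<Action> notSkippable = Arrays.asList(Action.TRUNCATE, Action.QUIT);

        // Operator picks "skip" and asks for it to apply to all prompts.
        System.out.println(prompt.resolve(skippable, Action.SKIP, true));
        // A later skippable error is resolved without asking again.
        System.out.println(prompt.resolve(skippable, null, false));
        // A non-skippable error does not offer SKIP, so the operator is asked.
        System.out.println(prompt.resolve(notSkippable, Action.QUIT, false));
    }
}
```

With this shape, the order in which options are listed no longer matters for the "all" answer; whether that fits the '-f' flag described in the issue is a separate design question.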