[ https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13218364#comment-13218364 ]
Todd Lipcon commented on HDFS-3004:
-----------------------------------

Thanks for the doc. Makes sense.

The only addition I'd make is that I think it would make sense to run it interactively, like "fsck" without the "-y" flag. Each question can have "yes/no/yes-all/no-all" type choices (where "all" would answer the same to all following questions of the same type)

> Create Offline NameNode recovery tool
> -------------------------------------
>
>                 Key: HDFS-3004
>                 URL: https://issues.apache.org/jira/browse/HDFS-3004
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: tools
>            Reporter: Colin Patrick McCabe
>            Assignee: Colin Patrick McCabe
>         Attachments: HDFS-3004__namenode_recovery_tool.txt
>
>
> We've been talking about creating a tool which can process NameNode edit logs
> and image files offline.
> This tool would be similar to a fsck for a conventional filesystem. It would
> detect inconsistencies and malformed data. In cases where it was possible,
> and the operator asked for it, it would try to correct the inconsistency.
> It's probably better to call this "nameNodeRecovery" or similar, rather than
> "fsck," since we already have a separate and unrelated mechanism which we
> refer to as fsck.
> The use case here is that the NameNode data is corrupt for some reason, and
> we want to fix it. Obviously, we would prefer never to get in this case. In
> a perfect world, we never would. However, bad data on disk can happen from
> time to time, because of hardware errors or misconfigurations. In the past
> we have had to correct it manually, which is time-consuming and which can
> result in downtime.
> I would like to reuse as much code as possible from the NameNode in this
> tool. Hopefully, the effort that is spent developing this will also make the
> NameNode editLog and image processing even more robust than it already is.
> Another approach that we have discussed is NOT having an offline tool, but
> just having a switch supplied to the NameNode, like "—auto-fix" or
> "—force-fix". In that case, the NameNode would attempt to "guess" when data
> was missing or incomplete in the EditLog or Image-- rather than aborting as
> it does now. Like the proposed fsck tool, this switch could be used to get
> users back on their feet quickly after a problem developed. I am not in
> favor of this approach, because there is a danger that users could supply
> this flag in cases where it is not appropriate. This risk does not exist for
> an offline fsck tool, since it would have to be run explicitly. However, I
> wanted to mention this proposal here for completeness.
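
The interactive mode suggested in the comment above (yes/no/yes-all/no-all, where "all" applies to later questions of the same type) could be sketched roughly as follows. This is only an illustration in Python, not code from the attached design doc or from Hadoop. The class name RecoveryPrompter, the method confirm, and the question-type strings are hypothetical names chosen for the example.

```python
from typing import Callable, Dict

CHOICES = ("yes", "no", "yes-all", "no-all")


class RecoveryPrompter:
    """Asks yes/no questions and remembers "-all" answers per question type.

    Hypothetical helper for illustration; not part of any Hadoop API.
    """

    def __init__(self, read_answer: Callable[[str], str] = input) -> None:
        # read_answer receives the prompt text and returns the operator's reply.
        self._read_answer = read_answer
        # Maps a question type to the decision fixed by "yes-all" or "no-all".
        self._sticky: Dict[str, bool] = {}

    def confirm(self, question_type: str, message: str) -> bool:
        # An earlier "-all" answer for this type is reused without prompting.
        if question_type in self._sticky:
            return self._sticky[question_type]

        prompt = f"{message} ({'/'.join(CHOICES)}) "
        while True:
            answer = self._read_answer(prompt).strip().lower()
            if answer in CHOICES:
                break  # unrecognized replies cause the question to be asked again

        decision = answer.startswith("yes")
        if answer.endswith("-all"):
            self._sticky[question_type] = decision
        return decision


if __name__ == "__main__":
    prompter = RecoveryPrompter()
    # "truncated-edit-log" is a made-up question type.
    for txid in (101, 102, 103):
        if prompter.confirm("truncated-edit-log", f"Skip damaged edit {txid}?"):
            print(f"skipping edit {txid}")
        else:
            print(f"leaving edit {txid} untouched")
```

Keying the remembered answer by question type means a "yes-all" given for one kind of problem does not silently approve fixes for a different kind, which matches the "same type" wording in the comment.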