[ 
https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3004:
---------------------------------------

    Attachment:     (was: HDFS-3004__namenode_recovery_tool.txt)
    
> Create Offline NameNode recovery tool
> -------------------------------------
>
>                 Key: HDFS-3004
>                 URL: https://issues.apache.org/jira/browse/HDFS-3004
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: tools
>            Reporter: Colin Patrick McCabe
>            Assignee: Colin Patrick McCabe
>         Attachments: HDFS-3004__namenode_recovery_tool.txt
>
>
> We've been talking about creating a tool which can process NameNode edit logs 
> and image files offline.
> This tool would be similar to fsck for a conventional filesystem.  It would
> detect inconsistencies and malformed data.  Where a fix was possible and the
> operator asked for it, it would try to correct the inconsistency.
> It's probably better to call this "nameNodeRecovery" or similar, rather than 
> "fsck," since we already have a separate and unrelated mechanism which we 
> refer to as fsck.
> The use case here is that the NameNode data is corrupt for some reason, and
> we want to fix it.  Obviously, we would prefer never to end up in this
> situation.  In a perfect world, we never would.  However, bad data can end up
> on disk from time to time because of hardware errors or misconfigurations.
> In the past we have had to correct it manually, which is time-consuming and
> can result in downtime.
> I would like to reuse as much code as possible from the NameNode in this 
> tool.  Hopefully, the effort that is spent developing this will also make the 
> NameNode editLog and image processing even more robust than it already is.
> Another approach that we have discussed is NOT having an offline tool, but
> just having a switch supplied to the NameNode, like "--auto-fix" or
> "--force-fix".  In that case, the NameNode would attempt to "guess" when data
> was missing or incomplete in the EditLog or Image, rather than aborting as
> it does now.  Like the proposed offline tool, this switch could be used to
> get users back on their feet quickly after a problem developed.  I am not in
> favor of this approach, because there is a danger that users could supply
> this flag in cases where it is not appropriate.  This risk does not exist for
> an offline tool, since it would have to be run explicitly.  However, I
> wanted to mention this proposal here for completeness.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

