[jira] [Commented] (HDFS-3004) Implement Recovery Mode

Eli Collins (Commented) (JIRA) Wed, 07 Mar 2012 12:51:29 -0800

    [ 
https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13224698#comment-13224698
 ]


Eli Collins commented on HDFS-3004:
-----------------------------------

Overall approach looks good.

- Wrt edit logs in the namenode directory that "seem" to have a higher txid 
than the current txid, isn't the idea that we have an option to actually 
truncate the last edit from the log? Ie in this patch you're asking if the user 
would like to truncate but not actually truncating.
- Is the move of the re-check of maxSeenTxid cleanup or actually necessary now? 
I agree the re-check doesn't look necessary though now we bail before adding 
found images if we can't find the maxSeenTxId in the SD images, not sure that's 
OK.
- logTruncateMessage should probably be WARN instead of ERROR since we're doing 
it intentionally (ie this code path isn't an error case), but we want it to 
have a high log level so we always see it.
- In the arg checking loop can just test for one additional argument rather 
than looping since we only support 1 argument
- Looks like loadEditRecords used to throw EditLogInputException in cases it 
now throws IOE. Also, let's pull the recovery code out to a separate method vs 
implementing inline in the catch block. It may even make sense to have a 
separate loadEditRecordsWithRecovery method
- Needs some more test cases, eg w/ and w/o yes to all, and that if you restart 
the cluster after the recovery the fs state matches the intended state (ie if 
the last edit created a file check that file is not present, but the rest of 
the state is in order)
- Easier if RecoveryContext#ask used var args?
- New files need the apache license header
- Testing?  Aside from running the tests would be good to try from a tarball 
install and start the NN with recovery, check the various options
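
To illustrate the var args and separate-method suggestions above, here is a 
minimal sketch, not code from the patch. RecoveryContextSketch, 
EditLoaderSketch, recoverFromBadEdit, the "truncate"/"quit" choices and the 
"CORRUPT" marker are all hypothetical names made up for the example; edits are 
modelled as plain strings and the prompts are answered from a scripted list 
rather than stdin.

```java
import java.io.IOException;
import java.util.Iterator;
import java.util.List;

/** Hypothetical stand-in for the patch's RecoveryContext; not the real class. */
final class RecoveryContextSketch {
  private boolean recoverYesToAll;
  // Scripted answers replace stdin so the sketch runs unattended.
  private final Iterator<String> answers;

  RecoveryContextSketch(boolean recoverYesToAll, List<String> answers) {
    this.recoverYesToAll = recoverYesToAll;
    this.answers = answers.iterator();
  }

  /** Var args keep call sites short; the first choice is the default. */
  String ask(String prompt, String firstChoice, String... otherChoices) {
    if (recoverYesToAll) {
      return firstChoice;
    }
    while (answers.hasNext()) {
      String answer = answers.next();
      if (answer.equals("a")) {
        // 'a' switches to taking the first option for all later prompts.
        recoverYesToAll = true;
        return firstChoice;
      }
      if (answer.equals(firstChoice)) {
        return firstChoice;
      }
      for (String choice : otherChoices) {
        if (answer.equals(choice)) {
          return choice;
        }
      }
    }
    throw new IllegalStateException("no valid answer for: " + prompt);
  }
}

/** Hypothetical loader; edits are strings and "CORRUPT" marks a bad one. */
final class EditLoaderSketch {
  static final String CORRUPT = "CORRUPT";

  /** Strict path: returns the number of edits read, or throws on a bad one. */
  static int loadEditRecords(List<String> edits) throws IOException {
    int numEdits = 0;
    for (String edit : edits) {
      if (CORRUPT.equals(edit)) {
        throw new IOException("can't understand edit " + numEdits);
      }
      numEdits++;
    }
    return numEdits;
  }

  /** The recovery path lives in its own method, not inline in the catch. */
  static int loadEditRecordsWithRecovery(List<String> edits,
      RecoveryContextSketch ctx) throws IOException {
    try {
      return loadEditRecords(edits);
    } catch (IOException e) {
      return recoverFromBadEdit(edits, ctx, e);
    }
  }

  private static int recoverFromBadEdit(List<String> edits,
      RecoveryContextSketch ctx, IOException cause) throws IOException {
    String choice = ctx.ask(cause.getMessage() + "; truncate here?",
        "truncate", "quit");
    if (!choice.equals("truncate")) {
      throw cause;
    }
    // Keep only the edits before the first unreadable one.
    return edits.indexOf(CORRUPT);
  }
}
```

Scripting the answers this way would also make it straightforward to cover the 
"w/ and w/o yes to all" cases in a unit test.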

Style nits:
- I'd rename "yesToAll" to something like "recoverYesToAll" so it's clear that 
it's recovery related
- Method declarations should have an empty line between them
- would rename EditLogInputStream var "l" to "editIn" to be consistent with the 
rest of the file, and long "e" to something more descriptive like "txId"
- Both brackets go on the same line in else and catch clauses (eg "} else {", 
eg "} catch (..) {")
- "can't understand" and "e.getMessage()" lines need indentation
- use postfix increment to be consistent (eg txId++ vs ++txId) when it doesn't 
matter
- the opening bracket for a method goes on the same line as the throws clause 
(eg "throws IOE {")
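
For reference, a small sketch of what the nits above look like when applied 
together. StyleNitsSketch and its methods are hypothetical and only illustrate 
layout and naming; none of this is taken from the patch.

```java
import java.io.IOException;
import java.util.Iterator;

/** Hypothetical class that only illustrates the style points above. */
final class StyleNitsSketch {
  /** Counts the edits in editIn; an empty string models an unreadable edit. */
  static long countEdits(Iterator<String> editIn, long firstTxId)
      throws IOException {            // opening bracket on the throws line
    long txId = firstTxId;            // descriptive name rather than "e"
    try {
      while (editIn.hasNext()) {
        if (editIn.next().isEmpty()) {
          throw new IOException("can't understand edit at txid " + txId);
        } else {                      // both brackets on the same line
          txId++;                     // postfix increment
        }
      }
    } catch (RuntimeException e) {    // both brackets on the same line
      throw new IOException("failed at txid " + txId + ": "
          + e.getMessage(), e);       // continuation lines are indented
    }
    return txId - firstTxId;
  }

  /** Separated from the method above by an empty line. */
  static long lastTxId(long firstTxId, long numEdits) {
    return firstTxId + numEdits - 1;
  }
}
```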
                
> Implement Recovery Mode
> -----------------------
>
>                 Key: HDFS-3004
>                 URL: https://issues.apache.org/jira/browse/HDFS-3004
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: tools
>            Reporter: Colin Patrick McCabe
>            Assignee: Colin Patrick McCabe
>         Attachments: HDFS-3004.006.patch, 
> HDFS-3004__namenode_recovery_tool.txt
>
>
> When the NameNode metadata is corrupt for some reason, we want to be able to 
> fix it.  Obviously, we would prefer never to get in this case.  In a perfect 
> world, we never would.  However, bad data on disk can happen from time to 
> time, because of hardware errors or misconfigurations.  In the past we have 
> had to correct it manually, which is time-consuming and which can result in 
> downtime.
> Recovery mode is initialized by the system administrator.  When the NameNode 
> starts up in Recovery Mode, it will try to load the FSImage file, apply all 
> the edits from the edits log, and then write out a new image.  Then it will 
> shut down.
> Unlike in the normal startup process, the recovery mode startup process will 
> be interactive.  When the NameNode finds something that is inconsistent, it 
> will prompt the operator as to what it should do.   The operator can also 
> choose to take the first option for all prompts by starting up with the '-f' 
> flag, or typing 'a' at one of the prompts.
> I have reused as much code as possible from the NameNode in this tool.  
> Hopefully, the effort that was spent developing this will also make the 
> NameNode editLog and image processing even more robust than it already is.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
