[ https://issues.apache.org/jira/browse/HDFS-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13233591#comment-13233591 ]
Eli Collins commented on HDFS-3044:
-----------------------------------

- Still needs a test of the new behavior of fsck move, i.e. that it's not destructive (a test that covers move without delete, and asserts the source files are still there).
- Nit: if we're going to name the flag "doMove" (vs. e.g. salvageCorruptFiles), please add a comment by the declaration noting that "doMove" doesn't actually do a move anymore (since it no longer deletes, it's a copy now).

Otherwise looks great!

> fsck move should be non-destructive by default
> ----------------------------------------------
>
>                 Key: HDFS-3044
>                 URL: https://issues.apache.org/jira/browse/HDFS-3044
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: name-node
>            Reporter: Eli Collins
>            Assignee: Colin Patrick McCabe
>        Attachments: HDFS-3044.002.patch
>
>
> The fsck move behavior in the code, as originally articulated in HADOOP-101, is:
> {quote}Current failure modes for DFS involve blocks that are completely missing. The only way to "fix" them would be to recover chains of blocks and put them into lost+found{quote}
> A directory is created with the file name, the blocks that are accessible are created as individual files in this directory, then the original file is removed.
> I suspect the rationale for this behavior was that you can't use files that are missing locations, and copying the blocks out as files at least makes part of the file accessible. However, this behavior can also result in permanent data loss. E.g.:
> - Some datanodes don't come up (e.g. due to HW issues) and check in on cluster startup; files whose blocks have all replicas on this set of datanodes are marked corrupt.
> - The admin runs fsck move, which deletes the "corrupt" files and saves whatever blocks were available.
> - The HW issues with the datanodes are resolved; they are started and join the cluster. The NN tells them to delete their blocks for the corrupt files, since the files were deleted.
> I think we should:
> - Make fsck move non-destructive by default (i.e. it just copies into lost+found).
> - Make the destructive behavior optional (e.g. a "--destructive" flag, so admins think about what they're doing).
> - Provide better sanity checks and warnings. E.g. if you're running fsck and not all the slaves have checked in (if using dfs.hosts), fsck should print a warning indicating this, which an admin should have to override if they want to do something destructive.
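The non-destructive-by-default split proposed above can be sketched as a toy model. This is a minimal, hypothetical illustration (the function `salvage_file` and its `destructive` parameter are invented for this sketch; they are not the actual NameNode/fsck code): recoverable contents are copied into a lost+found directory, and the source file is only removed when destruction is explicitly requested.

```python
import os
import shutil
import tempfile

def salvage_file(src_path, lost_found_dir, destructive=False):
    """Copy the recoverable contents of src_path into lost+found.

    Non-destructive by default: the source file is left in place
    unless destructive=True is passed explicitly.
    """
    dest_dir = os.path.join(lost_found_dir, os.path.basename(src_path))
    os.makedirs(dest_dir, exist_ok=True)
    # In HDFS each accessible block would become its own file under the
    # per-file directory; here a single "block" copy stands in for that.
    shutil.copy(src_path, os.path.join(dest_dir, "block0"))
    if destructive:
        os.remove(src_path)  # the old, destructive fsck-move behavior
    return dest_dir

# Usage: the default call leaves the original file intact.
root = tempfile.mkdtemp()
src = os.path.join(root, "data.txt")
with open(src, "w") as fh:
    fh.write("block contents")
lost_found = os.path.join(root, "lost+found")
salvage_file(src, lost_found)                     # non-destructive default
assert os.path.exists(src)                        # source survives
salvage_file(src, lost_found, destructive=True)   # explicit opt-in
assert not os.path.exists(src)                    # deleted only when asked
```

A test along these lines (run the salvage without the destructive flag, then assert the source files still exist) is exactly the kind of coverage the review comment above asks for.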