[jira] [Commented] (HDFS-5546) race condition crashes "hadoop ls -R" when directories are moved/removed

Colin Patrick McCabe (JIRA) Tue, 17 Jun 2014 18:44:29 -0700

    [ https://issues.apache.org/jira/browse/HDFS-5546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034712#comment-14034712 ]


Colin Patrick McCabe commented on HDFS-5546:
--------------------------------------------

bq. Maybe I'm misunderstanding the description. Is this jira only trying to 
address a tiny race if the path existed when the command started, but 
disappeared before being listed? If yes, then FNF is exactly the correct 
behavior. After that, the stats you see being checked in the code are supposed 
to be from listStatus.

The problem is that right now, we have a race between getting back a directory 
entry from {{listStatus}} on the parent directory, and calling {{listStatus}} 
on it.  Think of the following interleaving:

1. Eddy issues "hadoop fs -ls -R /"
2. ls command calls {{listStatus("/")}} and gets back status_a, status_b, 
status_c
3. ls command uses status_a to print out a line describing /a
4. Colin removes directory a
5. ls command calls {{listStatus("/a")}}
6. {{FileNotFoundException}} aborts the whole ls command.  Nothing else is 
printed.
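
The interleaving above can be reproduced deterministically with a small sketch. 
This is not the Hadoop shell code: the in-memory {{dirs}} map, the 
{{listStatus}} stand-in, and the {{afterPrint}} hook (which plays the part of 
the concurrent delete in step 4) are all hypothetical names made up for 
illustration.

```java
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class LsRaceDemo {
    // Hypothetical in-memory namespace: directory path -> child directory names.
    public static final Map<String, List<String>> dirs = new TreeMap<>();

    // Stand-in for listStatus: fails if the directory no longer exists.
    public static List<String> listStatus(String path) throws FileNotFoundException {
        List<String> children = dirs.get(path);
        if (children == null) {
            throw new FileNotFoundException("File " + path + " does not exist.");
        }
        List<String> result = new ArrayList<>();
        for (String name : children) {
            result.add(path.equals("/") ? "/" + name : path + "/" + name);
        }
        return result;
    }

    // Naive recursion: any FileNotFoundException aborts the whole listing.
    public static void lsRecursive(String path, List<String> out, Runnable afterPrint)
            throws FileNotFoundException {
        for (String child : listStatus(path)) {   // step 2: list the parent
            out.add(child);                       // step 3: print the entry
            afterPrint.run();                     // step 4: concurrent change
            lsRecursive(child, out, afterPrint);  // step 5: list the child
        }
    }

    public static List<String> run() {
        dirs.clear();
        dirs.put("/", List.of("a", "b", "c"));
        dirs.put("/a", List.of());
        dirs.put("/b", List.of());
        dirs.put("/c", List.of());
        List<String> printed = new ArrayList<>();
        try {
            // The hook removes /a right after its line has been printed.
            lsRecursive("/", printed, () -> dirs.remove("/a"));
        } catch (FileNotFoundException e) {
            printed.add("ABORTED: " + e.getMessage()); // step 6
        }
        return printed;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

In this sketch only /a is printed before the abort; /b and /c never appear, 
even though nothing happened to them.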

Basically, this makes the {{ls -R}} command unusable in situations where files 
are changing.  From a user's perspective, this just translates to "{{ls -R}} is 
broken", since in practice you can't rely on it.

bq. If you are trying to make ls always forge ahead when it gets FNF while in a 
subdir, that has some peril associated with it. What if the item being listed 
isn't what was deleted? What if an ancestor directory was deleted? Should ls 
keep pounding on the NN to list every directory it thinks should be there? And 
then as it ascends back up the tree should keep trying to list other siblings 
it thinks should be there?

We're never going to be able to provide a 100% consistent view of the 
filesystem via {{ls -R}}.  HDFS simply doesn't have a way of getting back a 
snapshot of an entire subtree (well, except HDFS snapshots, which I think we can 
all agree are overkill here).  You are going to need multiple calls to 
{{listStatus}}, and things may change in between those calls.  These are just 
the facts of life, and something we have to accept.

After all, we can't even get back a 100% consistent view of a single large 
directory via {{listStatus}}.  Large directories need multiple {{listStatus}} 
RPC calls, and something may change in between those RPCs.  The {{/bin/ls}} 
command on UNIX has similar issues.  But clearly, despite the lack of snapshot 
consistency, people do find {{ls}} to be a useful command.

Unless I'm missing something, there is no major harm if we just forge ahead 
and try to call {{listStatus}} on subdirectories we retrieved from the previous 
{{listStatus}} call.  The worst that can happen is we try to list something 
that isn't there and get an FNF, which we ignore.  We could also print out the 
FNFs, but I'm not sure what the user would do with this information.

bq. This is not an acceptable patch. It's not ok to swallow the FNF and return 
a success exit code.

We still throw an FNF if the directory where {{ls -R}} starts doesn't exist.  
It's just that we don't shut down the whole enterprise if something underneath 
that directory changes during our recursion.  Does that make sense?
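
A minimal sketch of that behavior, assuming the same kind of hypothetical 
in-memory stand-in for {{listStatus}} (none of these names come from the 
attached patches, and the real change in the shell code may be structured 
differently): an FNF on the starting directory propagates, while an FNF on 
anything discovered during the recursion is skipped.

```java
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TolerantLsDemo {
    // Hypothetical in-memory namespace: directory path -> child directory names.
    public static final Map<String, List<String>> dirs = new TreeMap<>();

    // Stand-in for listStatus: fails if the directory no longer exists.
    public static List<String> listStatus(String path) throws FileNotFoundException {
        List<String> children = dirs.get(path);
        if (children == null) {
            throw new FileNotFoundException("File " + path + " does not exist.");
        }
        List<String> result = new ArrayList<>();
        for (String name : children) {
            result.add(path.equals("/") ? "/" + name : path + "/" + name);
        }
        return result;
    }

    // The starting directory must exist: an FNF here reaches the caller.
    public static List<String> ls(String root, Runnable afterPrint)
            throws FileNotFoundException {
        List<String> out = new ArrayList<>();
        recurse(listStatus(root), out, afterPrint);
        return out;
    }

    // Below the root, an entry that vanished after its parent was listed
    // is skipped instead of aborting the whole listing.
    private static void recurse(List<String> entries, List<String> out,
                                Runnable afterPrint) {
        for (String entry : entries) {
            out.add(entry);
            afterPrint.run(); // simulated concurrent change
            List<String> children;
            try {
                children = listStatus(entry);
            } catch (FileNotFoundException e) {
                continue; // entry is gone; move on to its siblings
            }
            recurse(children, out, afterPrint);
        }
    }

    public static List<String> run() {
        dirs.clear();
        dirs.put("/", List.of("a", "b", "c"));
        dirs.put("/a", List.of());
        dirs.put("/b", List.of("y"));
        dirs.put("/b/y", List.of());
        dirs.put("/c", List.of());
        try {
            // /a is removed right after its line has been printed.
            return ls("/", () -> dirs.remove("/a"));
        } catch (FileNotFoundException e) {
            throw new IllegalStateException("root unexpectedly missing", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints [/a, /b, /b/y, /c]
    }
}
```

The sketch deliberately leaves open whether a skipped entry should be 
reported or reflected in the exit code; it only shows that the listing of the 
remaining siblings can continue.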

> race condition crashes "hadoop ls -R" when directories are moved/removed
> ------------------------------------------------------------------------
>
>                 Key: HDFS-5546
>                 URL: https://issues.apache.org/jira/browse/HDFS-5546
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.2.0
>            Reporter: Colin Patrick McCabe
>            Assignee: Lei (Eddy) Xu
>            Priority: Minor
>             Fix For: 3.0.0
>
>         Attachments: HDFS-5546.1.patch, HDFS-5546.2.000.patch, 
> HDFS-5546.2.001.patch, HDFS-5546.2.002.patch
>
>
> This seems to be a rare race condition where we have a sequence of events 
> like this:
> 1. org.apache.hadoop.fs.shell.Ls calls DFS#getFileStatus on directory D.
> 2. someone deletes or moves directory D
> 3. org.apache.hadoop.fs.shell.Ls calls PathData#getDirectoryContents(D), 
> which calls DFS#listStatus(D). This throws FileNotFoundException.
> 4. ls command terminates with FNF



--
This message was sent by Atlassian JIRA
(v6.2#6252)
