[ https://issues.apache.org/jira/browse/HDFS-5546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034712#comment-14034712 ]
Colin Patrick McCabe commented on HDFS-5546:
--------------------------------------------

bq. Maybe I'm misunderstanding the description. Is this jira only trying to address a tiny race if the path existed when the command started, but disappeared before being listed? If yes, then FNF is exactly the correct behavior. After that, the stats you see being checked in the code are supposed to be from listStatus.

The problem is that right now, we have a race between getting back a directory entry from {{listStatus}} on the parent directory, and calling {{listStatus}} on that entry. Think of the following interleaving:

1. Eddy issues "hadoop fs -ls -R /"
2. ls command calls {{listStatus("/")}} and gets back status_a, status_b, status_c
3. ls command uses status_a to print out a line describing /a
4. Colin removes directory a
5. ls command calls {{listStatus("/a")}}
6. {{FileNotFoundException}} aborts the whole ls command. Nothing else is printed.

Basically, this makes the {{ls -R}} command unusable in situations where files are changing. From a user's perspective, this just translates to "{{ls -R}} is broken" since you effectively can't use it.

bq. If you are trying to make ls always forge ahead when it gets FNF while in a subdir, that has some peril associated with it. What if the item being listed isn't what was deleted? What if an ancestor directory was deleted? Should ls keep pounding on the NN to list every directory it thinks should be there? And then as it ascends back up the tree should keep trying to list other siblings it thinks should be there?

We're never going to be able to provide a 100% consistent view of the filesystem via {{ls -R}}. HDFS simply doesn't have a way of getting back a snapshot of an entire subtree (well, except HDFS snapshots, which I think we can all agree are overkill here). You are going to need multiple calls to {{listDir}}, and things may change in between those calls. These are just the facts of life, and something we have to accept.
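The interleaving above can be reproduced with a small, self-contained sketch. It does not use the Hadoop API: {{RecursiveLsSketch}}, its in-memory directory map, and the {{deleteOnceParentIsListed}} hook are all hypothetical stand-ins, and only {{java.io.FileNotFoundException}} is a real class. The {{tolerant}} flag is likewise an assumption for illustration; it switches between aborting on a vanished subdirectory and skipping it, while a missing starting directory fails in both modes.

```java
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory model of a directory tree; NOT the Hadoop FileSystem API.
class RecursiveLsSketch {
    // Directory path -> names of its child directories (files are left out for brevity).
    private final Map<String, List<String>> dirs = new TreeMap<>();
    // Directory to remove right after its parent has been listed (simulated concurrent delete).
    private String deleteAfterParentListed;

    void mkdir(String path, String... children) {
        dirs.put(path, new ArrayList<>(Arrays.asList(children)));
    }

    void deleteOnceParentIsListed(String path) {
        deleteAfterParentListed = path;
    }

    // Mimics a listStatus-style call: throws if the directory does not exist (any more).
    List<String> listStatus(String path) throws FileNotFoundException {
        List<String> children = dirs.get(path);
        if (children == null) {
            throw new FileNotFoundException(path);
        }
        List<String> result = new ArrayList<>();
        for (String child : children) {
            result.add(path.equals("/") ? "/" + child : path + "/" + child);
        }
        // Simulate "Colin removes directory a" between the parent listing and the recursion.
        if (deleteAfterParentListed != null && result.contains(deleteAfterParentListed)) {
            dirs.remove(deleteAfterParentListed);
            deleteAfterParentListed = null;
        }
        return result;
    }

    // Recursive listing. A missing starting directory is an error in both modes.
    List<String> lsR(String root, boolean tolerant) throws FileNotFoundException {
        List<String> out = new ArrayList<>();
        for (String child : listStatus(root)) {
            out.add(child);
            descend(child, tolerant, out);
        }
        return out;
    }

    private void descend(String dir, boolean tolerant, List<String> out)
            throws FileNotFoundException {
        List<String> children;
        try {
            children = listStatus(dir);
        } catch (FileNotFoundException e) {
            if (!tolerant) {
                throw e; // strict mode: one vanished subdirectory aborts the whole listing
            }
            return;      // tolerant mode: skip the vanished subdirectory and keep going
        }
        for (String child : children) {
            out.add(child);
            descend(child, tolerant, out);
        }
    }

    public static void main(String[] args) throws FileNotFoundException {
        RecursiveLsSketch fs = new RecursiveLsSketch();
        fs.mkdir("/", "a", "b", "c");
        fs.mkdir("/a");
        fs.mkdir("/b");
        fs.mkdir("/c");
        fs.deleteOnceParentIsListed("/a");
        // Tolerant mode still reports /a (it was seen in the parent listing), then /b and /c.
        System.out.println(fs.lsR("/", true));
    }
}
```

With the same setup and {{tolerant}} set to false, the sketch throws {{FileNotFoundException}} as soon as it descends into /a, which corresponds to step 6 of the interleaving.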
After all, we can't even get back a 100% consistent view of a single large directory via {{listStatus}}. Large directories need multiple {{listStatus}} RPC calls, and something may change between those RPCs. The {{/bin/ls}} command on UNIX has similar issues. But clearly, despite the lack of snapshot consistency, people do find {{ls}} to be a useful command.

Unless I'm missing something, there is no major harm if we just forge ahead and try to call {{listStatus}} on subdirectories we retrieved from the previous {{listStatus}} call. The worst that can happen is that we try to list something that isn't there and get an FNF, which we ignore. We could also print out the FNFs, but I'm not sure what the user would do with this information.

bq. This is not an acceptable patch. It's not ok to swallow the FNF and return a success exit code.

We still throw an FNF if the directory where {{ls -R}} starts doesn't exist. It's just that we don't shut down the whole enterprise if something underneath that directory changes during our recursion. Does that make sense?

> race condition crashes "hadoop ls -R" when directories are moved/removed
> ------------------------------------------------------------------------
>
>                 Key: HDFS-5546
>                 URL: https://issues.apache.org/jira/browse/HDFS-5546
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.2.0
>            Reporter: Colin Patrick McCabe
>            Assignee: Lei (Eddy) Xu
>            Priority: Minor
>             Fix For: 3.0.0
>
>         Attachments: HDFS-5546.1.patch, HDFS-5546.2.000.patch,
> HDFS-5546.2.001.patch, HDFS-5546.2.002.patch
>
>
> This seems to be a rare race condition where we have a sequence of events
> like this:
> 1. org.apache.hadoop.shell.Ls calls DFS#getFileStatus on directory D.
> 2. someone deletes or moves directory D
> 3. org.apache.hadoop.shell.Ls calls PathData#getDirectoryContents(D), which
> calls DFS#listStatus(D). This throws FileNotFoundException.
> 4. ls command terminates with FNF

--
This message was sent by Atlassian JIRA
(v6.2#6252)