[jira] Commented: (HDFS-202) Add a bulk FIleSystem.getFileBlockLocations

Hairong Kuang (JIRA) Fri, 23 Jul 2010 12:13:18 -0700

    [ 
https://issues.apache.org/jira/browse/HDFS-202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891743#action_12891743
 ]


Hairong Kuang commented on HDFS-202:
------------------------------------

I am not sure what we should do if a child of the input directory is a symbolic 
link. Whether the symbolic link should be resolved is better left for 
applications to decide.

It seems cleaner to change the new API to listLocatedFileStatus(Path path), 
so that it does not traverse the subtree recursively and returns all the 
contents of the directory. BlockLocations are piggybacked if a child is a file. 
This design leaves questions such as how to handle a child that is a symbolic 
link or a directory to be answered by applications. 
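
The semantics proposed above can be sketched roughly as follows. This is not 
the attached patch and does not use the real Hadoop classes: ToyNamespace, 
LocatedEntry, Kind, and the host-name strings standing in for BlockLocations 
are all hypothetical names assumed for illustration. The listing stops at one 
level, attaches locations only to files, and hands symbolic links and 
directories back unresolved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch only: simplified stand-ins, NOT the Hadoop FileSystem classes.
class ListLocatedSketch {

    enum Kind { FILE, DIRECTORY, SYMLINK }

    /** One child of the listed directory; blockHosts is non-null only for files. */
    static final class LocatedEntry {
        final String path;
        final Kind kind;
        final List<String> blockHosts; // hypothetical stand-in for BlockLocation[]

        LocatedEntry(String path, Kind kind, List<String> blockHosts) {
            this.path = path;
            this.kind = kind;
            this.blockHosts = blockHosts;
        }
    }

    /** Toy in-memory namespace keyed by full path (assumed, for illustration). */
    static final class ToyNamespace {
        private final Map<String, Kind> kinds = new TreeMap<>();
        private final Map<String, List<String>> hosts = new TreeMap<>();

        void add(String path, Kind kind, String... blockHosts) {
            kinds.put(path, kind);
            if (kind == Kind.FILE) {
                hosts.put(path, new ArrayList<>(Arrays.asList(blockHosts)));
            }
        }

        /** Non-recursive listing; locations are piggybacked for files only. */
        List<LocatedEntry> listLocatedFileStatus(String dir) {
            String prefix = dir.endsWith("/") ? dir : dir + "/";
            List<LocatedEntry> out = new ArrayList<>();
            for (Map.Entry<String, Kind> e : kinds.entrySet()) {
                String p = e.getKey();
                if (!p.startsWith(prefix) || p.length() == prefix.length()) {
                    continue; // not under dir
                }
                if (p.indexOf('/', prefix.length()) >= 0) {
                    continue; // deeper than one level: no recursive traversal
                }
                // Symlinks and directories are returned as-is, without locations;
                // resolving or descending into them is left to the caller.
                List<String> locs = e.getValue() == Kind.FILE ? hosts.get(p) : null;
                out.add(new LocatedEntry(p, e.getValue(), locs));
            }
            return out;
        }
    }

    public static void main(String[] args) {
        ToyNamespace ns = new ToyNamespace();
        ns.add("/in/a.txt", Kind.FILE, "dn1", "dn2");
        ns.add("/in/link", Kind.SYMLINK);
        ns.add("/in/sub", Kind.DIRECTORY);
        ns.add("/in/sub/b.txt", Kind.FILE, "dn3");
        for (LocatedEntry entry : ns.listLocatedFileStatus("/in")) {
            System.out.println(entry.path + " " + entry.kind + " " + entry.blockHosts);
        }
    }
}
```

An application that wants the old recursive behaviour could call the listing 
again on each DIRECTORY entry, and one that wants to follow links could 
resolve each SYMLINK entry itself before asking for its locations.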

> Add a bulk FIleSystem.getFileBlockLocations
> -------------------------------------------
>
>                 Key: HDFS-202
>                 URL: https://issues.apache.org/jira/browse/HDFS-202
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>            Reporter: Arun C Murthy
>            Assignee: Hairong Kuang
>             Fix For: 0.22.0
>
>         Attachments: hdfsListFiles.patch
>
>
> Currently map-reduce applications (specifically file-based input-formats) use 
> FileSystem.getFileBlockLocations to compute splits. However they are forced 
> to call it once per file.
> The downsides are multiple:
>    # Even with a few thousand files to process the number of RPCs quickly 
> starts getting noticeable
>    # The current implementation of getFileBlockLocations is too slow since 
> each call results in a 'search' in the namesystem. With a few thousand 
> input files, it results in that many RPCs and 'searches'.
> It would be nice to have a FileSystem.getFileBlockLocations which can take in 
> a directory, and return the block-locations for all files in that directory. 
> We could eliminate both the per-file RPC and also the 'search' by a 'scan'.
> When I tested this for terasort, a moderate job with 8000 input files, the 
> runtime halved from the current 8s to 4s. Clearly this is much more important 
> for latency-sensitive applications...

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
