[ 
https://issues.apache.org/jira/browse/HDFS-1402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Burkhardt moved MAPREDUCE-1973 to HDFS-1402:
-------------------------------------------------

              Project: Hadoop HDFS  (was: Hadoop Map/Reduce)
                  Key: HDFS-1402  (was: MAPREDUCE-1973)
    Affects Version/s: 0.22.0
                           (was: 0.20.1)
                           (was: 0.20.2)

> Optimize input split creation
> -----------------------------
>
>                 Key: HDFS-1402
>                 URL: https://issues.apache.org/jira/browse/HDFS-1402
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>    Affects Versions: 0.22.0
>         Environment: Intel Nehalem cluster running Red Hat.
>            Reporter: Paul Burkhardt
>            Priority: Minor
>         Attachments: HADOOP-1973.patch
>
>
> The input split returns the locations that host the file blocks in the split. 
> The locations are determined by the getBlockLocations method of the 
> filesystem client, which requires a remote connection to the filesystem 
> (i.e., HDFS). A separate remote connection is made for each file in the input 
> split. For jobs with many input files, these network connections dominate the 
> cost of writing the input split file.
> A job requests a listing of the input files from the remote filesystem and 
> creates a FileStatus object as a handle for each file in the listing. The 
> FileStatus object can be populated with the necessary host information on the 
> remote end and passed to the client side in the bulk return of the listing 
> request. A getHosts method of the FileStatus would then return the locations 
> of the blocks comprising that file and eliminate the need for another trip 
> to the remote filesystem.
> The INodeFile maintains the blocks for a file and is an obvious choice to be 
> the originator of the locations for that file. It is also available to the 
> FSDirectory, which first creates the listing of FileStatus objects. We propose 
> that the INodeFile generate the block locations used to instantiate the 
> FileStatus object during the getListing request.
> Our tests demonstrated a factor of 2000 speedup for approximately 60,000 
> input files.
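
The round-trip argument in the description above can be illustrated with a
small stand-alone model. This is only a sketch and not Hadoop code: the names
SplitLocationSketch, FakeRemoteFs, FileEntry, listing, blockHosts,
perFileLookup and bulkLookup are all hypothetical, and each call on
FakeRemoteFs is assumed to cost one remote round trip. It compares fetching
hosts file by file against receiving them inside the listing, which is roughly
what a getHosts method on FileStatus would allow.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model, not Hadoop code: it only counts simulated remote calls.
class SplitLocationSketch {

    /** Stand-in for a FileStatus that may carry block hosts (the proposed getHosts). */
    static final class FileEntry {
        final String path;
        final List<String> hosts; // null when the listing did not include hosts

        FileEntry(String path, List<String> hosts) {
            this.path = path;
            this.hosts = hosts;
        }
    }

    /** Stand-in for the remote filesystem; each method call counts as one round trip. */
    static final class FakeRemoteFs {
        private final Map<String, List<String>> hostsByPath = new LinkedHashMap<>();
        int roundTrips = 0;

        void addFile(String path, List<String> hosts) {
            hostsByPath.put(path, hosts); // local setup, not counted as a round trip
        }

        /** One bulk listing request; hosts are attached only when withHosts is true. */
        List<FileEntry> listing(boolean withHosts) {
            roundTrips++;
            List<FileEntry> out = new ArrayList<>();
            for (Map.Entry<String, List<String>> e : hostsByPath.entrySet()) {
                out.add(new FileEntry(e.getKey(), withHosts ? e.getValue() : null));
            }
            return out;
        }

        /** One per-file location request, analogous to the lookup described above. */
        List<String> blockHosts(String path) {
            roundTrips++;
            return hostsByPath.get(path);
        }
    }

    /** Current behaviour as described: one listing plus one request per file. */
    static Map<String, List<String>> perFileLookup(FakeRemoteFs fs) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (FileEntry f : fs.listing(false)) {
            result.put(f.path, fs.blockHosts(f.path));
        }
        return result;
    }

    /** Proposed behaviour: hosts arrive with the listing, so no further requests. */
    static Map<String, List<String>> bulkLookup(FakeRemoteFs fs) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (FileEntry f : fs.listing(true)) {
            result.put(f.path, f.hosts);
        }
        return result;
    }

    /** Builds a fake filesystem with the given number of files and made-up hosts. */
    static FakeRemoteFs sampleFs(int files) {
        FakeRemoteFs fs = new FakeRemoteFs();
        for (int i = 0; i < files; i++) {
            fs.addFile("/input/part-" + i,
                       Arrays.asList("host" + (i % 3), "host" + ((i + 1) % 3)));
        }
        return fs;
    }

    public static void main(String[] args) {
        FakeRemoteFs a = sampleFs(1000);
        perFileLookup(a);
        FakeRemoteFs b = sampleFs(1000);
        bulkLookup(b);
        System.out.println("per-file round trips: " + a.roundTrips); // 1001
        System.out.println("bulk round trips: " + b.roundTrips);     // 1
    }
}
```

In this model the per-file approach makes one request per input file on top of
the listing, while the bulk approach makes a single request regardless of the
file count. The model says nothing about actual timings, which depend on the
cluster and on how the listing is served.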

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
