[ https://issues.apache.org/jira/browse/HDFS-1402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Paul Burkhardt updated HDFS-1402:
---------------------------------

    Attachment: HDFS-1402.patch
                HDFS-1402.common.patch

Patched against the trunk.

> Optimize input split creation
> -----------------------------
>
>                 Key: HDFS-1402
>                 URL: https://issues.apache.org/jira/browse/HDFS-1402
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>    Affects Versions: 0.22.0
>         Environment: Intel Nehalem cluster running Red Hat.
>            Reporter: Paul Burkhardt
>            Priority: Minor
>         Attachments: HADOOP-1973.patch, HDFS-1402.common.patch,
> HDFS-1402.patch
>
>
> The input split returns the locations that host the file blocks in the split.
> The locations are determined by the getBlockLocations method of the
> filesystem client which requires a remote connection to the filesystem (i.e.
> HDFS). The remote connection is made for each file in the entire input split.
> For jobs with many input files the network connections dominate the cost of
> writing the input split file.
> A job requests a listing of the input files from the remote filesystem and
> creates a FileStatus object as a handle for each file in the listing. The
> FileStatus object can be imbued with the necessary host information on the
> remote end and passed to the client-side in the bulk return of the listing
> request. A getHosts method of the FileStatus would then return the locations
> for the blocks comprising that file and eliminate the need for another trip
> to the remote filesystem.
> The INodeFile maintains the blocks for a file and is an obvious choice to be
> the originator for the locations of that file. It is also available to the
> FSDirectory which first creates the listing of FileStatus objects. We propose
> that the block locations be generated by the INodeFile to instantiate the
> FileStatus object during the getListing request.
> Our tests demonstrated a factor of 2000 speedup for approximately 60,000
> input files.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
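
For readers following the issue, the round-trip argument in the description can be illustrated with a small, self-contained sketch. It is not taken from the attached patches and does not use the Hadoop API: ListedFile, FakeRemoteFs, listWithHosts and sampleFs are hypothetical stand-ins invented for this illustration, and only the getHosts name is borrowed from the proposal. It counts simulated remote calls when hosts are fetched once per file versus when they arrive with the listing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins only; none of these types are the real Hadoop classes.
class SplitLocationSketch {

    /** Listing entry that already carries its block hosts (the proposed getHosts idea). */
    static final class ListedFile {
        private final String path;
        private final List<String> hosts;

        ListedFile(String path, List<String> hosts) {
            this.path = path;
            this.hosts = hosts;
        }

        String getPath() { return path; }

        // Answered from local state: no call back to the remote filesystem.
        List<String> getHosts() { return hosts; }
    }

    /** Toy remote filesystem that counts how many round trips a client makes. */
    static final class FakeRemoteFs {
        private final Map<String, List<String>> hostsByPath = new LinkedHashMap<>();
        int roundTrips = 0;

        void addFile(String path, String... hosts) {
            hostsByPath.put(path, Arrays.asList(hosts));
        }

        // Behaviour described in the issue: the listing returns file handles only.
        List<String> listPaths() {
            roundTrips++;
            return new ArrayList<>(hostsByPath.keySet());
        }

        // Behaviour described in the issue: one remote call per file for its hosts.
        List<String> getBlockHosts(String path) {
            roundTrips++;
            return hostsByPath.get(path);
        }

        // Proposed behaviour: hosts are attached on the remote side during the listing.
        List<ListedFile> listWithHosts() {
            roundTrips++;
            List<ListedFile> out = new ArrayList<>();
            for (Map.Entry<String, List<String>> e : hostsByPath.entrySet()) {
                out.add(new ListedFile(e.getKey(), e.getValue()));
            }
            return out;
        }
    }

    /** Round trips used when hosts are looked up once per listed file. */
    static int perFileRoundTrips(FakeRemoteFs fs) {
        int before = fs.roundTrips;
        for (String path : fs.listPaths()) {
            fs.getBlockHosts(path);
        }
        return fs.roundTrips - before;
    }

    /** Round trips used when the listing already carries the hosts. */
    static int bulkRoundTrips(FakeRemoteFs fs) {
        int before = fs.roundTrips;
        for (ListedFile file : fs.listWithHosts()) {
            file.getHosts();
        }
        return fs.roundTrips - before;
    }

    /** Builds a toy filesystem with n files, each hosted on two made-up hosts. */
    static FakeRemoteFs sampleFs(int n) {
        FakeRemoteFs fs = new FakeRemoteFs();
        for (int i = 0; i < n; i++) {
            fs.addFile("/input/part-" + i, "host" + (i % 3), "host" + ((i + 1) % 3));
        }
        return fs;
    }

    public static void main(String[] args) {
        FakeRemoteFs fs = sampleFs(1000);
        System.out.println("per-file round trips: " + perFileRoundTrips(fs));
        System.out.println("bulk round trips:     " + bulkRoundTrips(fs));
    }
}
```

In this toy model a job over n files makes n + 1 simulated calls on the per-file path and a single call on the bulk path. The sketch only counts calls; it does not model network latency, so it says nothing about the measured speedup reported in the issue.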