Sumit Kumar created HADOOP-10634:
------------------------------------

             Summary: Add recursive list apis to FileSystem to give 
implementations an opportunity for optimization
                 Key: HADOOP-10634
                 URL: https://issues.apache.org/jira/browse/HADOOP-10634
             Project: Hadoop Common
          Issue Type: Improvement
          Components: fs/s3
            Reporter: Sumit Kumar
             Fix For: 2.4.0


Several code paths in Hadoop currently use recursive listing to discover the 
files and folders under a given path. For example, FileInputFormat (both the 
mapreduce and mapred implementations) does this while calculating splits. 
However, they list level by level: to discover files under /foo/bar, they 
first list /foo/bar to get its immediate children, then make the same call on 
each of those children to discover their immediate children, and so on. This 
doesn't scale well for fs implementations like s3, because every listStatus 
call ends up being a webservice call to s3. When a large number of files are 
considered for input, this makes the getSplits() call slow. 

This patch adds a new set of recursive list APIs that give the s3 fs 
implementation an opportunity to optimize. The behavior remains the same for 
other implementations: a default implementation is provided, so they don't 
have to implement anything new. For s3, however, a simple change (as shown in 
the patch) improves listing performance.
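
To illustrate the idea (this is not the code from the patch), here is a 
minimal self-contained sketch. Everything in it is an assumption made for 
illustration: SimpleFs, FlatKeyStoreFs, OptimizedFlatKeyStoreFs and 
listFilesRecursive are hypothetical names, not the Hadoop FileSystem API, and 
the in-memory key set only stands in for s3, with a counter in place of real 
webservice calls. Directory paths are assumed to end with "/", and result 
paging is ignored.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

public class RecursiveListingSketch {

    /** Hypothetical minimal fs interface; NOT the real Hadoop FileSystem API. */
    interface SimpleFs {
        /** Immediate children of dir (dir must end with "/"); sub-directories end with "/". */
        List<String> listStatus(String dir);

        /** Default recursive listing: level by level, one listStatus call per directory. */
        default List<String> listFilesRecursive(String dir) {
            List<String> files = new ArrayList<>();
            for (String child : listStatus(dir)) {
                if (child.endsWith("/")) {
                    files.addAll(listFilesRecursive(child));
                } else {
                    files.add(child);
                }
            }
            return files;
        }
    }

    /** Toy flat key store standing in for s3; remoteCalls counts simulated webservice calls. */
    static class FlatKeyStoreFs implements SimpleFs {
        protected final TreeSet<String> keys = new TreeSet<>();
        int remoteCalls = 0;

        void put(String key) {
            keys.add(key);
        }

        @Override
        public List<String> listStatus(String dir) {
            remoteCalls++;
            TreeSet<String> children = new TreeSet<>();
            for (String key : keys.tailSet(dir)) {
                if (!key.startsWith(dir)) break;   // left the prefix range
                String rest = key.substring(dir.length());
                if (rest.isEmpty()) continue;      // skip a marker key equal to dir itself
                int slash = rest.indexOf('/');
                // No slash: a file directly under dir. Otherwise: the first-level sub-directory.
                children.add(slash < 0 ? key : dir + rest.substring(0, slash + 1));
            }
            return new ArrayList<>(children);
        }
    }

    /** Overrides the recursive listing with a single prefix scan over the flat keys. */
    static class OptimizedFlatKeyStoreFs extends FlatKeyStoreFs {
        @Override
        public List<String> listFilesRecursive(String dir) {
            remoteCalls++;                         // one simulated call for the whole subtree
            List<String> files = new ArrayList<>();
            for (String key : keys.tailSet(dir)) {
                if (!key.startsWith(dir)) break;
                if (!key.endsWith("/")) files.add(key);
            }
            return files;
        }
    }

    /** Fills a store with a small sample tree under /foo/. */
    static <T extends FlatKeyStoreFs> T populate(T fs) {
        fs.put("/foo/bar/a.txt");
        fs.put("/foo/bar/x/b.txt");
        fs.put("/foo/bar/x/y/c.txt");
        fs.put("/foo/baz/d.txt");
        return fs;
    }

    public static void main(String[] args) {
        FlatKeyStoreFs levelByLevel = populate(new FlatKeyStoreFs());
        FlatKeyStoreFs optimized = populate(new OptimizedFlatKeyStoreFs());
        System.out.println(levelByLevel.listFilesRecursive("/foo/bar/")
                + " calls=" + levelByLevel.remoteCalls);
        System.out.println(optimized.listFilesRecursive("/foo/bar/")
                + " calls=" + optimized.remoteCalls);
    }
}
```

In this sketch the default implementation issues one simulated call per 
directory visited, while the override issues a single call for the same 
result. Callers such as a split calculation would only invoke 
listFilesRecursive and would not need to know which variant they got.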



--
This message was sent by Atlassian JIRA
(v6.2#6252)
