Zoran Dimitrijevic created HADOOP-11785:
-------------------------------------------

             Summary: Reduce number of listStatus operation in distcp buildListing()
                 Key: HADOOP-11785
                 URL: https://issues.apache.org/jira/browse/HADOOP-11785
             Project: Hadoop Common
          Issue Type: Improvement
          Components: tools/distcp
    Affects Versions: 3.0.0
            Reporter: Zoran Dimitrijevic
            Assignee: Zoran Dimitrijevic
            Priority: Minor
             Fix For: 3.0.0
         Attachments: distcp-liststatus.patch

DistCp was taking a long time in copyListing.buildListing() for large source 
trees (my source had 1.5M files in a tree of about 50K directories). For 
input on S3, buildListing() took more than one hour. I found a performance 
bug in the current code: it calls listStatus twice for each directory, which 
doubles the number of RPCs in some cases (when most directories contain no 
more than 1000 files).
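
The doubling described above can be illustrated with a small simulation. This 
is a sketch in Python, not the DistCp code: CountingFS, build_listing_double 
and build_listing_single are hypothetical names, and using the first listing 
as an emptiness check is only an assumed way of ending up with two listStatus 
calls per directory.

```python
class CountingFS:
    """Hypothetical in-memory file system that counts list_status calls."""

    def __init__(self, tree):
        # tree maps a directory path to a list of (child_name, is_dir) pairs.
        self.tree = tree
        self.calls = 0

    def list_status(self, path):
        # Each call stands in for one listStatus RPC.
        self.calls += 1
        return [(f"{path}/{name}", is_dir) for name, is_dir in self.tree[path]]


def build_listing_double(fs, root):
    """List each directory once to check it is non-empty, then list it again."""
    files, stack = [], [root]
    while stack:
        path = stack.pop()
        if not fs.list_status(path):  # first listing: emptiness check only
            continue
        for child, is_dir in fs.list_status(path):  # second listing: same data
            if is_dir:
                stack.append(child)
            else:
                files.append(child)
    return sorted(files)


def build_listing_single(fs, root):
    """List each directory once and reuse the result for the traversal."""
    files, stack = [], [root]
    while stack:
        path = stack.pop()
        # An empty result simply yields no children, so no separate check is needed.
        for child, is_dir in fs.list_status(path):
            if is_dir:
                stack.append(child)
            else:
                files.append(child)
    return sorted(files)
```

On a tree of three non-empty directories, the first traversal makes 6 listing 
calls and the second makes 3, and both return the same file list.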

 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
