[ 
https://issues.apache.org/jira/browse/HADOOP-11827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14491170#comment-14491170
 ] 

Zoran Dimitrijevic commented on HADOOP-11827:
---------------------------------------------

Performance results and charts for the dataset I used (1.5M files and 
approximately 50K dirs): 

https://docs.google.com/spreadsheets/d/1qJfO9ZhPXuGCpHyfX1NLE0Zm_NB39gn-cELECShd_zk/edit#gid=0

Please note that there are two sheets (s3n -> hdfs and hdfs -> hdfs). The main 
improvement is when the source is in s3. The improvement when the source is 
hdfs is good as well, but since current distcp has to sort the input file, the 
total improvement is not as significant. 

TODO: We can sort only directories, which would further improve startup time.
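
For readers who want a feel for the general idea, here is a minimal sketch of 
listing a directory tree with a thread pool. It is not the attached patch: it 
uses java.nio.file on a local tree instead of the Hadoop FileSystem API, and 
the names ParallelListing, buildListing and numThreads are hypothetical, chosen 
only for illustration. Each directory is listed by its own task, so slow list 
calls (as with s3) can overlap instead of running one after another. The final 
sort is there only to make the output order deterministic.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

// Illustrative sketch only; class and method names are hypothetical.
public class ParallelListing {

    // Shared state for one listing run.
    private static final class State {
        final ExecutorService pool;
        final Queue<Path> results = new ConcurrentLinkedQueue<>();
        final AtomicInteger pending = new AtomicInteger(0);   // directories not yet fully listed
        final CountDownLatch done = new CountDownLatch(1);    // released when pending drops to 0
        final AtomicReference<IOException> error = new AtomicReference<>();

        State(int numThreads) {
            pool = Executors.newFixedThreadPool(numThreads);
        }
    }

    // Returns every file and directory below root (root itself excluded), sorted.
    public static List<Path> buildListing(Path root, int numThreads)
            throws IOException, InterruptedException {
        State state = new State(numThreads);
        try {
            submit(root, state);
            state.done.await();
        } finally {
            state.pool.shutdown();
        }
        if (state.error.get() != null) {
            throw state.error.get();
        }
        List<Path> sorted = new ArrayList<>(state.results);
        Collections.sort(sorted);
        return sorted;
    }

    // Schedules one task that lists a single directory and schedules its subdirectories.
    private static void submit(Path dir, State state) {
        // Count the directory before its task runs, so the counter cannot
        // reach zero while child directories are still waiting to be listed.
        state.pending.incrementAndGet();
        state.pool.execute(() -> {
            try (DirectoryStream<Path> children = Files.newDirectoryStream(dir)) {
                for (Path child : children) {
                    state.results.add(child);
                    // Do not follow symlinks, to avoid cycles in the tree.
                    if (Files.isDirectory(child, LinkOption.NOFOLLOW_LINKS)) {
                        submit(child, state);
                    }
                }
            } catch (IOException e) {
                state.error.compareAndSet(null, e);  // remember the first failure
            } finally {
                if (state.pending.decrementAndGet() == 0) {
                    state.done.countDown();
                }
            }
        });
    }
}
```

Calling ParallelListing.buildListing(root, 4) would list the tree with four 
worker threads. The pool size is an arbitrary example value, not a number 
taken from the measurements in the spreadsheet.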

> Speed-up distcp buildListing() using threadpool
> -----------------------------------------------
>
>                 Key: HADOOP-11827
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11827
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: tools/distcp
>    Affects Versions: 3.0.0
>            Reporter: Zoran Dimitrijevic
>            Assignee: Zoran Dimitrijevic
>            Priority: Minor
>         Attachments: HADOOP-11827.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> For very large source trees on s3, distcp takes a long time to build the file 
> listing (client code, before starting mappers). For a dataset I used (1.5M 
> files, 50K dirs) it took 65 minutes before my fix in HADOOP-11785 and 
> 36 minutes after the fix.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
