Zoran Dimitrijevic created HADOOP-11785:
-------------------------------------------

             Summary: Reduce number of listStatus operation in distcp buildListing()
                 Key: HADOOP-11785
                 URL: https://issues.apache.org/jira/browse/HADOOP-11785
             Project: Hadoop Common
          Issue Type: Improvement
          Components: tools/distcp
    Affects Versions: 3.0.0
            Reporter: Zoran Dimitrijevic
            Assignee: Zoran Dimitrijevic
            Priority: Minor
             Fix For: 3.0.0
         Attachments: distcp-liststatus.patch

Distcp was taking a long time in copyListing.buildListing() for large source trees (I was using a source of 1.5M files in a tree of about 50K directories). For input on S3, buildListing was taking more than one hour. I've noticed a performance bug in the current code: it calls listStatus twice for each directory, which doubles the number of RPCs in some cases (if most directories do not contain >1000 files).
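
To illustrate the kind of double listing described above, here is a minimal sketch. It is not the attached patch and does not use the Hadoop API: `CountingFs`, `walkTwice`, `walkOnce` and `sampleTree` are hypothetical names, and the in-memory file system only counts how often a directory is listed. The assumption is that the slow path lists a directory once to check whether it is empty and then again when descending into it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ListingSketch {
    /** Hypothetical stand-in for a remote file system; counts list calls. */
    public static final class CountingFs {
        private final Map<String, List<String>> children = new HashMap<>();
        public int listCalls = 0;

        public void addDir(String dir, String... names) {
            List<String> kids = new ArrayList<>();
            for (String n : names) {
                kids.add(dir + "/" + n);
            }
            children.put(dir, kids);
        }

        /** Local lookup, not counted (a real status object already carries this). */
        public boolean isDirectory(String path) {
            return children.containsKey(path);
        }

        /** Each call models one listing RPC. */
        public List<String> list(String dir) {
            listCalls++;
            return children.get(dir);
        }
    }

    /** Lists a child once for the emptiness check, then again when descending. */
    public static void walkTwice(CountingFs fs, String dir, List<String> out) {
        for (String child : fs.list(dir)) {
            out.add(child);
            if (fs.isDirectory(child) && !fs.list(child).isEmpty()) {
                walkTwice(fs, child, out);
            }
        }
    }

    /** Lists every directory exactly once by reusing the fetched listing. */
    public static void walkOnce(CountingFs fs, String dir, List<String> out) {
        walkListed(fs, fs.list(dir), out);
    }

    private static void walkListed(CountingFs fs, List<String> entries, List<String> out) {
        for (String child : entries) {
            out.add(child);
            if (fs.isDirectory(child)) {
                List<String> grandChildren = fs.list(child);
                if (!grandChildren.isEmpty()) {
                    walkListed(fs, grandChildren, out);
                }
            }
        }
    }

    /** Small made-up tree: four directories, four files. */
    public static CountingFs sampleTree() {
        CountingFs fs = new CountingFs();
        fs.addDir("/src", "a", "b", "f0");
        fs.addDir("/src/a", "f1", "f2");
        fs.addDir("/src/b", "c");
        fs.addDir("/src/b/c", "f3");
        return fs;
    }

    public static void main(String[] args) {
        CountingFs slow = sampleTree();
        walkTwice(slow, "/src", new ArrayList<>());
        CountingFs fast = sampleTree();
        walkOnce(fast, "/src", new ArrayList<>());
        System.out.println("walkTwice list calls: " + slow.listCalls);
        System.out.println("walkOnce list calls:  " + fast.listCalls);
    }
}
```

Both walks produce the same listing; only the number of list calls differs. In this sketch every non-empty directory below the root is listed twice by `walkTwice`, which matches the "doubles the number of RPCs" observation when most directories are non-empty.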