[ https://issues.apache.org/jira/browse/MAPREDUCE-2149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13079165#comment-13079165 ]
Mithun Radhakrishnan commented on MAPREDUCE-2149:
-------------------------------------------------

https://issues.apache.org/jira/browse/MAPREDUCE-2765

This rewrite does attempt to address setup times (as well as copy performance).

> Distcp : setup with update is too slow when latency is high
> -----------------------------------------------------------
>
>                 Key: MAPREDUCE-2149
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2149
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: distcp
>    Affects Versions: 0.20.2, 0.21.0
>            Reporter: Raghu Angadi
>            Assignee: Raghu Angadi
>         Attachments: MAPREDUCE-2149.patch
>
>
> If you run distcp with the '-update' option, then for _each of the files_
> present on the source cluster, setup invokes a separate RPC to the
> destination cluster to fetch file info.
> Usually this overhead is not very noticeable when both clusters are
> geographically close to each other. But if the latency is large, setup could
> take a couple of orders of magnitude longer.
> E.g.: the source has 10k directories, each with about 10 files, and the
> round-trip latency between source and destination is 75 ms (typical for
> coast-to-coast clusters).
> If we run distcp on the source cluster, setup would take about _2.5 hours_,
> irrespective of whether the destination has these files or not. '-lsr' on the
> same dest dir from the source cluster would take up to 12 min (depending on
> how many directories already exist on dest).
> * A fairly simple fix to how setup() iterates should bring the setup time
> down to the same as '-lsr'. I will have a patch for this (though 12 min is
> still too large).
> * A more scalable option is to defer the update check to the mappers.
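
As a rough check on the figures quoted in the issue description, the sketch below models setup as a series of blocking round trips and compares one RPC per file against one listing RPC per directory. The directory count, files per directory, and 75 ms latency come from the description; the name `setup_seconds` and the assumption of exactly one unpipelined round trip per RPC are illustrative only and are not taken from the DistCp code.

```python
# Back-of-the-envelope model of distcp '-update' setup cost.
# Assumption (not from DistCp itself): every RPC blocks for one full
# round trip, with no pipelining, batching, or server-side processing time.

RTT_SECONDS = 0.075    # 75 ms round trip, from the issue description
NUM_DIRS = 10_000      # source directories, from the issue description
FILES_PER_DIR = 10     # approximate files per directory, from the description


def setup_seconds(num_rpcs: int, rtt: float = RTT_SECONDS) -> float:
    """Total wall-clock time if each RPC waits for a full round trip."""
    return num_rpcs * rtt


# Per-file strategy: one file-info RPC for every file on the source.
per_file_rpcs = NUM_DIRS * FILES_PER_DIR
# Per-directory strategy: one listing RPC per directory, as '-lsr' would do.
per_dir_rpcs = NUM_DIRS

per_file_hours = setup_seconds(per_file_rpcs) / 3600
per_dir_minutes = setup_seconds(per_dir_rpcs) / 60

print(f"per-file lookups:       {per_file_rpcs} RPCs, {per_file_hours:.2f} h")
print(f"per-directory listings: {per_dir_rpcs} RPCs, {per_dir_minutes:.1f} min")
```

Under these assumptions the per-file strategy costs about 2.1 hours and the per-directory strategy about 12.5 minutes, which is in line with the roughly 2.5 hours and 12 minutes quoted above. The reported setup time is somewhat higher than this model predicts, presumably because setup issues more calls than the simple per-file count assumed here.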