[ https://issues.apache.org/jira/browse/MAPREDUCE-2765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161003#comment-13161003 ]
Mithun Radhakrishnan commented on MAPREDUCE-2765: ------------------------------------------------- Sorry I missed this before, but it looks like this bug has been made a blocker for 6448. The intention was for the new DistCp to replace the original one, but not in the immediate time-frame. (It'll take time to port this not to use FileSystem.) Which is also why old-distcp was left untouched in this patch. I'd appreciate comment/advice on whether this patch fits the bill, otherwise. > DistCp Rewrite > -------------- > > Key: MAPREDUCE-2765 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-2765 > Project: Hadoop Map/Reduce > Issue Type: New Feature > Components: distcp, mrv2 > Affects Versions: 0.20.203.0 > Reporter: Mithun Radhakrishnan > Assignee: Mithun Radhakrishnan > Priority: Critical > Attachments: distcpv2.20.203.patch, distcpv2_hadoop-0.23.1.patch, > distcpv2_hadoop-0.23.1.patch, distcpv2_hadoop-trunk.patch, > distcpv2_trunk.patch, distcpv2_trunk_post_review_1.patch > > > This is a slightly modified version of the DistCp rewrite that Yahoo uses in > production today. The rewrite was ground-up, with specific focus on: > 1. improved startup time (postponing as much work as possible to the MR job) > 2. support for multiple copy-strategies > 3. new features (e.g. -atomic, -async, -bandwidth.) > 4. improved programmatic use > Some effort has gone into refactoring what used to be achieved by a single > large (1.7 KLOC) source file, into a design that (hopefully) reads better too. > The proposed DistCpV2 preserves command-line-compatibility with the old > version, and should be a drop-in replacement. > New to v2: > 1. Copy-strategies and the DynamicInputFormat: > A copy-strategy determines the policy by which source-file-paths are > distributed between map-tasks. (These boil down to the choice of the > input-format.) > If no strategy is explicitly specified on the command-line, the policy > chosen is "uniform size", where v2 behaves identically to old-DistCp. (The > number of bytes transferred by each map-task is roughly equal, at a per-file > granularity.) > Alternatively, v2 ships with a "dynamic" copy-strategy (in the > DynamicInputFormat). This policy acknowledges that > (a) dividing files based only on file-size might not be an > even distribution (E.g. if some datanodes are slower than others, or if some > files are skipped.) > (b) a "static" association of a source-path to a map increases > the likelihood of long-tails during copy. > The "dynamic" strategy divides the list-of-source-paths into a number > (> nMaps) of smaller parts. When each map completes its current list of > paths, it picks up a new list to process, if available. So if a map-task is > stuck on a slow (and not necessarily large) file, other maps can pick up the > slack. The thinner the file-list is sliced, the greater the parallelism (and > the lower the chances of long-tails). Within reason, of course: the number of > these short-lived list-files is capped at an overridable maximum. > Internal benchmarks against source/target clusters with some slow(ish) > datanodes have indicated significant performance gains when using the > dynamic-strategy. Gains are most pronounced when nFiles greatly exceeds nMaps. > Please note that the DynamicInputFormat might prove useful outside of > DistCp. It is hence available as a mapred/lib, unfettered to DistCpV2. Also > note that the copy-strategies have no bearing on the CopyMapper.map() > implementation. > > 2. Improved startup-time and programmatic use: > When the old-DistCp runs with -update, and creates the > list-of-source-paths, it attempts to filter out files that might be skipped > (by comparing file-sizes, checksums, etc.) This significantly increases the > startup time (or the time spent in serial processing till the MR job is > launched), blocking the calling-thread. This becomes pronounced as nFiles > increases. (Internal benchmarks have seen situations where more time is spent > setting up the job than on the actual transfer.) > DistCpV2 postpones as much work as possible to the MR job. The > file-listing isn't filtered until the map-task runs (at which time, identical > files are skipped). DistCpV2 can now be run "asynchronously". The program > quits at job-launch, logging the job-id for tracking. Programmatically, the > DistCp.execute() returns a Job instance for progress-tracking. > > 3. New features: > (a) -async: As described in #2. > (b) -atomic: Data is copied to a (user-specifiable) tmp-location, and > then moved atomically to destination. > (c) -bandwidth: Enforces a limit on the bandwidth consumed per map. > (d) -strategy: As above. > > A more comprehensive description the newer features, how the dynamic-strategy > works, etc. is available in src/site/xdoc/, and in the pdf that's generated > therefrom, during the build. > High on the list of things to do is support to parallelize copies on a > per-block level. (i.e. Incorporation of HDFS-222.) > I look forward to comments, suggestions and discussion that will hopefully > ensue. I have this running against Hadoop 0.20.203.0. I also have a port to > 0.23.0 (complete with unit-tests). > P.S. > A tip of the hat to Srikanth (Sundarrajan) and Venkatesh (Seetharamaiah), for > ideas, code, reviews and guidance. Although much of the code is mine, the > idea to use the DFS to implement "dynamic" input-splits wasn't. > -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira