Andrew Olson created CRUNCH-679:
-----------------------------------
Summary: Improvements for usage of DistCp
Key: CRUNCH-679
URL: https://issues.apache.org/jira/browse/CRUNCH-679
Project: Crunch
Issue Type: Improvement
Components: Core
Reporter: Andrew Olson
Assignee: Josh Wills
As a follow-up to CRUNCH-660 and CRUNCH-675, a handful of corrections and
improvements have been identified during testing.
* We need to preserve preferred part names, e.g. part-m-00000. Currently the
DistCp support in Crunch does not use the FileTargetImpl#getDestFile method,
so it creates destination file names like out0-m-00000, which are problematic
when multiple map-only jobs write to the same target path. Preferred names can
be preserved by providing a custom CopyListing implementation that dynamically
renames target paths based on a given mapping. Unfortunately, injecting the
logic that modifies the sequence file entry keys currently requires
duplicating a substantial amount of code from the original SimpleCopyListing
class. HADOOP-16147 has been opened so that this can be simplified in the
future.
* The handleOutputs implementation in HFileTarget is essentially identical to
the one in FileTargetImpl that it overrides. We can remove it and just share
the same code.
* It could be useful to add a property for configuring the maximum DistCp task
bandwidth, as the default (100 MB/s per task) may be too high for certain
environments.
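
The renaming idea in the first item above can be sketched without any Hadoop
dependencies. The class below is purely illustrative: PartNameMapping,
addRename and targetPath are hypothetical names, not part of Crunch or DistCp.
The actual change would presumably apply equivalent logic inside a custom
CopyListing when it writes the sequence file entry keys.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not a Crunch or DistCp class: holds a mapping from
// source file names to the preferred destination file names.
final class PartNameMapping {
    private final Map<String, String> renames = new HashMap<>();

    // Register that a file named sourceName should be copied as destName.
    void addRename(String sourceName, String destName) {
        renames.put(sourceName, destName);
    }

    // Rewrite only the final path component of a relative target path.
    // Names without a registered mapping pass through unchanged.
    String targetPath(String relativePath) {
        int slash = relativePath.lastIndexOf('/');
        String parent = relativePath.substring(0, slash + 1);
        String name = relativePath.substring(slash + 1);
        return parent + renames.getOrDefault(name, name);
    }

    public static void main(String[] args) {
        PartNameMapping mapping = new PartNameMapping();
        mapping.addRename("out0-m-00000", "part-m-00000");
        System.out.println(mapping.targetPath("/out0-m-00000")); // prints "/part-m-00000"
        System.out.println(mapping.targetPath("/_SUCCESS"));     // prints "/_SUCCESS"
    }
}
```

Keeping the mapping keyed on file name alone is a simplification for this
sketch; a real implementation may need to key on the full source path if the
same file name can occur in more than one source directory.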