Andrew Olson updated CRUNCH-679:
--------------------------------
Fix Version/s: 1.0.0
> Improvements for usage of DistCp
> --------------------------------
>
> Key: CRUNCH-679
> URL: https://issues.apache.org/jira/browse/CRUNCH-679
> Project: Crunch
> Issue Type: Improvement
> Components: Core
> Reporter: Andrew Olson
> Assignee: Josh Wills
> Priority: Major
> Fix For: 1.0.0
>
> Time Spent: 1.5h
> Remaining Estimate: 0h
>
> As a follow-up to CRUNCH-660 and CRUNCH-675, a handful of corrections and
> improvements have been identified during testing.
> * We need to preserve preferred part names, e.g. part-m-00000. Currently the
> DistCp support in Crunch does not make use of the FileTargetImpl#getDestFile
> method, and would therefore create destination file names like out0-m-00000,
> which are problematic when there are multiple map-only jobs writing to the
> same target path. This can be achieved by providing a custom CopyListing
> implementation that is capable of dynamically renaming target paths based on
> a given mapping. Unfortunately, a substantial amount of code duplicated from
> the original SimpleCopyListing class is currently required in order to inject
> the necessary logic for modifying the sequence file entry keys. HADOOP-16147
> has been opened to allow this to be simplified in the future.
> * The handleOutputs implementation in HFileTarget is essentially identical to
> the one in FileTargetImpl that it overrides. We can remove it and just share
> the same code.
> * It could be useful to add a property for configuring the max DistCp task
> bandwidth, as the default (100 MB/s per task) may be too high for certain
> environments.
> * The default of 1000 for max DistCp map tasks may be too high in some
> situations, resulting in 503 Slow Down errors from S3, especially if there are
> multiple jobs writing into the same bucket. Reducing it to 100 should help
> prevent issues along those lines while still providing adequate parallel
> throughput.
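The renaming idea in the first item can be sketched in isolation. This is a minimal illustration only, not the Crunch or Hadoop code: the class TargetPathRenamer and its rename method are hypothetical names, and it assumes the copy listing keys are relative paths such as /out0-m-00000 and that the mapping is keyed by source file name. A real CopyListing implementation would apply something like this to each sequence file entry key before writing it.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of Crunch or Hadoop): swaps the file name of a
// relative target path for its preferred name when a mapping entry exists.
final class TargetPathRenamer {
    // Source file name -> preferred destination file name.
    private final Map<String, String> renames;

    TargetPathRenamer(Map<String, String> renames) {
        // Defensive copy so later changes to the caller's map have no effect.
        this.renames = new LinkedHashMap<>(renames);
    }

    /** Returns the path with its last segment renamed if mapped, else unchanged. */
    String rename(String relativePath) {
        int slash = relativePath.lastIndexOf('/');
        // Everything up to and including the last '/', or "" if there is none.
        String parent = relativePath.substring(0, slash + 1);
        String name = relativePath.substring(slash + 1);
        return parent + renames.getOrDefault(name, name);
    }
}
```

With a mapping of out0-m-00000 to part-m-00000, rename("/out0-m-00000") yields /part-m-00000, while unmapped names pass through untouched. Keeping the lookup keyed on the file name alone leaves parent directories intact, which matters when the listing contains nested paths.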