[ https://issues.apache.org/jira/browse/CRUNCH-679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16778487#comment-16778487 ]
Andrew Olson commented on CRUNCH-679:
-------------------------------------
Pull request: https://github.com/apache/crunch/pull/20
> Improvements for usage of DistCp
> --------------------------------
>
> Key: CRUNCH-679
> URL: https://issues.apache.org/jira/browse/CRUNCH-679
> Project: Crunch
> Issue Type: Improvement
> Components: Core
> Reporter: Andrew Olson
> Assignee: Josh Wills
> Priority: Major
> Time Spent: 10m
> Remaining Estimate: 0h
>
> As a follow-up to CRUNCH-660 and CRUNCH-675, a handful of corrections and
> improvements have been identified during testing.
> * We need to preserve preferred part names, e.g. part-m-00000. Currently the
> DistCp support in Crunch does not make use of the FileTargetImpl#getDestFile
> method, and would therefore create destination file names like out0-m-00000,
> which are problematic when there are multiple map-only jobs writing to the
> same target path. The preferred names can be preserved by providing a custom
> CopyListing implementation that dynamically renames target paths based on a
> given mapping. Unfortunately, this currently requires duplicating a substantial
> amount of code from the original SimpleCopyListing class in order to inject the
> logic that modifies the sequence file entry keys. HADOOP-16147 has been opened
> so that this can be simplified in the future.
> * The handleOutputs implementation in HFileTarget is essentially identical to
> the one in FileTargetImpl that it overrides. We can remove it and just share
> the same code.
> * It could be useful to add a property for configuring the max DistCp task
> bandwidth, as the default (100 MB/s per task) may be too high for certain
> environments.
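
For illustration only, here is a minimal plain-Java sketch of the two ideas described above: renaming copy-listing entries according to a given mapping, and reading a configurable bandwidth cap. It is not the code from the pull request. The class name DistCpSketch, its methods, and the property name crunch.distcp.max.bandwidth.mb are all hypothetical, and the sketch assumes the listing keys are source-relative paths such as /out0-m-00000. The only value taken from the issue is the 100 MB/s default.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch, not the Crunch or Hadoop API. It mimics what a custom
 * CopyListing would do to each listing key, without depending on Hadoop.
 */
final class DistCpSketch {
    // Hypothetical property name; the issue does not say what the real one is.
    static final String BANDWIDTH_PROPERTY = "crunch.distcp.max.bandwidth.mb";
    // Default per-task bandwidth in MB/s, as stated in the issue description.
    static final int DEFAULT_BANDWIDTH_MB = 100;

    private final Map<String, String> renames;

    /** @param renames source file name mapped to preferred target file name */
    DistCpSketch(Map<String, String> renames) {
        this.renames = new HashMap<>(renames);
    }

    /**
     * Replaces the final path component if a rename is registered for it;
     * any other path is returned unchanged.
     */
    String remap(String relativePath) {
        int slash = relativePath.lastIndexOf('/');
        String dir = relativePath.substring(0, slash + 1);
        String name = relativePath.substring(slash + 1);
        return dir + renames.getOrDefault(name, name);
    }

    /** Reads the bandwidth cap from the given properties, else the default. */
    static int maxBandwidthMb(Map<String, String> props) {
        String value = props.get(BANDWIDTH_PROPERTY);
        return value == null ? DEFAULT_BANDWIDTH_MB : Integer.parseInt(value.trim());
    }

    public static void main(String[] args) {
        Map<String, String> renames = new HashMap<>();
        renames.put("out0-m-00000", "part-m-00000");
        DistCpSketch sketch = new DistCpSketch(renames);

        System.out.println(sketch.remap("/out0-m-00000")); // prints /part-m-00000
        System.out.println(sketch.remap("/_SUCCESS"));     // prints /_SUCCESS

        Map<String, String> props = new HashMap<>();
        System.out.println(maxBandwidthMb(props));         // prints 100
        props.put(BANDWIDTH_PROPERTY, "25");
        System.out.println(maxBandwidthMb(props));         // prints 25
    }
}
```

Because the mapping is keyed on the file name alone, several map-only jobs writing to the same target could each supply their own entries. A real implementation would apply the same lookup to the sequence file entry keys while building the copy listing.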