[ 
https://issues.apache.org/jira/browse/HADOOP-18629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17693210#comment-17693210
 ] 

ASF GitHub Bot commented on HADOOP-18629:
-----------------------------------------

steveloughran commented on PR #5391:
URL: https://github.com/apache/hadoop/pull/5391#issuecomment-1443728823

   This is interesting. I know there are deployments without HDFS around (e.g. 
Azure clusters), but I do see that the import is already there for snapshot 
updates (HDFS only), with explicit imports of SnapshotDiffReport. Tracing it 
back, if you use "-diff" on the command line then HDFS *must* be on the 
classpath.
   
   Your link flags up that it is already in CopyMapper. Looking at that, I 
don't see it in branch-3.3; it came in with HADOOP-14254. 
   
   Given that it is already there on a codepath hit only when the option to 
copy the erasure policy is set (and skipped if not), then, provided the change 
goes in such that it is optional, your patch isn't going to force in a new 
run-time dependency, is it?
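   
   To make the "optional" requirement concrete, here is a minimal sketch of one 
way to keep an HDFS-only feature off the default code path. It is not taken 
from the patch: the class `OptionalHdfsGuard`, the method names, and the 
returned path labels are hypothetical, and probing for 
`org.apache.hadoop.hdfs.DistributedFileSystem` is only an assumed stand-in for 
"HDFS is on the classpath".

```java
/**
 * Sketch only: keeps an HDFS-specific feature optional by probing the
 * classpath and selecting the HDFS-only path solely when it was requested.
 * All names here are hypothetical and are not part of DistCp.
 */
final class OptionalHdfsGuard {

    // Assumption: the presence of this class is used as a proxy for
    // "the HDFS client classes are available at run time".
    static final String HDFS_PROBE_CLASS =
        "org.apache.hadoop.hdfs.DistributedFileSystem";

    private OptionalHdfsGuard() {
    }

    /** Returns true if the named class can be located, without initializing it. */
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false,
                OptionalHdfsGuard.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    /**
     * Picks the copy path. The HDFS-only branch is reached only when the
     * option was requested, so jobs that never set it need no HDFS classes.
     */
    static String choosePath(boolean favoredNodesRequested,
                             boolean hdfsAvailable) {
        if (!favoredNodesRequested) {
            return "generic";
        }
        if (!hdfsAvailable) {
            throw new IllegalStateException(
                "favored nodes requested but HDFS classes are not on the classpath");
        }
        return "hdfs-favored-nodes";
    }

    public static void main(String[] args) {
        boolean hdfs = isOnClasspath(HDFS_PROBE_CLASS);
        System.out.println("HDFS on classpath: " + hdfs);
        System.out.println("Path without the option: " + choosePath(false, hdfs));
    }
}
```

   The point of the sketch is that a job which never sets the option takes the 
generic branch whether or not HDFS is present, which mirrors how "-diff" only 
requires HDFS when it is actually used.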
   
   Let me look at the new patch some more without worrying about that 
detail... the builder API is public. 
   
   Be aware that we are always very nervous about touching distcp because it is 
fairly old and brittle code that is used incredibly broadly, not just on the 
command line but also through the Java API from applications like Hive. I think 
this is fairly low risk, but I will highlight the JIRA on the HDFS mailing list 
so they can review it too.




> Hadoop DistCp supports specifying favoredNodes for data copying
> ---------------------------------------------------------------
>
>                 Key: HADOOP-18629
>                 URL: https://issues.apache.org/jira/browse/HADOOP-18629
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: tools/distcp
>    Affects Versions: 3.3.4
>            Reporter: zhuyaogai
>            Priority: Major
>              Labels: pull-request-available
>
> When importing large-scale data into HBase, we always generate the HFiles on 
> another Hadoop cluster, use the DistCp tool to copy the data to the HBase 
> cluster, and then bulk load the data into the HBase table. However, the data 
> locality is rather low, which may result in high query latency. It recovers 
> after a compaction. Therefore, we can increase the data locality by 
> specifying the favoredNodes in DistCp.
> Could I submit a pull request to optimize it?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org
