Some questions about DistCp

Zheng, Kai Thu, 07 Jan 2016 23:49:08 -0800

Hi,

I recently looked into DistCp and have some questions. Before filing JIRA 
issues, I thought it would be good to discuss them here first.


I read the doc at the following link and take it to be the latest revision, 
corresponding to the trunk codebase.
http://hadoop.apache.org/docs/current/hadoop-distcp/DistCp.html
If that's right, we may need to add the following important features to it, 
because I don't see them mentioned in the doc.

1. -diff option: use the snapshot diff report to identify the differences 
between source and target when computing the copy list.

2. -numListstatusThreads option: the number of threads used to compute the 
copy list concurrently.

3. -p t: preserve timestamps.

These features help users speed up time-consuming inter- or intra-cluster 
sync. So beyond adding these options to the table of command line options, it 
would be better to document them as thoroughly as we did the other functions.
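
For illustration, here is a small Python sketch that assembles a command line 
combining the three options. The host names, paths, snapshot names (s1, s2) 
and thread count are hypothetical, and the helper itself is not part of 
DistCp. It also assumes -diff is passed together with -update.

```python
import shlex


def build_distcp_command(src, dst, from_snapshot, to_snapshot, listing_threads=20):
    """Assemble an example `hadoop distcp` argument list (illustrative helper)."""
    return [
        "hadoop", "distcp",
        "-update",                             # assumption: -diff is used with -update
        "-diff", from_snapshot, to_snapshot,   # snapshot diff drives the copy list
        "-numListstatusThreads", str(listing_threads),
        "-pt",                                 # -p with the t flag: preserve timestamps
        src, dst,
    ]


# Hypothetical clusters and snapshot names, for illustration only.
cmd = build_distcp_command(
    "hdfs://nn1:8020/data", "hdfs://nn2:8020/data", "s1", "s2"
)
print(shlex.join(cmd))
```

The list form can be passed to subprocess.run as-is, which avoids shell 
quoting issues.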

A main use case is copying from a source HDFS cluster to a target HDFS 
cluster. The doc mentions that each NodeManager can reach and communicate with 
both the source and destination file systems. In this case, where is it 
recommended to run the DistCp command: in the source cluster or the target? 
Might it be better to run it on the source side, so the copy mappers can read 
locally via short-circuit reads (but would then write remotely)? Are there any 
considerations here?

In the above case (both source and target are HDFS clusters), there is a 
consideration for replicated files: if the block size and checksum opt are not 
preserved (via -pb), then after the copy is done we may skip comparing the 
file checksums and avoid computing them, because in that situation the block 
size and checksum type may differ, and then the file checksums will surely 
differ. Of course, most of the time the source and target clusters use the 
same settings, so even when not preserved, I guess the block size and checksum 
type may still be the same, particularly with the default values. So, to be 
safer, maybe we can improve this as follows: compare the block size and 
checksum opt first; if they're the same, compare the file checksums, otherwise 
skip the comparison. Does that make sense? Note that this is partly raised in 
HDFS-9613.
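
To make the proposal concrete, here is a minimal sketch of the decision logic. 
It is written in Python rather than as actual DistCp code, and the FileInfo 
type, its fields, and verify_copy are hypothetical names used only for 
illustration. It assumes the checksum opt consists of a checksum type plus 
bytes-per-checksum.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileInfo:
    """Hypothetical stand-in for the per-file metadata the check would read."""
    block_size: int
    checksum_type: str        # e.g. "CRC32C"
    bytes_per_checksum: int
    file_checksum: str


def verify_copy(source: FileInfo, target: FileInfo) -> str:
    """Return 'match', 'mismatch', or 'skipped' for one copied file."""
    same_layout = (
        source.block_size == target.block_size
        and source.checksum_type == target.checksum_type
        and source.bytes_per_checksum == target.bytes_per_checksum
    )
    if not same_layout:
        # File checksums are not comparable, so do not compute or compare them.
        return "skipped"
    if source.file_checksum == target.file_checksum:
        return "match"
    return "mismatch"


src = FileInfo(128 * 1024 * 1024, "CRC32C", 512, "abc123")
same = FileInfo(128 * 1024 * 1024, "CRC32C", 512, "abc123")
corrupt = FileInfo(128 * 1024 * 1024, "CRC32C", 512, "ffff00")
other_block_size = FileInfo(64 * 1024 * 1024, "CRC32C", 512, "zzz999")

print(verify_copy(src, same))
print(verify_copy(src, corrupt))
print(verify_copy(src, other_block_size))
```

The point of the three-way result is that "skipped" stays distinguishable from 
"match", so a skipped comparison is never reported as a verified copy.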

For striped files, we'll need to update the command as well, and probably 
handle them specially. This is currently under discussion in HDFS-8430.

Thanks for the discussion.

Regards,
Kai
