Hi,

I recently looked into DistCp and have some questions. I thought it would be good to discuss them here first before filing anything in JIRA.
I read the doc at the following link and assume it is the latest revision, corresponding to the trunk codebase.

http://hadoop.apache.org/docs/current/hadoop-distcp/DistCp.html

If that's right, then we may need to add the following important features, because I don't see them mentioned in the doc.

1. -diff option: use the snapshot diff report to identify the differences between source and target when computing the copy list.
2. -numListstatusThreads option: the number of threads used to compute the copy list concurrently.
3. -pt: preserve timestamps.

These features are very useful for speeding up time-consuming inter- or intra-cluster syncs, so besides adding the options to the table of command line options, it would be better to document them properly, as we did for other functions.

A main use case is copying from a source HDFS cluster to a target HDFS cluster. The doc mentions that each NodeManager can reach and communicate with both the source and destination file systems. In this case, where is it recommended to run the DistCp command, in the source cluster or the target? Might it be better to run it on the source side, so the copy mappers can read locally via short circuit (but would then write remotely)? Are there any considerations here?

In the same case (both source and target are HDFS clusters), there was a consideration for replicated files: if the block size and checksum opt are not preserved (via -pb), then after the copy is done we may skip the file checksum comparison and avoid computing the checksums, because in that situation the block size and checksum type may differ, and when they do the file checksums will surely differ. Of course, most of the time the source and target clusters use the same settings, so even when not preserved, I guess the block size and checksum type may still be the same, particularly with default values.
So, to be safer, maybe we can improve this as follows: compare the block size and checksum opt first; if they are the same, compare the file checksums, otherwise skip the comparison. Does that make sense? Note that this is partly raised in HDFS-9613.

For striped files, we'll need to update the command as well, and probably handle them specially. This is currently under discussion in HDFS-8430.

Thanks for the discussion.

Regards,
Kai
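
P.S. To make the checksum proposal a bit more concrete, here is a rough sketch of the decision order I have in mind. It is plain Python, not DistCp code: FileInfo, checksums_comparable and verify_copy are names made up for illustration and do not exist in Hadoop, and the string results are just placeholders for whatever the real copy step would do.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileInfo:
    """Hypothetical stand-in for the attributes known about one file."""
    block_size: int       # block size in bytes
    checksum_type: str    # e.g. "CRC32C" (assumed label, for illustration)
    file_checksum: bytes  # file-level checksum as reported by the file system


def checksums_comparable(src: FileInfo, dst: FileInfo) -> bool:
    """File checksums are only meaningful to compare when the block size
    and checksum type agree on both sides."""
    return (src.block_size == dst.block_size
            and src.checksum_type == dst.checksum_type)


def verify_copy(src: FileInfo, dst: FileInfo) -> str:
    """Return "skipped", "match" or "mismatch" for one copied file."""
    # Step 1: compare block size and checksum type first.
    if not checksums_comparable(src, dst):
        # The checksums would differ anyway, so do not compare them.
        return "skipped"
    # Step 2: attributes agree, so the file checksums can be compared.
    return "match" if src.file_checksum == dst.file_checksum else "mismatch"


if __name__ == "__main__":
    mb = 1024 * 1024
    source = FileInfo(128 * mb, "CRC32C", b"\x01\x02")
    same = FileInfo(128 * mb, "CRC32C", b"\x01\x02")
    other_block_size = FileInfo(64 * mb, "CRC32C", b"\x09\x09")
    print(verify_copy(source, same))
    print(verify_copy(source, other_block_size))
```

The point of this ordering is that the check no longer depends on whether -pb was given, only on whether the two files actually ended up with the same block size and checksum type.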