[
https://issues.apache.org/jira/browse/MAPREDUCE-5065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13605205#comment-13605205
]
Dave Thompson commented on MAPREDUCE-5065:
--
Reviewed latest patch. Looks good. +1
DistCp should skip checksum comparisons if block-sizes are different on
source/target.
--
Key: MAPREDUCE-5065
URL: https://issues.apache.org/jira/browse/MAPREDUCE-5065
Project: Hadoop Map/Reduce
Issue Type: Bug
Components: distcp
Affects Versions: 2.0.3-alpha, 0.23.5
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
Attachments: MAPREDUCE-5065.branch-0.23.patch,
MAPREDUCE-5065.branch-2.patch
When copying files between two clusters with different default block-sizes,
the copy fails with a checksum-mismatch, even though the files have
identical contents.
The reason is that on HDFS, a file's checksum is unfortunately a function of
the block-size of the file. So two files with identical contents (but
different block-sizes) can have different checksums.
(Thus, it's also possible for DistCp to fail to copy files on the same
file-system, if the source-file's block-size differs from the HDFS default,
and -pb isn't used.)
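To illustrate why block-size leaks into the checksum, here is a minimal,
self-contained sketch. It is not HDFS code: it assumes a simplified composite
checksum (an MD5 over per-block MD5 digests of the raw bytes), which is only
a stand-in for a block-composed file checksum. The class and method names are
hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Hypothetical illustration only; this is not how Hadoop names or lays out its classes.
final class BlockChecksumSketch {

    // Simplified composite checksum: digest each block on its own,
    // then digest the sequence of per-block digests.
    public static byte[] compositeChecksum(byte[] data, int blockSize) {
        try {
            MessageDigest outer = MessageDigest.getInstance("MD5");
            for (int off = 0; off < data.length; off += blockSize) {
                int len = Math.min(blockSize, data.length - off);
                MessageDigest inner = MessageDigest.getInstance("MD5");
                inner.update(data, off, len);
                outer.update(inner.digest());
            }
            return outer.digest();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required on every Java platform, so this should not happen.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        byte[] contents = "identical file contents".getBytes(StandardCharsets.UTF_8);

        // Same bytes, same block-size: the composite checksums agree.
        boolean sameLayout = Arrays.equals(
                compositeChecksum(contents, 8), compositeChecksum(contents, 8));

        // Same bytes, different block-size: the block boundaries move,
        // so the per-block digests (and the composite) change.
        boolean differentLayout = Arrays.equals(
                compositeChecksum(contents, 8), compositeChecksum(contents, 16));

        System.out.println("equal with same block-size: " + sameLayout);
        System.out.println("equal with different block-size: " + differentLayout);
    }
}
```

The contents never change between the two comparisons; only the block
boundaries do, which is enough to change the result.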
I propose that we skip checksum comparisons under the following conditions:
1. -skipCrc is specified.
2. File-size is 0 (in which case the call to the checksum-servlet is moot).
3. source.getBlockSize() != target.getBlockSize(), since the checksums are
guaranteed to differ in this case.
I have a patch for #3.
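The three conditions above can be sketched as a single predicate. This is an
illustration under stated assumptions, not the attached patch: FileInfo is a
hypothetical stand-in whose getLen() and getBlockSize() getters mirror those
on Hadoop's FileStatus, and shouldSkipChecksum is an invented name.

```java
// Hypothetical sketch of the proposed skip conditions; not the actual patch.
final class ChecksumSkipSketch {

    // Minimal stand-in for a file's status (assumption: not a Hadoop class).
    static final class FileInfo {
        private final long len;
        private final long blockSize;

        FileInfo(long len, long blockSize) {
            this.len = len;
            this.blockSize = blockSize;
        }

        long getLen() { return len; }
        long getBlockSize() { return blockSize; }
    }

    // Returns true when the checksum comparison should be skipped.
    public static boolean shouldSkipChecksum(boolean skipCrc, FileInfo source, FileInfo target) {
        if (skipCrc) {
            return true;                  // 1. -skipCrc is specified
        }
        if (source.getLen() == 0) {
            return true;                  // 2. empty file, nothing to compare
        }
        // 3. differing block-sizes make the checksums incomparable
        return source.getBlockSize() != target.getBlockSize();
    }

    public static void main(String[] args) {
        FileInfo src = new FileInfo(1024L, 64L * 1024 * 1024);
        FileInfo sameBlocks = new FileInfo(1024L, 64L * 1024 * 1024);
        FileInfo otherBlocks = new FileInfo(1024L, 128L * 1024 * 1024);

        System.out.println(shouldSkipChecksum(false, src, sameBlocks));
        System.out.println(shouldSkipChecksum(false, src, otherBlocks));
    }
}
```

Only when none of the three conditions holds does the comparison go ahead.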