[jira] [Commented] (MAPREDUCE-5065) DistCp should skip checksum comparisons if block-sizes are different on source/target.
[ https://issues.apache.org/jira/browse/MAPREDUCE-5065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13633946#comment-13633946 ]

Hudson commented on MAPREDUCE-5065:
-----------------------------------

Integrated in Hadoop-Yarn-trunk #186 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/186/])
MAPREDUCE-5065. DistCp should skip checksum comparisons if block-sizes are different on source/target. Contributed by Mithun Radhakrishnan. (Revision 1468629)

Result = SUCCESS
kihwal : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1468629
Files :
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/RetriableFileCopyCommand.java
* /hadoop/common/trunk/hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/mapred/TestCopyMapper.java

DistCp should skip checksum comparisons if block-sizes are different on source/target.
---------------------------------------------------------------------------------------

Key: MAPREDUCE-5065
URL: https://issues.apache.org/jira/browse/MAPREDUCE-5065
Project: Hadoop Map/Reduce
Issue Type: Bug
Components: distcp
Affects Versions: 2.0.3-alpha, 0.23.5
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
Fix For: 3.0.0, 2.0.5-beta, 0.23.8
Attachments: MAPREDUCE-5065.branch-0.23.patch, MAPREDUCE-5065.branch-2.patch

When copying files between 2 clusters with different default block-sizes, one sees that the copy fails with a checksum-mismatch, even though the files have identical contents.

The reason is that on HDFS, a file's checksum is unfortunately a function of the block-size of the file. So you could have 2 different files with identical contents (but different block-sizes) have different checksums. (Thus, it's also possible for DistCp to fail to copy files on the same file-system, if the source-file's block-size differs from HDFS default, and -pb isn't used.)

I propose that we skip checksum comparisons under the following conditions:
1. -skipCrc is specified.
2. File-size is 0 (in which case the call to the checksum-servlet is moot).
3. source.getBlockSize() != target.getBlockSize(), since the checksums are guaranteed to differ in this case.

I have a patch for #3.

Edit: I've modified the fix to warn the user (instead of skipping the checksum-check). Skipping parity-checks is unsafe. The code now fails the copy, and suggests that the user either use -pb to preserve block-size, or consider -skipCrc (and forgo copy validation entirely).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
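
The behaviour described in the issue text above (skip the comparison for -skipCrc or zero-length files; otherwise compare, and on a mismatch fail the copy with a hint about -pb and -skipCrc when block-sizes differ) could be sketched roughly as follows. This is only an illustration under stated assumptions: the class name, method names, parameters and message wording are hypothetical and are not taken from RetriableFileCopyCommand or the attached patches.

```java
import java.util.Optional;

// Hypothetical sketch of the checksum policy described in MAPREDUCE-5065.
// None of these names come from the actual DistCp sources.
class ChecksumCheckSketch {

    // Conditions 1 and 2 from the description: no comparison when
    // -skipCrc is given or when the file is empty.
    static boolean shouldCompareChecksums(boolean skipCrc, long fileLength) {
        return !skipCrc && fileLength > 0;
    }

    // Returns a failure reason if the copy should be failed, or empty if
    // the copy passes (or the check is skipped). Checksums are modelled
    // as plain strings purely for illustration.
    static Optional<String> failureReason(boolean skipCrc, long fileLength,
                                          long sourceBlockSize, long targetBlockSize,
                                          String sourceChecksum, String targetChecksum) {
        if (!shouldCompareChecksums(skipCrc, fileLength)) {
            return Optional.empty();
        }
        if (sourceChecksum.equals(targetChecksum)) {
            return Optional.empty();
        }
        String reason = "Checksum mismatch between source and target.";
        if (sourceBlockSize != targetBlockSize) {
            // Per the "Edit" in the description: fail, but tell the user why
            // and what the options are. The wording here is an assumption.
            reason += " Source and target differ in block-size."
                    + " Use -pb to preserve block-sizes during copy,"
                    + " or -skipCrc to skip checksum-checks altogether"
                    + " (the copy is then not validated).";
        }
        return Optional.of(reason);
    }

    public static void main(String[] args) {
        // Different block-sizes and different checksums: fails with the hint.
        System.out.println(failureReason(false, 1024L, 64L << 20, 128L << 20, "aaa", "bbb"));
        // -skipCrc given: no comparison at all.
        System.out.println(failureReason(true, 1024L, 64L << 20, 128L << 20, "aaa", "bbb"));
    }
}
```

Returning a reason instead of throwing keeps the sketch easy to exercise; a real mapper would presumably turn a non-empty result into a failed copy.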
[jira] [Commented] (MAPREDUCE-5065) DistCp should skip checksum comparisons if block-sizes are different on source/target.
[ https://issues.apache.org/jira/browse/MAPREDUCE-5065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13633987#comment-13633987 ]

Hudson commented on MAPREDUCE-5065:
-----------------------------------

Integrated in Hadoop-Hdfs-0.23-Build #584 (See [https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/584/])
MAPREDUCE-5065. DistCp should skip checksum comparisons if block-sizes are different on source/target. Contributed by Mithun Radhakrishnan. (Revision 1468636)

Result = UNSTABLE
kihwal : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1468636
Files :
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/branches/branch-0.23/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/CopyMapper.java
* /hadoop/common/branches/branch-0.23/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/RetriableFileCopyCommand.java
* /hadoop/common/branches/branch-0.23/hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/mapred/TestCopyMapper.java
[jira] [Commented] (MAPREDUCE-5065) DistCp should skip checksum comparisons if block-sizes are different on source/target.
[ https://issues.apache.org/jira/browse/MAPREDUCE-5065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13633999#comment-13633999 ]

Hudson commented on MAPREDUCE-5065:
-----------------------------------

Integrated in Hadoop-Hdfs-trunk #1375 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1375/])
MAPREDUCE-5065. DistCp should skip checksum comparisons if block-sizes are different on source/target. Contributed by Mithun Radhakrishnan. (Revision 1468629)

Result = FAILURE
kihwal : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1468629
Files :
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/RetriableFileCopyCommand.java
* /hadoop/common/trunk/hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/mapred/TestCopyMapper.java
[jira] [Commented] (MAPREDUCE-5065) DistCp should skip checksum comparisons if block-sizes are different on source/target.
[ https://issues.apache.org/jira/browse/MAPREDUCE-5065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13634055#comment-13634055 ]

Hudson commented on MAPREDUCE-5065:
-----------------------------------

Integrated in Hadoop-Mapreduce-trunk #1402 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1402/])
MAPREDUCE-5065. DistCp should skip checksum comparisons if block-sizes are different on source/target. Contributed by Mithun Radhakrishnan. (Revision 1468629)

Result = SUCCESS
kihwal : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1468629
Files :
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/RetriableFileCopyCommand.java
* /hadoop/common/trunk/hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/mapred/TestCopyMapper.java
[jira] [Commented] (MAPREDUCE-5065) DistCp should skip checksum comparisons if block-sizes are different on source/target.
[ https://issues.apache.org/jira/browse/MAPREDUCE-5065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13633462#comment-13633462 ]

Hudson commented on MAPREDUCE-5065:
-----------------------------------

Integrated in Hadoop-trunk-Commit #3618 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/3618/])
MAPREDUCE-5065. DistCp should skip checksum comparisons if block-sizes are different on source/target. Contributed by Mithun Radhakrishnan. (Revision 1468629)

Result = SUCCESS
kihwal : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1468629
Files :
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/RetriableFileCopyCommand.java
* /hadoop/common/trunk/hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/mapred/TestCopyMapper.java
[jira] [Commented] (MAPREDUCE-5065) DistCp should skip checksum comparisons if block-sizes are different on source/target.
[ https://issues.apache.org/jira/browse/MAPREDUCE-5065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13633470#comment-13633470 ]

Kihwal Lee commented on MAPREDUCE-5065:
---------------------------------------

I've committed this to trunk, branch-2 and branch-0.23.
[jira] [Commented] (MAPREDUCE-5065) DistCp should skip checksum comparisons if block-sizes are different on source/target.
[ https://issues.apache.org/jira/browse/MAPREDUCE-5065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13626624#comment-13626624 ]

Kihwal Lee commented on MAPREDUCE-5065:
---------------------------------------

The patch looks good to me. [~cutting], are you okay with the change?
[jira] [Commented] (MAPREDUCE-5065) DistCp should skip checksum comparisons if block-sizes are different on source/target.
[ https://issues.apache.org/jira/browse/MAPREDUCE-5065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13627076#comment-13627076 ]

Doug Cutting commented on MAPREDUCE-5065:
-----------------------------------------

+1 This looks great to me. Thanks!
[jira] [Commented] (MAPREDUCE-5065) DistCp should skip checksum comparisons if block-sizes are different on source/target.
[ https://issues.apache.org/jira/browse/MAPREDUCE-5065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13626263#comment-13626263 ]

Hadoop QA commented on MAPREDUCE-5065:
--------------------------------------

+1 overall. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12577712/MAPREDUCE-5065.branch-2.patch
  against trunk revision .

    +1 @author. The patch does not contain any @author tags.
    +1 tests included. The patch appears to include 1 new or modified test files.
    +1 javac. The applied patch does not increase the total number of javac compiler warnings.
    +1 javadoc. The javadoc tool did not generate any warning messages.
    +1 eclipse:eclipse. The patch built with eclipse:eclipse.
    +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
    +1 release audit. The applied patch does not increase the total number of release audit warnings.
    +1 core tests. The patch passed unit tests in hadoop-tools/hadoop-distcp.
    +1 contrib tests. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3511//testReport/
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3511//console

This message is automatically generated.
[jira] [Commented] (MAPREDUCE-5065) DistCp should skip checksum comparisons if block-sizes are different on source/target.
[ https://issues.apache.org/jira/browse/MAPREDUCE-5065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13605205#comment-13605205 ]

Dave Thompson commented on MAPREDUCE-5065:
------------------------------------------

Reviewed latest patch. Looks good. +1
[jira] [Commented] (MAPREDUCE-5065) DistCp should skip checksum comparisons if block-sizes are different on source/target.
[ https://issues.apache.org/jira/browse/MAPREDUCE-5065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13605626#comment-13605626 ]

Kihwal Lee commented on MAPREDUCE-5065:
---------------------------------------

Review comments:
* Add a reasonable timeout to the test case. This is a relatively new rule. It applies even when you are modifying existing test cases. Please take into account that tests may run on slower hardware.
* If we suggest -skipCrc along with -pb, we should probably inform users of the risk of skipping validation.
[jira] [Commented] (MAPREDUCE-5065) DistCp should skip checksum comparisons if block-sizes are different on source/target.
[ https://issues.apache.org/jira/browse/MAPREDUCE-5065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13604815#comment-13604815 ]

Hadoop QA commented on MAPREDUCE-5065:
--------------------------------------

-1 overall. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12574096/MAPREDUCE-5065.branch-2.patch
  against trunk revision .

    +1 @author. The patch does not contain any @author tags.
    +1 tests included. The patch appears to include 1 new or modified test files.
    -1 one of tests included doesn't have a timeout.
    +1 javac. The applied patch does not increase the total number of javac compiler warnings.
    +1 javadoc. The javadoc tool did not generate any warning messages.
    +1 eclipse:eclipse. The patch built with eclipse:eclipse.
    +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
    +1 release audit. The applied patch does not increase the total number of release audit warnings.
    +1 core tests. The patch passed unit tests in hadoop-tools/hadoop-distcp.
    +1 contrib tests. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3424//testReport/
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3424//console

This message is automatically generated.
[jira] [Commented] (MAPREDUCE-5065) DistCp should skip checksum comparisons if block-sizes are different on source/target.
[ https://issues.apache.org/jira/browse/MAPREDUCE-5065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13603418#comment-13603418 ]

Mithun Radhakrishnan commented on MAPREDUCE-5065:

I'm with you on the need for a blocksize-independent checksum. I wasn't convinced that combining CRC32-checksums together to form a higher-level checksum could be correct. (Thanks for the explanation.)

{quote}
instruct her to run with -pb, not -skipCrc.
{quote}

Yep, that should take care of #2 (above), but not #1. The user will still need to fail first and rerun, because she's unlikely to know that some of her source-files might have non-default block-sizes.

Unless the checksum calculation is fixed (or -pb is default), I don't think DistCp should enforce a check that's a guaranteed failure, under unforeseeable circumstances.

DistCp should skip checksum comparisons if block-sizes are different on source/target.

Key: MAPREDUCE-5065
URL: https://issues.apache.org/jira/browse/MAPREDUCE-5065
Project: Hadoop Map/Reduce
Issue Type: Bug
Components: distcp
Affects Versions: 2.0.3-alpha, 0.23.5
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
Attachments: MAPREDUCE-5065.branch23.patch, MAPREDUCE-5065.branch2.patch

When copying files between 2 clusters with different default block-sizes, one sees that the copy fails with a checksum-mismatch, even though the files have identical contents.

The reason is that on HDFS, a file's checksum is unfortunately a function of the block-size of the file. So you could have 2 different files with identical contents (but different block-sizes) have different checksums. (Thus, it's also possible for DistCp to fail to copy files on the same file-system, if the source-file's block-size differs from HDFS default, and -pb isn't used.)

I propose that we skip checksum comparisons under the following conditions:
1. -skipCrc is specified.
2. File-size is 0 (in which case the call to the checksum-servlet is moot).
3. source.getBlockSize() != target.getBlockSize(), since the checksums are guaranteed to differ in this case.

I have a patch for #3.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
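The three skip conditions proposed in the issue description can be sketched as a single predicate. This is only an illustration under stated assumptions: the class `ChecksumSkipSketch`, the method `shouldSkipChecksum`, and its parameters are hypothetical names, and this is not the code from the attached patches.

```java
// Hypothetical helper illustrating the proposal; not taken from the DistCp patches.
final class ChecksumSkipSketch {

    /**
     * Returns true when a source/target checksum comparison would be
     * skipped under the three conditions listed in the issue description.
     */
    static boolean shouldSkipChecksum(boolean skipCrc, long fileLength,
                                      long sourceBlockSize, long targetBlockSize) {
        if (skipCrc) {
            return true;                 // condition 1: -skipCrc was specified
        }
        if (fileLength == 0) {
            return true;                 // condition 2: nothing to checksum
        }
        // condition 3: block sizes differ, so the checksums cannot match
        return sourceBlockSize != targetBlockSize;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Same block size on both sides: the comparison is still performed.
        System.out.println(shouldSkipChecksum(false, 10L, 64 * mb, 64 * mb));
        // Different block sizes: the comparison is skipped.
        System.out.println(shouldSkipChecksum(false, 10L, 64 * mb, 128 * mb));
    }
}
```

Note that condition 3 skips the check only for the affected files, whereas -skipCrc disables it for every file in the copy.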
[jira] [Commented] (MAPREDUCE-5065) DistCp should skip checksum comparisons if block-sizes are different on source/target.
[ https://issues.apache.org/jira/browse/MAPREDUCE-5065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13603437#comment-13603437 ]

Hadoop QA commented on MAPREDUCE-5065:

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12573882/MAPREDUCE-5065.branch-2.patch
  against trunk revision .

    {color:green}+1 @author{color}. The patch does not contain any @author tags.

    {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.

    {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

    {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.

    {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.

    {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

    {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

    {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-tools/hadoop-distcp.

    {color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3419//testReport/
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3419//console

This message is automatically generated.
[jira] [Commented] (MAPREDUCE-5065) DistCp should skip checksum comparisons if block-sizes are different on source/target.
[ https://issues.apache.org/jira/browse/MAPREDUCE-5065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13603493#comment-13603493 ]

Kihwal Lee commented on MAPREDUCE-5065:

bq. Another option might be to implement a checksum that's blocksize-independent...

Reading the whole metadata may be too much, especially for huge files. It will be better if we make the computation happen where the data is. :)

Most hashing is incremental, so if DFSClient feeds the last state of the hash into the next datanode and lets it continue updating it, the result will be independent of block size. The current way of doing file checksum allows calculating individual block checksums in parallel, but we are not taking advantage of it in DFSClient anyway. So I don't think there won't be any significant changes in performance or overhead.

We should probably continue this discussion in a separate jira.
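Kihwal Lee's point about incremental hashing can be illustrated with a small, self-contained sketch. It assumes a single running `java.util.zip.CRC32` stands in for the hash state that would be handed from one datanode to the next; the class `IncrementalHashSketch` and the method `chainedChecksum` are hypothetical names, and no HDFS API is involved.

```java
import java.util.zip.CRC32;

// Hypothetical illustration: one hash state is carried across simulated blocks.
final class IncrementalHashSketch {

    /** Feeds the data into a single running CRC32, one simulated block at a time. */
    static long chainedChecksum(byte[] data, int blockSize) {
        CRC32 crc = new CRC32();   // state that would travel from datanode to datanode
        for (int off = 0; off < data.length; off += blockSize) {
            int len = Math.min(blockSize, data.length - off);
            crc.update(data, off, len);   // the "next datanode" continues the same state
        }
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] data = new byte[1000];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) (i * 31 + 7);
        }
        // The block boundaries differ, but the byte stream fed to the hash does not.
        System.out.println(chainedChecksum(data, 64) == chainedChecksum(data, 128));
    }
}
```

Because the hash sees the same byte sequence regardless of where the block boundaries fall, the result does not depend on the block size. The cost is that the blocks have to be processed in order rather than in parallel.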
[jira] [Commented] (MAPREDUCE-5065) DistCp should skip checksum comparisons if block-sizes are different on source/target.
[ https://issues.apache.org/jira/browse/MAPREDUCE-5065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13603494#comment-13603494 ]

Kihwal Lee commented on MAPREDUCE-5065:

bq. So I don't think there won't be any significant changes in performance or overhead.

Sorry, unintended double negation.
[jira] [Commented] (MAPREDUCE-5065) DistCp should skip checksum comparisons if block-sizes are different on source/target.
[ https://issues.apache.org/jira/browse/MAPREDUCE-5065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13603502#comment-13603502 ]

Kihwal Lee commented on MAPREDUCE-5065:

Filed HDFS-4605 for block-size independent FileChecksum in HDFS.
[jira] [Commented] (MAPREDUCE-5065) DistCp should skip checksum comparisons if block-sizes are different on source/target.
[ https://issues.apache.org/jira/browse/MAPREDUCE-5065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13602538#comment-13602538 ]

Doug Cutting commented on MAPREDUCE-5065:

This seems like it could give false comfort. Rather, it would be safer to advise people, when they attempt to copy files with different block sizes, to specify either -pb or -skipCrc. So better documentation, warnings and error messages might suffice. Then the results of a distcp could still be trusted unless you've explicitly specified -skipCrc.
[jira] [Commented] (MAPREDUCE-5065) DistCp should skip checksum comparisons if block-sizes are different on source/target.
[ https://issues.apache.org/jira/browse/MAPREDUCE-5065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13602647#comment-13602647 ]

Mithun Radhakrishnan commented on MAPREDUCE-5065:

Hello, Doug. Thank you for looking at this.

(For the moment, let's ignore that while DistCp code in 2.0/ does honour -skipCrc, 0.23/ code does not. I'll update the 0.23 patch to bring both of these to parity.)

IMO, it will not suffice to only document this in docs/code/warning-messages:

1. The user isn't likely to realize that the default block-sizes differ between source and target. She is even less likely to perceive the difference if the block-sizes on the source-files were explicitly set to a non-default value. (And that's entirely possible with FileSystem.create().) The most likely manner in which she'd notice is when DistCp fails on checksum-diff, at which point the warning would instruct her to -skipCrc on the rerun.
2. Using -skipCrc will disable checksum-checks on all files copied. It's preferable to apply checks where we can, and skip only where block-sizes differ (because that's a guaranteed failure.)

One alternative is to make -pb/-skipCrc default, but that's undesirable as well.
[jira] [Commented] (MAPREDUCE-5065) DistCp should skip checksum comparisons if block-sizes are different on source/target.
[ https://issues.apache.org/jira/browse/MAPREDUCE-5065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13602752#comment-13602752 ]

Doug Cutting commented on MAPREDUCE-5065:

I think we should instead probably instruct her to run with -pb, not -skipCrc.

Another option might be to implement a checksum that's blocksize-independent, for when block sizes are different. Currently the file checksum works by taking the CRC32 for every 512 byte chunk of the block, combining these with MD5 into a single checksum for the block, then combining these with MD5 into a single checksum for the file. The first combination is done at the Datanode (in DataXceiver#blockChecksum) and the second at the client (in DFSClient#getFileChecksum).

If instead the client could directly retrieve the list of CRC32s from the datanode then it could combine them into a blocksize-independent checksum (so long as blockSize is a multiple of bytesPerChecksum and bytesPerChecksum is the same between the filesystems, which is ordinarily the case). Op.java already includes a READ_METADATA operation, presumably intended to return the CRC32s to the client, but it is not implemented. We'd probably want to extend the getFileChecksum API to permit specifying the type of checksum requested, whether MD5MD5CRC32 or MD5CRC32.

This would be a significant effort and it touches core bits of HDFS so should not be approached lightly.
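The checksum layering Doug Cutting describes, and the flat alternative he suggests, can be modelled in a few lines. This is a simplified sketch under stated assumptions, not the HDFS implementation: the class `Md5Md5Crc32Sketch` and its methods are hypothetical, blocks are simulated as slices of one in-memory array, and any extra fields the real file checksum may carry are ignored.

```java
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.zip.CRC32;

// Simplified, hypothetical model of the two schemes; not the HDFS code.
final class Md5Md5Crc32Sketch {
    static final int BYTES_PER_CHECKSUM = 512;

    private static MessageDigest md5() {
        try {
            return MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Feeds the CRC32 of every 512-byte chunk in [off, off + len) into md. */
    private static void updateWithChunkCrcs(MessageDigest md, byte[] data, int off, int len) {
        for (int p = off; p < off + len; p += BYTES_PER_CHECKSUM) {
            CRC32 crc = new CRC32();
            crc.update(data, p, Math.min(BYTES_PER_CHECKSUM, off + len - p));
            md.update(ByteBuffer.allocate(4).putInt((int) crc.getValue()).array());
        }
    }

    /** MD5 over per-block MD5s of chunk CRC32s: the layered scheme. */
    static byte[] fileChecksum(byte[] data, int blockSize) {
        MessageDigest fileMd = md5();
        for (int off = 0; off < data.length; off += blockSize) {
            MessageDigest blockMd = md5();
            updateWithChunkCrcs(blockMd, data, off, Math.min(blockSize, data.length - off));
            fileMd.update(blockMd.digest());   // block boundaries leak into the result
        }
        return fileMd.digest();
    }

    /** One MD5 over the flat list of chunk CRC32s, collected block by block. */
    static byte[] flatChecksum(byte[] data, int blockSize) {
        MessageDigest md = md5();
        for (int off = 0; off < data.length; off += blockSize) {
            updateWithChunkCrcs(md, data, off, Math.min(blockSize, data.length - off));
        }
        return md.digest();
    }

    public static void main(String[] args) {
        byte[] data = new byte[4096];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) (i * 31 + 7);
        }
        System.out.println(MessageDigest.isEqual(fileChecksum(data, 1024), fileChecksum(data, 2048)));
        System.out.println(MessageDigest.isEqual(flatChecksum(data, 1024), flatChecksum(data, 2048)));
    }
}
```

In this model the layered checksum changes with the block size because the per-block MD5s group the CRC32s differently, while the flat checksum sees the same CRC32 sequence as long as the block size is a multiple of the 512-byte chunk size, which matches the caveat in the comment.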
[jira] [Commented] (MAPREDUCE-5065) DistCp should skip checksum comparisons if block-sizes are different on source/target.
[ https://issues.apache.org/jira/browse/MAPREDUCE-5065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13601937#comment-13601937 ]

Hadoop QA commented on MAPREDUCE-5065:

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12573645/MAPREDUCE-5065.branch23.patch
  against trunk revision .

    {color:red}-1 patch{color}. The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3412//console

This message is automatically generated.