[ https://issues.apache.org/jira/browse/MAPREDUCE-5065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13602752#comment-13602752 ]

Doug Cutting commented on MAPREDUCE-5065:
-----------------------------------------

I think we should probably instead instruct her to run with -pb, not -skipCrc.

Another option might be to implement a blocksize-independent checksum, for use 
when block sizes differ.  Currently the file checksum works by taking the CRC32 
of every 512-byte chunk of the block, combining those CRCs with MD5 into a 
single checksum for the block, then combining the per-block checksums with MD5 
into a single checksum for the file.  The first combination is done at the 
Datanode (in DataXceiver#blockChecksum) and the second at the client (in 
DFSClient#getFileChecksum).  If instead the client could directly retrieve the 
list of CRC32s from the datanode, then it could combine them into a 
blocksize-independent checksum (so long as blockSize is a multiple of 
bytesPerChecksum and bytesPerChecksum is the same on both filesystems, which is 
ordinarily the case).  Op.java already includes a READ_METADATA operation, 
presumably intended to return the CRC32s to the client, but it is not 
implemented.  We'd probably want to extend the getFileChecksum API to permit 
specifying the type of checksum requested, whether MD5MD5CRC32 or MD5CRC32.  
This would be a significant effort, and it touches core bits of HDFS, so it 
should not be approached lightly.
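To make the block-size dependence concrete, here is a minimal Python sketch of the scheme described above. It is a simplified model, not the HDFS implementation (real HDFS hashes the on-disk checksum metadata, and the helper names here are hypothetical). It contrasts the current MD5-of-MD5s-of-CRC32s file checksum, which changes with block size, against combining the raw per-block CRC32 lists directly, which yields the same value for any block size that is a multiple of bytesPerChecksum:

```python
import binascii
import hashlib

BYTES_PER_CHECKSUM = 512  # HDFS default chunk size for CRCs

def chunk_crcs(data, bpc=BYTES_PER_CHECKSUM):
    # CRC32 of each 512-byte chunk, as 4-byte big-endian values.
    return [binascii.crc32(data[i:i + bpc]).to_bytes(4, "big")
            for i in range(0, len(data), bpc)]

def md5md5crc32(data, block_size, bpc=BYTES_PER_CHECKSUM):
    # Current scheme: MD5 over each block's chunk CRCs (done at the
    # Datanode), then MD5 over the per-block MD5s (done at the client).
    crcs = chunk_crcs(data, bpc)
    per_block = block_size // bpc
    block_md5s = [hashlib.md5(b"".join(crcs[i:i + per_block])).digest()
                  for i in range(0, len(crcs), per_block)]
    return hashlib.md5(b"".join(block_md5s)).hexdigest()

def md5_from_block_crc_lists(data, block_size, bpc=BYTES_PER_CHECKSUM):
    # Proposed alternative: the client retrieves each block's raw CRC32
    # list and combines them itself.  As long as block_size is a multiple
    # of bpc, the concatenated CRC list (and hence the MD5) is identical
    # regardless of how the file was split into blocks.
    crcs = []
    for i in range(0, len(data), block_size):
        crcs.extend(chunk_crcs(data[i:i + block_size], bpc))
    return hashlib.md5(b"".join(crcs)).hexdigest()

data = bytes(range(256)) * 1024  # 256 KiB of sample content

# Same bytes, different block sizes -> different MD5MD5CRC32 checksums,
# but identical MD5-over-flat-CRC-list checksums.
print(md5md5crc32(data, 64 * 1024), md5md5crc32(data, 128 * 1024))
print(md5_from_block_crc_lists(data, 64 * 1024),
      md5_from_block_crc_lists(data, 128 * 1024))
```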
                
> DistCp should skip checksum comparisons if block-sizes are different on 
> source/target.
> --------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-5065
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5065
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: distcp
>    Affects Versions: 2.0.3-alpha, 0.23.5
>            Reporter: Mithun Radhakrishnan
>            Assignee: Mithun Radhakrishnan
>         Attachments: MAPREDUCE-5065.branch23.patch, 
> MAPREDUCE-5065.branch2.patch
>
>
> When copying files between 2 clusters with different default block-sizes, one 
> sees that the copy fails with a checksum-mismatch, even though the files have 
> identical contents.
> The reason is that on HDFS, a file's checksum is unfortunately a function of 
> the file's block-size. So two files with identical contents (but different 
> block-sizes) can have different checksums. (Thus, it's also possible for 
> DistCp to fail to copy files on the same file-system, if the source-file's 
> block-size differs from the HDFS default and -pb isn't used.)
> I propose that we skip checksum comparisons under the following conditions:
> 1. -skipCrc is specified.
> 2. File-size is 0 (in which case the call to the checksum-servlet is moot).
> 3. source.getBlockSize() != target.getBlockSize(), since the checksums are 
> guaranteed to differ in this case.
> I have a patch for #3.
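
The three conditions quoted above amount to a one-line predicate. A minimal sketch in Python, with hypothetical names (DistCp itself is Java; this only illustrates the decision logic, not the actual patch):

```python
def should_skip_checksum(skip_crc, file_len, src_block_size, tgt_block_size):
    # 1. -skipCrc was specified, or
    # 2. the file is empty (nothing to checksum), or
    # 3. block sizes differ, so the MD5MD5CRC32 checksums are
    #    guaranteed to differ even for identical contents.
    return skip_crc or file_len == 0 or src_block_size != tgt_block_size

# Example: identical contents but mismatched block sizes -> skip.
print(should_skip_checksum(False, 1024, 64 * 1024 * 1024, 128 * 1024 * 1024))
```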

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
