[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13603493#comment-13603493
 ] 

Kihwal Lee commented on MAPREDUCE-5065:
---------------------------------------

bq. Another option might be to implement a checksum that's 
blocksize-independent...

Reading the whole metadata may be too much, especially for huge files. It will 
be better if we make the computation happen where the data is. :)
 
Most hashing is incremental, so if DFSClient feeds the last state of the hash 
into the next datanode and lets it continue updating it, the result will be 
independent of block size. The current way of doing file checksum allows 
calculating individual block checksums in parallel, but we are not taking 
advantage of it in DFSClient anyway. So I don't think there will be any 
significant changes in performance or overhead.
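
To illustrate the idea, here is a minimal sketch in Python rather than actual 
HDFS code. It assumes CRC32 from zlib as a stand-in for the incremental hash, 
and the helper names (split_blocks, chained_checksum, per_block_checksum) are 
hypothetical. The second function is only a simplified hash-of-block-hashes 
composition, not the real HDFS file checksum; it is there to show why that 
style of composition changes with block size while a chained state does not.

```python
import hashlib
import zlib


def split_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut data into consecutive blocks of at most block_size bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def chained_checksum(data: bytes, block_size: int) -> int:
    """Carry the running CRC32 state from one block to the next.

    Each loop iteration stands in for one datanode that receives the
    previous state, updates it with its own block, and hands it back.
    """
    state = 0
    for block in split_blocks(data, block_size):
        state = zlib.crc32(block, state)
    return state


def per_block_checksum(data: bytes, block_size: int) -> str:
    """Hash each block independently, then hash the block digests.

    The block digests could be computed in parallel, but the final
    value depends on where the block boundaries fall.
    """
    outer = hashlib.md5()
    for block in split_blocks(data, block_size):
        outer.update(hashlib.md5(block).digest())
    return outer.hexdigest()


data = bytes(range(256)) * 64  # 16 KiB of sample content

# Chained state: same result for any block size, equal to one full pass.
print(chained_checksum(data, 1024) == chained_checksum(data, 4096))
print(chained_checksum(data, 1024) == zlib.crc32(data))

# Hash of block hashes: identical content, different block sizes, different result.
print(per_block_checksum(data, 1024) == per_block_checksum(data, 4096))
```

The trade-off is the one mentioned above: the chained version has to visit 
the blocks in order, so it gives up the option of computing block checksums 
in parallel.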

We should probably continue this discussion in a separate jira.
                
> DistCp should skip checksum comparisons if block-sizes are different on 
> source/target.
> --------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-5065
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5065
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: distcp
>    Affects Versions: 2.0.3-alpha, 0.23.5
>            Reporter: Mithun Radhakrishnan
>            Assignee: Mithun Radhakrishnan
>
> When copying files between 2 clusters with different default block-sizes, one 
> sees that the copy fails with a checksum-mismatch, even though the files have 
> identical contents.
> The reason is that on HDFS, a file's checksum is unfortunately a function of 
> the block-size of the file. So you could have 2 different files with 
> identical contents (but different block-sizes) have different checksums. 
> (Thus, it's also possible for DistCp to fail to copy files on the same 
> file-system, if the source-file's block-size differs from HDFS default, and 
> -pb isn't used.)
> I propose that we skip checksum comparisons under the following conditions:
> 1. -skipCrc is specified.
> 2. File-size is 0 (in which case the call to the checksum-servlet is moot).
> 3. source.getBlockSize() != target.getBlockSize(), since the checksums are 
> guaranteed to differ in this case.
> I have a patch for #3.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
