[ https://issues.apache.org/jira/browse/HDFS-13056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16381538#comment-16381538 ]

Ajay Kumar commented on HDFS-13056:
-----------------------------------

[~dennishuo], Thanks for working on this. I tested patch v3 on my local machine 
and all results were as expected. My testing was mainly around 
ReplicatedFileChecksums. With this patch I was also able to validate that CRC32 file 
checksums match between HDFS and the local file system (on a Mac). This is a great 
improvement over the existing state.
{code:java}
bin/hdfs dfs -Ddfs.checksum.type=CRC32 -put hadoop-3.0.1-src.tar.gz /tmp/
bin/hdfs dfs -Ddfs.checksum.type=CRC32 
-Ddfs.checksum.combine.mode=COMPOSITE_CRC -checksum /tmp/hadoop-3.0.1-src.tar.gz
/tmp/hadoop-3.0.1-src.tar.gz COMPOSITE-CRC32 027f5281
crc32 hadoop-3.0.1-src.tar.gz
027f5281
bin/hdfs dfs -Ddfs.checksum.type=CRC32 -put CentOS-7.0-amd64-gui.ova /tmp/
bin/hdfs dfs -Ddfs.checksum.type=CRC32 
-Ddfs.checksum.combine.mode=COMPOSITE_CRC -checksum 
/tmp/CentOS-7.0-amd64-gui.ova
/tmp/CentOS-7.0-amd64-gui.ova COMPOSITE-CRC32 b76339af
crc32 CentOS-7.0-amd64-gui.ova
b76339af
------------------------------------
bin/hdfs dfs -Ddfs.bytes-per-checksum=2048 -Ddfs.blocksize=67108864 
-Ddfs.checksum.type=CRC32 -put README.txt /tmp2/
bin/hdfs dfs -Ddfs.checksum.type=CRC32 
-Ddfs.checksum.combine.mode=COMPOSITE_CRC -checksum /tmp2/README.txt
/tmp2/README.txt COMPOSITE-CRC32 72e7cbce
crc32 README.txt
72e7cbce
bin/hdfs dfs -Ddfs.bytes-per-checksum=2048 -Ddfs.blocksize=67108864 
-Ddfs.checksum.type=CRC32 -put ~/Downloads/hadoop-3.0.1-src.tar.gz /tmp2/
bin/hdfs dfs -Ddfs.checksum.type=CRC32 
-Ddfs.checksum.combine.mode=COMPOSITE_CRC -checksum 
/tmp2/hadoop-3.0.1-src.tar.gz
/tmp2/hadoop-3.0.1-src.tar.gz COMPOSITE-CRC32 027f5281
crc32 hadoop-3.0.1-src.tar.gz
027f5281
bin/hdfs dfs -Ddfs.bytes-per-checksum=1024 -Ddfs.blocksize=67108864 
-Ddfs.checksum.type=CRC32 -put CentOS-7.0-amd64-gui.ova /tmp2/
bin/hdfs dfs -Ddfs.checksum.type=CRC32 
-Ddfs.checksum.combine.mode=COMPOSITE_CRC -checksum 
/tmp2/CentOS-7.0-amd64-gui.ova
/tmp2/CentOS-7.0-amd64-gui.ova COMPOSITE-CRC32 b76339af
crc32 CentOS-7.0-amd64-gui.ova
b76339af
{code}
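As a side note for anyone reading along: the reason these checksums come out layout-agnostic is that CRC32 is composable, i.e. the CRC of a concatenation can be derived from the per-part CRCs plus the second part's length. A rough illustrative sketch (a straight port of zlib's crc32_combine, not the patch's actual {{CrcUtil}} code), self-checked against {{java.util.zip.CRC32}}:
{code:java}
import java.util.zip.CRC32;

/** Illustrative CRC32 composition (port of zlib's crc32_combine); not HDFS code. */
public class CrcCombineDemo {
  private static final int GF2_DIM = 32; // dimension of the GF(2) vector space

  // Multiply a 32x32 GF(2) matrix by a 32-bit vector (XOR of selected rows).
  private static long gf2MatrixTimes(long[] mat, long vec) {
    long sum = 0;
    int i = 0;
    while (vec != 0) {
      if ((vec & 1) != 0) sum ^= mat[i];
      vec >>>= 1;
      i++;
    }
    return sum;
  }

  // square = mat * mat over GF(2).
  private static void gf2MatrixSquare(long[] square, long[] mat) {
    for (int n = 0; n < GF2_DIM; n++) {
      square[n] = gf2MatrixTimes(mat, mat[n]);
    }
  }

  /** Combine crc1 (of part A) and crc2 (of part B, len2 bytes) into crc(A||B). */
  public static long crc32Combine(long crc1, long crc2, long len2) {
    if (len2 <= 0) return crc1; // degenerate case

    long[] even = new long[GF2_DIM];
    long[] odd = new long[GF2_DIM];

    odd[0] = 0xedb88320L; // reflected CRC-32 polynomial: operator for one zero bit
    long row = 1;
    for (int n = 1; n < GF2_DIM; n++) {
      odd[n] = row;
      row <<= 1;
    }
    gf2MatrixSquare(even, odd); // operator for two zero bits
    gf2MatrixSquare(odd, even); // operator for four zero bits

    // Append len2 zero bytes to crc1's message, one bit of len2 at a time.
    do {
      gf2MatrixSquare(even, odd); // first pass: operator for one zero byte
      if ((len2 & 1) != 0) crc1 = gf2MatrixTimes(even, crc1);
      len2 >>= 1;
      if (len2 == 0) break;
      gf2MatrixSquare(odd, even);
      if ((len2 & 1) != 0) crc1 = gf2MatrixTimes(odd, crc1);
      len2 >>= 1;
    } while (len2 != 0);

    return (crc1 ^ crc2) & 0xffffffffL;
  }

  static long crcOf(byte[] data) {
    CRC32 c = new CRC32();
    c.update(data);
    return c.getValue();
  }

  public static void main(String[] args) {
    byte[] b = "world".getBytes();
    long combined = crc32Combine(crcOf("hello ".getBytes()), crcOf(b), b.length);
    // Composition must match the CRC computed directly over the concatenation.
    if (combined != crcOf("hello world".getBytes())) throw new AssertionError();
  }
}
{code}
Since composition only needs per-part CRCs and lengths, the result is independent of where the block and chunk boundaries fall, which is exactly what the tests above show.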
A few comments:
 * Wrap the calls to {{CrcUtil.newSingleCrcWrapperFromByteArray}} in 
{{FileCheckSumHelper.StripedFileNonStripedChecksumComputer#tryDataNode}} (L714), 
{{FileCheckSumHelper.ReplicatedFileChecksumComputer}} (L522) and 
{{FileCheckSumHelper.BlockGroupNonStripedChecksumComputer#compute}} with 
{{LOG.isDebugEnabled()}}, since {{blockChecksumForDebug}} seems to be used only for 
debugging.
 * The audience for some of the classes may include Yarn and Common as well, e.g. 
{{@InterfaceAudience.LimitedPrivate(value = \{"Common", "HDFS", "MapReduce", 
"Yarn"})}}.
 * Refactor {{BlockGroupNonStripedChecksumComputer#compute}} to move the new 
functionality into a separate function.
 * Rename the function parameter {{BlockChecksumOptions}} to {{blockChecksumType}} 
in DataTransferProtocol#blockChecksum, 
DataTransferProtocol#blockGroupChecksum, and 
BlockGroupNonStripedChecksumComputer#BlockGroupNonStripedChecksumComputer.
 * {{CrcComposer#digest}} never throws {{IOException}}, so that declaration can be dropped.
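For the first point, the guard would look roughly like the following. This is a hypothetical sketch with placeholder names ({{describeChecksum}}, {{reportChecksum}}), using {{java.util.logging}} as a stand-in for Hadoop's slf4j {{LOG}}; the actual FileCheckSumHelper code differs:
{code:java}
import java.util.logging.Level;
import java.util.logging.Logger;

/** Illustrative debug-guard pattern; names are placeholders, not FileCheckSumHelper's. */
public class DebugGuardSketch {
  private static final Logger LOG = Logger.getLogger(DebugGuardSketch.class.getName());

  // Hex-encode a checksum byte array (stand-in for newSingleCrcWrapperFromByteArray).
  static String describeChecksum(byte[] crcBytes) {
    StringBuilder sb = new StringBuilder();
    for (byte b : crcBytes) {
      sb.append(String.format("%02x", b & 0xff));
    }
    return sb.toString();
  }

  static void reportChecksum(byte[] crcBytes) {
    // Build the debug-only representation only when debug logging is enabled,
    // so the wrapping/formatting cost stays off the hot path.
    if (LOG.isLoggable(Level.FINE)) {
      LOG.fine("block checksum: " + describeChecksum(crcBytes));
    }
  }
}
{code}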

> Expose file-level composite CRCs in HDFS which are comparable across 
> different instances/layouts
> ------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-13056
>                 URL: https://issues.apache.org/jira/browse/HDFS-13056
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: datanode, distcp, erasure-coding, federation, hdfs
>    Affects Versions: 3.0.0
>            Reporter: Dennis Huo
>            Priority: Major
>         Attachments: HDFS-13056-branch-2.8.001.patch, 
> HDFS-13056-branch-2.8.poc1.patch, HDFS-13056.001.patch, HDFS-13056.002.patch, 
> HDFS-13056.003.patch, HDFS-13056.003.patch, 
> Reference_only_zhen_PPOC_hadoop2.6.X.diff, hdfs-file-composite-crc32-v1.pdf, 
> hdfs-file-composite-crc32-v2.pdf, hdfs-file-composite-crc32-v3.pdf
>
>
> FileChecksum was first introduced in 
> [https://issues-test.apache.org/jira/browse/HADOOP-3981] and ever since then 
> has remained defined as MD5-of-MD5-of-CRC, where per-512-byte chunk CRCs are 
> already stored as part of datanode metadata, and the MD5 approach is used to 
> compute an aggregate value in a distributed manner, with individual datanodes 
> computing the MD5-of-CRCs per-block in parallel, and the HDFS client 
> computing the second-level MD5.
>  
> A shortcoming of this approach which is often brought up is the fact that 
> this FileChecksum is sensitive to the internal block-size and chunk-size 
> configuration, and thus different HDFS files with different block/chunk 
> settings cannot be compared. More commonly, one might have different HDFS 
> clusters which use different block sizes, in which case any data migration 
> won't be able to use the FileChecksum for distcp's rsync functionality or for 
> verifying end-to-end data integrity (on top of low-level data integrity 
> checks applied at data transfer time).
>  
> This was also revisited in https://issues.apache.org/jira/browse/HDFS-8430 
> during the addition of checksum support for striped erasure-coded files; 
> while there was some discussion of using CRC composability, it still 
> ultimately settled on hierarchical MD5 approach, which also adds the problem 
> that checksums of basic replicated files are not comparable to striped files.
>  
> This feature proposes to add a "COMPOSITE-CRC" FileChecksum type which uses 
> CRC composition to remain completely chunk/block agnostic, and allows 
> comparison between striped vs replicated files, between different HDFS 
> instances, and possibly even between HDFS and other external storage systems. 
> This feature can also be added in-place to be compatible with existing block 
> metadata, and doesn't need to change the normal path of chunk verification, 
> so is minimally invasive. This also means even large preexisting HDFS 
> deployments could adopt this feature to retroactively sync data. A detailed 
> design document can be found here: 
> https://storage.googleapis.com/dennishuo/hdfs-file-composite-crc32-v1.pdf



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
