[ https://issues.apache.org/jira/browse/HDFS-13056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16341933#comment-16341933 ]
Dennis Huo commented on HDFS-13056:
-----------------------------------

Thanks for sharing! I was also weighing whether to modify OpBlockChecksum in place, and I initially decided to add the new Op just to make it easy to verify at a glance that the old code path is 100% unchanged. Since CRCs are sensitive for existing deployments, folks may want to minimize the risk of breakage even if it costs a bit more boilerplate code. But I can also see the appeal of putting it in the existing OpBlockChecksum; either way we can add protections to preserve default behavior. I'm open to input if you or anyone else feels strongly about this.

When I was thinking about how to expose the options, it was difficult to come up with an entirely satisfactory approach, so it wasn't obvious whether it's better to have completely orthogonal "combine mode" and "data checksum type" options determine an implicit FileChecksum type, or to make the option top-level as in your prototype. Part of the reason I took the approach I did (without distinguishing between CRC32 and CRC32C in the client-side option) is that since CRC32C vs. CRC32 is a property of the underlying files, I wanted to preserve a way to specify a single option (i.e. dfs.checksum.combine.mode=COMPOSITE_CRC) that can return both CRC32C- and CRC32-based aggregates in a single call.
For example, while testing I had created several CRC32-based files alongside CRC32C-based files:

{code}
$ hadoop fs -Ddfs.checksum.combine.mode=COMPOSITE_CRC -checksum hdfs:///tmp/random-crctest*.dat
18/01/25 00:01:12 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.6.2-hadoop2
hdfs:///tmp/random-crctest-511bytes-chunksize.dat	COMPOSITE-CRC32C	4db86e2b00000000000000000000000000000000000000000000000000000000
hdfs:///tmp/random-crctest-gzipcrc32-2.dat	COMPOSITE-CRC32	721d687e00000000000000000000000000000000000000000000000000000000
hdfs:///tmp/random-crctest-gzipcrc32.dat	COMPOSITE-CRC32	721d687e00000000000000000000000000000000000000000000000000000000
hdfs:///tmp/random-crctest-legacy.dat	COMPOSITE-CRC32	721d687e00000000000000000000000000000000000000000000000000000000
hdfs:///tmp/random-crctest.dat	COMPOSITE-CRC32C	4db86e2b00000000000000000000000000000000000000000000000000000000
hdfs:///tmp/random-crctest2.dat	COMPOSITE-CRC32C	4db86e2b00000000000000000000000000000000000000000000000000000000
hdfs:///tmp/random-crctest3.dat	COMPOSITE-CRC32C	4db86e2b00000000000000000000000000000000000000000000000000000000
{code}

This mixture could happen if a distcp was run with the option to preserve checksum-type attributes but block sizes were still changed, making the files non-comparable under MD5MD5CRC; in such a listing, FileChecksum.getAlgorithmName() can still denote which pairs of checksums are directly comparable.

Taking a look at the CRC combine implementation, the matrix-based approach is interesting; it looks like it's basically precomputing the 32 polynomials [x^len, x^(len+1), ..., x^(len+31)] and then using the CRC value as a bitmask over them. My approach skips precomputing all 32 x^(len+i) polynomials and instead materializes only the single x^len monomial, because each successive x^(len+i+1) can be computed from x^(len+i) with just a shift-right and a conditional XOR.
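For concreteness, here is a rough Java sketch of that single-monomial composition, working on ordinary java.util.zip.CRC32 values in the reflected (LSB-first) bit order that zlib also uses. This is just an illustration, not the patch; the class and method names are made up:

```java
import java.util.zip.CRC32;

/**
 * A minimal sketch of single-monomial CRC composition:
 * crc(A||B) = crc(A) * x^(8*lenB) + crc(B), with arithmetic in GF(2)[x]
 * modulo the generator polynomial, in reflected (LSB-first) bit order.
 */
public class CrcComposeSketch {
    /** CRC-32 (gzip) generator polynomial, reflected bit order. */
    static final int CRC32_POLY = 0xEDB88320;

    /**
     * Multiply a polynomial by x mod the generator. In the reflected
     * representation this is just a shift-right plus a conditional XOR,
     * which is the cheap step discussed in the comment above.
     */
    static int timesX(int p, int poly) {
        return (p & 1) != 0 ? (p >>> 1) ^ poly : p >>> 1;
    }

    /**
     * Compose crcA (over some prefix) with crcB (over the lenB bytes that
     * follow), returning the CRC of the concatenation. Only the single
     * x^(8*lenB) monomial is kept; the x^(8*lenB + i) terms are produced
     * on the fly, one timesX() each, instead of being precomputed.
     */
    static int compose(int crcA, int crcB, long lenB, int poly) {
        int mono = 0x80000000;               // x^0 in reflected representation
        for (long i = 0; i < 8 * lenB; i++)  // naive O(lenB) walk for clarity;
            mono = timesX(mono, poly);       // real code would precompute powers
        int product = 0;
        for (int bit = 0; bit < 32; bit++) {
            if ((crcA & (0x80000000 >>> bit)) != 0)
                product ^= mono;             // crcA's x^bit term * x^(8*lenB)
            mono = timesX(mono, poly);
        }
        return product ^ crcB;
    }

    public static void main(String[] args) {
        byte[] a = "hello ".getBytes();
        byte[] b = "world".getBytes();
        CRC32 ca = new CRC32(); ca.update(a);
        CRC32 cb = new CRC32(); cb.update(b);
        CRC32 cab = new CRC32(); cab.update(a); cab.update(b);
        int composed = compose((int) ca.getValue(), (int) cb.getValue(),
                               b.length, CRC32_POLY);
        System.out.println(composed == (int) cab.getValue()); // prints: true
    }
}
```

Note that this works directly on the conditioned CRC32 values (pre/post XOR with 0xFFFFFFFF) because the init value equals the final XOR value, so the conditioning terms cancel under composition.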
A quick benchmark suggests that without the length-based precomputation, the matrix approach is ~32x slower (10,000,000 concats took ~230 seconds with the non-precomputed matrix method vs. ~7 seconds with my non-precomputed single-polynomial approach), which makes sense because it builds up 32 different polynomials instead of just 1. With length-based precomputation the theoretical number of operations is roughly the same, but my test showed the matrix approach about 2x slower, probably because a memory access is slower than ~3 primitive ops, even from cache. This might be worth simplifying in zlib too.

Unfortunately I've been busy the last couple of days, but I'm hoping to have some time to make a clean patch against trunk tomorrow.

> Expose file-level composite CRCs in HDFS which are comparable across different instances/layouts
> ------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-13056
>                 URL: https://issues.apache.org/jira/browse/HDFS-13056
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: datanode, distcp, erasure-coding, federation, hdfs
>    Affects Versions: 3.0.0
>            Reporter: Dennis Huo
>            Priority: Major
>         Attachments: HDFS-13056-branch-2.8.001.patch, HDFS-13056-branch-2.8.poc1.patch, Reference_only_zhen_PPOC_hadoop2.6.X.diff, hdfs-file-composite-crc32-v1.pdf
>
> FileChecksum was first introduced in [https://issues-test.apache.org/jira/browse/HADOOP-3981] and ever since then has remained defined as MD5-of-MD5-of-CRC, where per-512-byte chunk CRCs are already stored as part of datanode metadata, and the MD5 approach is used to compute an aggregate value in a distributed manner, with individual datanodes computing the MD5-of-CRCs per-block in parallel, and the HDFS client computing the second-level MD5.
>
> A shortcoming of this approach which is often brought up is that this FileChecksum is sensitive to the internal block-size and chunk-size configuration, so different HDFS files with different block/chunk settings cannot be compared. More commonly, one might have different HDFS clusters which use different block sizes, in which case any data migration won't be able to use the FileChecksum for distcp's rsync functionality or for verifying end-to-end data integrity (on top of low-level data integrity checks applied at data transfer time).
>
> This was also revisited in https://issues.apache.org/jira/browse/HDFS-8430 during the addition of checksum support for striped erasure-coded files; while there was some discussion of using CRC composability, it ultimately settled on the hierarchical MD5 approach, which also adds the problem that checksums of basic replicated files are not comparable to those of striped files.
>
> This feature proposes to add a "COMPOSITE-CRC" FileChecksum type which uses CRC composition to remain completely chunk/block agnostic, allowing comparison between striped and replicated files, between different HDFS instances, and possibly even between HDFS and other external storage systems. This feature can also be added in place to be compatible with existing block metadata, and doesn't need to change the normal path of chunk verification, so it is minimally invasive. This also means even large preexisting HDFS deployments could adopt this feature to retroactively sync data. A detailed design document can be found here: https://storage.googleapis.com/dennishuo/hdfs-file-composite-crc32-v1.pdf

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)