[ https://issues.apache.org/jira/browse/HDFS-13056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16341933#comment-16341933 ]
Dennis Huo commented on HDFS-13056:
-----------------------------------

Thanks for sharing! I was also weighing whether to modify OpBlockChecksum in place, and I initially decided to add the new Op just to make it easy to verify at a glance that the old code path is 100% unchanged. Since CRCs are sensitive for existing deployments, folks may want to minimize the risk of breakage even if it costs a bit more boilerplate code. But I can also see the appeal of putting it in the existing OpBlockChecksum; either way we can add protections to preserve default behavior. I'm open to input if you or anyone else feels strongly about this.

When I was thinking about how to expose the options, it was difficult to come up with an entirely satisfactory approach, so it wasn't obvious whether it's better to have completely orthogonal "combine mode" and "data checksum type" options determine an implicit FileChecksum type, or to make the option top-level as in your prototype. Part of the reason I took the approach I did (without distinguishing between CRC32 and CRC32C in the client-side option) is that since CRC32C vs. CRC32 is a property of the underlying files, I wanted to preserve a way to specify a single option (i.e. dfs.checksum.combine.mode=COMPOSITE_CRC) that can return both CRC32C- and CRC32-based aggregates in a single call.
For example, while testing I had created several CRC32-based files alongside CRC32C-based files:

{code}
$ hadoop fs -Ddfs.checksum.combine.mode=COMPOSITE_CRC -checksum hdfs:///tmp/random-crctest*.dat
18/01/25 00:01:12 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.6.2-hadoop2
hdfs:///tmp/random-crctest-511bytes-chunksize.dat	COMPOSITE-CRC32C	4db86e2b00000000000000000000000000000000000000000000000000000000
hdfs:///tmp/random-crctest-gzipcrc32-2.dat	COMPOSITE-CRC32	721d687e00000000000000000000000000000000000000000000000000000000
hdfs:///tmp/random-crctest-gzipcrc32.dat	COMPOSITE-CRC32	721d687e00000000000000000000000000000000000000000000000000000000
hdfs:///tmp/random-crctest-legacy.dat	COMPOSITE-CRC32	721d687e00000000000000000000000000000000000000000000000000000000
hdfs:///tmp/random-crctest.dat	COMPOSITE-CRC32C	4db86e2b00000000000000000000000000000000000000000000000000000000
hdfs:///tmp/random-crctest2.dat	COMPOSITE-CRC32C	4db86e2b00000000000000000000000000000000000000000000000000000000
hdfs:///tmp/random-crctest3.dat	COMPOSITE-CRC32C	4db86e2b00000000000000000000000000000000000000000000000000000000
{code}

This mixture could happen if a distcp was run with the option to preserve checksum-type attributes but block sizes were still changed, making the files non-comparable under MD5MD5CRC; in such a listing, FileChecksum.getAlgorithmName() can still denote which pairs of checksums are directly comparable.

Taking a look at the CRC combine implementation, the matrix-based approach is interesting; it looks like it's basically precomputing the 32 polynomials [x^len, x^(len+1), ..., x^(len+31)] and then using the CRC value as a bitmask over them. My approach skips precomputing all 32 x^(len+i) polynomials and instead materializes only the single x^len monomial, because each successive x^(len+i+1) can be computed from x^(len+i) with just a shift-right and a conditional XOR.
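For concreteness, here is a rough Java sketch of that single-monomial composition, working on ordinary java.util.zip.CRC32 values in the reflected (LSB-first) bit order that zlib also uses. This is just an illustration, not the patch; the class and method names are made up:

```java
import java.util.zip.CRC32;

/**
 * A minimal sketch of single-monomial CRC composition:
 * crc(A||B) = crc(A) * x^(8*lenB) + crc(B), with arithmetic in GF(2)[x]
 * modulo the generator polynomial, in reflected (LSB-first) bit order.
 */
public class CrcComposeSketch {
    /** CRC-32 (gzip) generator polynomial, reflected bit order. */
    static final int CRC32_POLY = 0xEDB88320;

    /**
     * Multiply a polynomial by x mod the generator. In the reflected
     * representation this is just a shift-right plus a conditional XOR,
     * which is the cheap step discussed in the comment above.
     */
    static int timesX(int p, int poly) {
        return (p & 1) != 0 ? (p >>> 1) ^ poly : p >>> 1;
    }

    /**
     * Compose crcA (over some prefix) with crcB (over the lenB bytes that
     * follow), returning the CRC of the concatenation. Only the single
     * x^(8*lenB) monomial is kept; the x^(8*lenB + i) terms are produced
     * on the fly, one timesX() each, instead of being precomputed.
     */
    static int compose(int crcA, int crcB, long lenB, int poly) {
        int mono = 0x80000000;               // x^0 in reflected representation
        for (long i = 0; i < 8 * lenB; i++)  // naive O(lenB) walk for clarity;
            mono = timesX(mono, poly);       // real code would precompute powers
        int product = 0;
        for (int bit = 0; bit < 32; bit++) {
            if ((crcA & (0x80000000 >>> bit)) != 0)
                product ^= mono;             // crcA's x^bit term * x^(8*lenB)
            mono = timesX(mono, poly);
        }
        return product ^ crcB;
    }

    public static void main(String[] args) {
        byte[] a = "hello ".getBytes();
        byte[] b = "world".getBytes();
        CRC32 ca = new CRC32(); ca.update(a);
        CRC32 cb = new CRC32(); cb.update(b);
        CRC32 cab = new CRC32(); cab.update(a); cab.update(b);
        int composed = compose((int) ca.getValue(), (int) cb.getValue(),
                               b.length, CRC32_POLY);
        System.out.println(composed == (int) cab.getValue()); // prints: true
    }
}
```

Note that this works directly on the conditioned CRC32 values (pre/post XOR with 0xFFFFFFFF) because the init value equals the final XOR value, so the conditioning terms cancel under composition.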
A quick benchmark suggests that without the length-based precomputation, the matrix approach is ~32x slower (10,000,000 concats took ~230 seconds with the non-precomputed matrix method vs. ~7 seconds with my non-precomputed single-polynomial approach), which makes sense because it builds up 32 different polynomials instead of just 1. With length-based precomputation the theoretical number of operations is roughly the same, but my test showed the matrix approach about 2x slower, probably because a memory access is slower than ~3 primitive ops, even from cache. This might be worth simplifying in zlib too.

Unfortunately I've been busy the last couple of days, but I'm hoping to have some time to make a clean patch against trunk tomorrow.

> Expose file-level composite CRCs in HDFS which are comparable across different instances/layouts
> ------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-13056
>                 URL: https://issues.apache.org/jira/browse/HDFS-13056
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: datanode, distcp, erasure-coding, federation, hdfs
>    Affects Versions: 3.0.0
>            Reporter: Dennis Huo
>            Priority: Major
>         Attachments: HDFS-13056-branch-2.8.001.patch, HDFS-13056-branch-2.8.poc1.patch, Reference_only_zhen_PPOC_hadoop2.6.X.diff, hdfs-file-composite-crc32-v1.pdf
>
> FileChecksum was first introduced in [https://issues-test.apache.org/jira/browse/HADOOP-3981] and ever since then has remained defined as MD5-of-MD5-of-CRC, where per-512-byte chunk CRCs are already stored as part of datanode metadata, and the MD5 approach is used to compute an aggregate value in a distributed manner, with individual datanodes computing the MD5-of-CRCs per-block in parallel, and the HDFS client computing the second-level MD5.
>
> A shortcoming of this approach which is often brought up is that this FileChecksum is sensitive to the internal block-size and chunk-size configuration, so different HDFS files with different block/chunk settings cannot be compared. More commonly, one might have different HDFS clusters which use different block sizes, in which case any data migration won't be able to use the FileChecksum for distcp's rsync functionality or for verifying end-to-end data integrity (on top of low-level data integrity checks applied at data transfer time).
>
> This was also revisited in https://issues.apache.org/jira/browse/HDFS-8430 during the addition of checksum support for striped erasure-coded files; while there was some discussion of using CRC composability, it ultimately settled on the hierarchical MD5 approach, which also adds the problem that checksums of basic replicated files are not comparable to those of striped files.
>
> This feature proposes to add a "COMPOSITE-CRC" FileChecksum type which uses CRC composition to remain completely chunk/block agnostic, allowing comparison between striped and replicated files, between different HDFS instances, and possibly even between HDFS and other external storage systems. This feature can also be added in place to be compatible with existing block metadata, and doesn't need to change the normal path of chunk verification, so it is minimally invasive. This also means even large preexisting HDFS deployments could adopt this feature to retroactively sync data. A detailed design document can be found here: https://storage.googleapis.com/dennishuo/hdfs-file-composite-crc32-v1.pdf

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)