On Thu, Jan 06, 2011 at 11:44:31AM -0800, Peter Taps wrote:
> I have been told that the checksum value returned by Sha256 is almost
> guaranteed to be unique.
All hash functions are guaranteed to have collisions [for inputs larger
than their output, anyway].

> In fact, if Sha256 fails in some case, we have a bigger problem such
> as memory corruption, etc. Essentially, adding verification to sha256
> is an overkill.

What makes a hash function cryptographically secure is not the
impossibility of collisions, but the computational difficulty of
finding arbitrary colliding input pairs, collisions for known inputs,
second pre-images, and first pre-images.  Just because you can't easily
find collisions on purpose doesn't mean that you can't accidentally
find collisions.

That said, if the distribution of SHA-256 is even enough, then your
chances of finding a collision by accident are so remote (on the order
of one in 2^128; see the back-of-the-envelope arithmetic in the P.S.
below) that you could reasonably decide that you don't care.

> Perhaps (Sha256+NoVerification) would work 99.999999% of the time.
> But (Fletcher+Verification) would work 100% of the time.

Fletcher is faster than SHA-256, so I think that must be what you're
asking about: "can Fletcher+Verification be faster than
Sha256+NoVerification?"  Or do you have some other goal?

Assuming I guessed correctly...  The speed of the hash function isn't
significant compared to the cost of the verification I/O, period, end
of story.  So, SHA-256 w/o verification will be faster than Fletcher +
verification -- lots faster if you have particularly deduplicatious
data to write.

Moreover, SHA-256 + verification will likely be somewhat faster than
Fletcher + verification, because SHA-256 will likely have fewer
collisions than Fletcher, and the cost of I/O dominates the cost of
the hash functions.

> Which one of the two is a better deduplication strategy?
>
> If we do not use verification with Sha256, what is the worst case
> scenario? Is it just more disk space occupied (because of failure to
> detect duplicate blocks) or there is a chance of actual data
> corruption (because two blocks were assumed to be duplicate although
> they are not)?

If you don't verify, then you run the risk of corruption on collision,
NOT the risk of using too much disk space.

> Or, if I go with (Sha256+Verification), how much is the overhead of
> verification on the overall process?
>
> If I do go with verification, it seems (Fletcher+Verification) is
> more efficient than (Sha256+Verification). And both are 100% accurate
> in detecting duplicate blocks.

You're confused.  Fletcher may be faster to compute than SHA-256, but
the run-time of both is as nothing compared to the latency of the disk
I/O needed for verification, which means that the hash function's rate
of collisions matters more than its computational cost.

(Now, Fletcher is not thought to be a cryptographically secure hash
function, while SHA-256 is, for now, considered cryptographically
secure.  That probably means that the distribution of Fletcher's
outputs over random inputs is not as even as that of SHA-256, which
probably means you can expect more collisions with Fletcher than with
SHA-256.  Note that I made no absolute statements in the previous
sentence -- that's because I've not read any studies of Fletcher's
collision behavior relative to SHA-256's, thus I'm not certain of
anything stated in the previous sentence.)

David Magda's advice is spot on.

Nico
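P.S.  To put a rough number on "so remote": here is a minimal
back-of-the-envelope sketch in Python, assuming SHA-256 behaves like an
ideal 256-bit hash and using the standard birthday approximation
p ~= n^2 / 2^(b+1).  The block counts are made-up example pool sizes,
not measurements of any real pool.

#!/usr/bin/env python3
# Birthday-bound estimate of accidental collisions in a dedup table.
# Assumes an *ideal* 256-bit hash; SHA-256 is believed to be close to
# that, Fletcher is not.  Purely illustrative.
from math import log2

DIGEST_BITS = 256   # SHA-256 output size

def log2_collision_prob(n_blocks, digest_bits=DIGEST_BITS):
    # log2 of the approximate probability that at least one pair of
    # n_blocks distinct blocks collides: p ~= n^2 / 2^(digest_bits + 1).
    return 2 * log2(n_blocks) - (digest_bits + 1)

# Hypothetical pool sizes, in unique blocks (e.g. 2^35 128K blocks is
# about 4 PiB of unique data).
for n in (2**30, 2**35, 2**40):
    print("2^%d blocks: P(collision) ~= 2^%d"
          % (log2(n), log2_collision_prob(n)))

Even at 2^40 unique blocks that comes out around 2^-177, and under that
assumption you'd need on the order of 2^64 unique blocks before the
odds climbed into the one-in-2^128 ballpark mentioned above.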