On Thu, Jan 06, 2011 at 11:44:31AM -0800, Peter Taps wrote:
> I have been told that the checksum value returned by Sha256 is almost
> guaranteed to be unique.

All hash functions are guaranteed to have collisions [for inputs longer
than their output, anyway]; that's just the pigeonhole principle.

>                          In fact, if Sha256 fails in some case, we
> have a bigger problem such as memory corruption, etc. Essentially,
> adding verification to sha256 is an overkill.

What makes a hash function cryptographically secure is not the
impossibility of collisions, but the computational difficulty of finding
colliding input pairs (collisions), a second input that collides with a
known input (second pre-images), and an input that produces a given
output (first pre-images).  Just because you can't easily find
collisions on purpose doesn't mean that you can't find one by accident.

That said, if the distribution of SHA-256's outputs is even enough, then
your chances of hitting a collision by accident are so remote (by the
birthday bound you'd need on the order of 2^128 blocks before a
collision becomes likely) that you could reasonably decide that you
don't care.
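To put a number on "remote", here's a rough back-of-envelope sketch,
assuming SHA-256's outputs really are uniformly distributed (the pool
size and block size below are made-up figures, just for illustration):

    # Birthday-bound estimate: probability of at least one accidental
    # SHA-256 collision among n distinct blocks, assuming outputs are
    # uniform over 2^256 values (p ~= n*(n-1)/2 / 2^256).
    from math import log2

    def collision_probability(n_blocks):
        # Approximation; fine while the result is tiny.
        return n_blocks * (n_blocks - 1) / 2 / 2**256

    # Hypothetical pool: 1 PB of 128 KiB blocks, i.e. about 2^33 blocks.
    n = (1 << 50) // (128 * 1024)
    p = collision_probability(n)
    print(f"blocks: 2^{log2(n):.0f}  collision probability: ~2^{log2(p):.0f}")
    # -> about 2^-191

In other words, on any realistic pool the accidental-collision risk is
dwarfed by ordinary hardware failure modes.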

> Perhaps (Sha256+NoVerification) would work 99.999999% of the time. But
> (Fletcher+Verification) would work 100% of the time.

Fletcher is faster than SHA-256, so I think that must be what you're
asking about: "can Fletcher+Verification be faster than
Sha256+NoVerification?"  Or do you have some other goal?

Assuming I guessed correctly...  The speed of the hash function isn't
significant compared to the cost of the verification I/O, period, end of
story.  So, SHA-256 w/o verification will be faster than Fletcher +
Verification -- lots faster if you have particularly deduplicatious data
to write.  Moreover, SHA-256 + verification will likely be somewhat
faster than Fletcher + verification because SHA-256 will likely have
fewer collisions than Fletcher, and the cost of I/O dominates the cost
of the hash functions.
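To make that concrete, here's a toy comparison with made-up (but, I
think, plausible) throughput and latency figures; none of these numbers
are measurements, they're just there to show the orders of magnitude:

    BLOCK = 128 * 1024                    # 128 KiB record

    # Assumed, illustrative figures -- substitute your own measurements.
    sha256_rate   = 200e6                 # ~200 MB/s hashing throughput
    fletcher_rate = 2e9                   # ~2 GB/s checksum throughput
    verify_read   = 8e-3                  # ~8 ms random read to verify

    print(f"SHA-256:      {BLOCK / sha256_rate * 1e3:6.3f} ms/block")
    print(f"Fletcher:     {BLOCK / fletcher_rate * 1e3:6.3f} ms/block")
    print(f"Verify read:  {verify_read * 1e3:6.3f} ms/block")
    # The verification read is an order of magnitude (or more) slower
    # than even the "slow" hash, so hash speed is noise by comparison.

Under those assumptions: SHA-256 ~0.66 ms, Fletcher ~0.07 ms, and the
verification read ~8 ms; the I/O dominates either hash.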

> Which one of the two is a better deduplication strategy?
> 
> If we do not use verification with Sha256, what is the worst case
> scenario? Is it just more disk space occupied (because of failure to
> detect duplicate blocks) or there is a chance of actual data
> corruption (because two blocks were assumed to be duplicate although
> they are not)?

If you don't verify then you run the risk of corruption on collision,
NOT the risk of using too much disk space.
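A toy sketch (this is not ZFS code, just a hypothetical dedup write
path) of where that corruption comes from:

    import hashlib

    def dedup_write(store, block, verify):
        # 'store' maps checksum -> block data already on disk.
        key = hashlib.sha256(block).digest()
        existing = store.get(key)
        if existing is None:
            store[key] = block          # first copy: actually write it
            return "written"
        if verify and existing != block:
            # Checksum matched but the data differs: a collision.
            # Verification catches it and the new block is written anyway.
            return "collision detected, written separately"
        # Without verification, a colliding block is silently turned into
        # a reference to the existing (different!) data.  The new data is
        # never written: that's corruption, not just wasted space.
        return "referenced existing block"

With verify off, a collision means the "duplicate" writer's data is
silently lost; with verify on, the worst case is one extra read.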

> Or, if I go with (Sha256+Verification), how much is the overhead of
> verification on the overall process?
> 
> If I do go with verification, it seems (Fletcher+Verification) is more
> efficient than (Sha256+Verification). And both are 100% accurate in
> detecting duplicate blocks.

You're confused.  Fletcher may be faster to compute than SHA-256, but
the run-time of both is nothing compared to the latency of the disk I/O
needed for verification, which means that the hash function's rate of
collisions matters more than its computational cost.

(Now, Fletcher is thought not to be a cryptographically secure hash
function, while SHA-256 is, for now, considered cryptographically
secure.  That probably means that the distribution of Fletcher's outputs
over random inputs is not as even as that of SHA-256, which probably
means you can expect more collisions with Fletcher than with SHA-256.
Note that I made no absolute statements in the previous sentence --
that's because I've not read any studies of Fletcher's performance
relative to SHA-256, thus I'm not certain of anything stated in the
previous sentence.)
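One thing that is easy to see, though: constructing a Fletcher collision
on purpose is trivial, which is part of what "not cryptographically
secure" means.  A toy illustration using Fletcher-16 (ZFS actually uses
the fletcher2/fletcher4 variants over 32/64-bit words, so this is only
illustrative of the style of checksum):

    import hashlib

    def fletcher16(data):
        # Toy Fletcher-16: two running sums, both modulo 255.
        s1 = s2 = 0
        for byte in data:
            s1 = (s1 + byte) % 255
            s2 = (s2 + s1) % 255
        return (s2 << 8) | s1

    # 0xFF is congruent to 0x00 modulo 255, so these two different
    # blocks share a Fletcher-16 checksum -- a collision by inspection.
    a = bytes([0x00, 0x01, 0x02])
    b = bytes([0xFF, 0x01, 0x02])
    print(fletcher16(a) == fletcher16(b))              # True
    print(hashlib.sha256(a).hexdigest() ==
          hashlib.sha256(b).hexdigest())               # False

Nobody knows how to do the equivalent for SHA-256.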

David Magda's advice is spot on.

Nico
-- 