On Sat, Jan 15, 2011 at 10:19:23AM -0600, Bob Friesenhahn wrote:
> On Fri, 14 Jan 2011, Peter Taps wrote:
> 
> >Thank you for sharing the calculations. In lay terms, for Sha256,
> >how many blocks of data would be needed to have one collision?
> 
> Two.

Pretty funny.

In this thread some of you are treating SHA-256 as an idealized hash
function.  The odds of accidentally finding collisions in an idealized
256-bit hash function are minute because the distribution of hash
function outputs over inputs is random (or, rather, pseudo-random).
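As a back-of-the-envelope check (a sketch assuming an idealized hash, not anything ZFS-specific), the birthday bound puts the collision probability for n random blocks under a b-bit hash at roughly n^2 / 2^(b+1):

```python
def collision_probability(n_blocks: int, hash_bits: int = 256) -> float:
    # Birthday-bound approximation p ~= n^2 / 2^(b+1); valid while p << 1
    # and assuming the hash behaves like an idealized random function.
    return n_blocks * n_blocks / 2 ** (hash_bits + 1)

# ~2^48 unique 4K blocks is about a zettabyte of data:
print(collision_probability(2 ** 48))  # on the order of 1e-49
```

Which is to say: for an *idealized* 256-bit hash, accidental collisions really are a non-issue at any plausible pool size.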

But cryptographic hash functions are generally only approximations of
idealized hash functions.  There's nothing to say that there aren't
pathological corner cases where a given hash function produces lots of
collisions that would be semantically meaningful to people -- i.e., a
set of inputs over which the outputs are not randomly distributed.  Now,
of course, we don't know of such pathological corner cases for SHA-256,
but not that long ago we didn't know of any for SHA-1 or MD5 either.

The question of whether disabling verification would improve performance
is pretty simple: if you have highly deduplicatious, _synchronous_ (or
nearly so, due to frequent fsync()s or NFS close operations) writes, and
the "working set" does not fit in the ARC or L2ARC, then yes, disabling
verification will help significantly, by removing an average of at least
half a disk rotation from the write latency.  Or if you have the same
workload but with asynchronous writes that might as well be synchronous
due to an undersized cache (relative to the workload).  Otherwise the
cost of verification should be hidden by caching.
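To make that cost concrete, here is a toy content-addressed store (an illustration only, not ZFS internals): with verification on, a write that dedups against an existing block has to read the stored copy back and compare it byte-for-byte, and when that block is in neither cache, that read is the extra disk access described above.

```python
import hashlib

class DedupStore:
    """Toy dedup store (illustration only, not ZFS code)."""
    def __init__(self, verify: bool = True):
        self.blocks = {}   # digest -> stored block data
        self.reads = 0     # stand-in for disk reads on the write path
        self.verify = verify

    def write(self, data: bytes) -> bytes:
        d = hashlib.sha256(data).digest()
        if d in self.blocks:
            if self.verify:
                self.reads += 1                 # read back to compare
                assert self.blocks[d] == data   # byte-for-byte check
            return d                            # deduplicated write
        self.blocks[d] = data                   # new unique block
        return d

s = DedupStore(verify=True)
s.write(b"x" * 4096)
s.write(b"x" * 4096)
print(s.reads)  # 1 -- one extra read for the verified duplicate
```

If that stored block were already cached, the read-back would be free, which is the whole point about the cost being hidden by caching.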

Another way to put this would be that you should first determine that
verification is actually affecting performance, and only _then_ should
you consider disabling it.  But if you want to have the freedom to
disable verification, then you should be using SHA-256 (or switch to it
when disabling verification).

    Safety features that cost nothing are not worth turning off,
    so make sure their cost is significant before even thinking
    of turning them off.

Similarly, the cost of SHA-256 vs. Fletcher should also be lost in the
noise if the system has enough CPU, but if the choice of hash function
could make the system CPU-bound instead of I/O-bound, then the choice of
hash function would make an impact on performance.  The choice of hash
functions will have a different performance impact than verification: a
slower hash function will affect non-deduplicatious workloads more than
highly deduplicatious workloads (since the latter will require more I/O
for verification, which will overwhelm the cost of the hash function).
Again, measure first.
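One quick way to act on "measure first" for the hash-cost half of the question is to time candidate checksums on a representative block size.  The sketch below uses zlib.adler32 as a rough stand-in for Fletcher (Fletcher itself isn't in the Python standard library; both are cheap non-cryptographic checksums):

```python
import hashlib
import time
import zlib

def throughput_mb_s(func, block: bytes, iterations: int = 500) -> float:
    # Time `iterations` passes over one block and report MB/s.
    start = time.perf_counter()
    for _ in range(iterations):
        func(block)
    elapsed = time.perf_counter() - start
    return len(block) * iterations / elapsed / 1e6

block = b"\0" * 128 * 1024  # one 128K record
sha = throughput_mb_s(lambda b: hashlib.sha256(b).digest(), block)
adler = throughput_mb_s(zlib.adler32, block)  # stand-in for Fletcher
print(f"sha256: {sha:.0f} MB/s, adler32: {adler:.0f} MB/s")
```

The non-cryptographic checksum will generally be several times faster, but whether that matters is exactly the question of whether your system is CPU-bound or I/O-bound.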

Nico
-- 
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
