> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Peter Taps
> 
> Perhaps (Sha256+NoVerification) would work 99.999999% of the time. But

Append 50 more 9's on there. 
99.99999999999999999999999999999999999999999999999999999999%

See below.


> I have been told that the checksum value returned by Sha256 is almost
> guaranteed to be unique. In fact, if Sha256 fails in some case, we have a
> bigger problem such as memory corruption, etc. Essentially, adding
> verification to sha256 is an overkill.

Someone please correct me if I'm wrong: I assume ZFS dedup matches both the
block size and the checksum, right?  A simple checksum collision (which is
astronomically unlikely) is still not sufficient to produce corrupted data;
it's even more unlikely than that.

Using the above assumption, here's how you calculate the probability of
corruption if you're not using verification:

Suppose every single block in your whole pool is exactly the same size
(unrealistic in the real world, but I'm trying to calculate the worst case).
Suppose the block size is 4K, which is, again, an unrealistic worst case.
Suppose your dataset is purely random or sequential, with no duplicated
data, which is unrealistic because if your data is like that, why in the
world are you enabling dedup?  But again, assuming the worst-case
scenario ...  At this point we'll throw in some evil clowns, spit on a voodoo
priestess, and curse the heavens for some extra bad luck.

If you had an infinite quantity of data, the probability of corruption would
approach 100%: with infinite data, you're eventually guaranteed a collision.
So the probability of corruption scales directly with the total amount of
data you have, and the real question is: for anything Earthly, how close are
you to a 0% probability of collision in practice?
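
If you want to see how that scales, here's a little Python sketch (mine, not
anything from ZFS; it assumes 4K blocks everywhere and a 256-bit hash) of the
standard birthday-bound estimate for a few pool sizes:

# Birthday-bound estimate of P(any SHA-256 collision), assuming every
# block in the pool is 4K.  Pool sizes below are illustrative only.

def collision_probability(pool_bytes, block_bytes=4 * 2**10, hash_bits=256):
    n = pool_bytes // block_bytes       # number of unique blocks
    pairs = n * (n - 1) / 2             # number of distinct block pairs
    return pairs / 2.0**hash_bits       # union bound on P(any collision)

for label, size in [("1 TB", 2**40), ("128 TB", 128 * 2**40), ("1 EB", 2**60)]:
    print(f"{label:>7}: P(collision) <= {collision_probability(size):.1e}")

Since the bound grows with the square of the number of blocks, quadrupling
the amount of data only raises the collision probability by a factor of 16;
you stay dozens of orders of magnitude away from anything you'd ever notice.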

Suppose you have 128TB of data.  At 4K per block, that is 2^35 unique,
uniformly sized blocks.  The probability of any collision in your whole
dataset is then at most

  (sum of the integers 1 thru 2^35) * 2^-256

Note: the sum of the integers from 1 to N is (N*(N+1))/2
Note: 2^35 * (2^35 + 1) = 2^70 + 2^35, so (N*(N+1))/2 = 2^69 + 2^34

So the probability of data corruption in this case is 2^-187 + 2^-222
~= 5.1E-57 + 1.5E-67
~= 5.1E-57
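
For anyone who wants to check the arithmetic, here's the same calculation in
a few lines of Python (a sketch under the same assumptions: 2^35 uniformly
sized 4K blocks, 256-bit hash):

# Reproducing the numbers above: sum(1..2^35) * 2^-256 = 2^-187 + 2^-222.

from math import log2

n = 2**35                                   # 128 TB / 4K = number of blocks
p = (n * (n + 1) // 2) * 2.0**-256          # sum of integers 1..n, times 2^-256
print(f"P(collision) ~= {p:.2e}")           # ~5.1e-57
print(f"2^-187 + 2^-222 ~= {2.0**-187 + 2.0**-222:.2e}")   # same value
print(f"log2(P) ~= {log2(p):.1f}")          # ~ -187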

In other words, that is the absolute worst case: cursing the gods, running
without verification, using data specifically formulated to try to cause a
collision, on a dataset that I bet is larger than what you're doing ...

Before we go any further ... The total number of bits stored on all the
storage in the whole planet is a lot smaller than the total number of
molecules in the planet.

There are an estimated 8.87 * 10^49 molecules in planet Earth.

The probability of a collision in your worst-case, unrealistic dataset as
described is still roughly two million times smaller than the probability of
randomly finding a single specific molecule in the whole planet Earth by pure
luck.
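
To put a number on that comparison (again just a sketch; the 8.87 * 10^49
figure is the estimate quoted above):

# How much less likely is the collision than hitting one specific molecule?

p_collision = 5.1e-57            # worst-case collision probability from above
p_molecule  = 1 / 8.87e49        # picking one specific molecule by pure luck

print(f"p_molecule / p_collision ~= {p_molecule / p_collision:.1e}")  # ~2.2e6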


