Ed, with all due respect to your math,

I've seen rsync bomb due to an SHA256 collision, so I know it can and does 
happen.

I respect my data, so even when the checksum and the block size both match, 
I'll still do a byte-for-byte comparison.  Skip that check and a collision 
gives you silent data corruption, which could affect you in so many ways.
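
As a rough sketch of that kind of check: the toy in-memory store below 
treats a block as a duplicate only if the checksum, the size, and the bytes 
themselves all match.  This is not how ZFS or rsync implement it; the 
DedupStore class and the deliberately weak checksum used to force a 
collision are made up for illustration.

```python
import hashlib


class DedupStore:
    """Toy in-memory block store (hypothetical, for illustration only)."""

    def __init__(self, hash_fn=None):
        # Default to SHA256; a weaker function can be injected to force collisions.
        self._hash = hash_fn or (lambda b: hashlib.sha256(b).digest())
        self._buckets = {}  # (digest, length) -> list of distinct blocks

    def put(self, block):
        """Store a block and return a reference to it."""
        key = (self._hash(block), len(block))
        bucket = self._buckets.setdefault(key, [])
        for index, stored in enumerate(bucket):
            if stored == block:          # the byte-for-byte comparison check
                return key, index        # true duplicate, reuse it
        bucket.append(block)             # new block, or a collision kept apart
        return key, len(bucket) - 1

    def get(self, ref):
        key, index = ref
        return self._buckets[key][index]


# Demo with a weak checksum so that two different blocks collide.
weak_checksum = lambda b: bytes([sum(b) % 256])
store = DedupStore(hash_fn=weak_checksum)
ref_a = store.put(b"ab")
ref_b = store.put(b"ba")    # same checksum and size, different content
ref_c = store.put(b"ab")    # genuine duplicate of the first block
```

With the comparison in place, ref_a and ref_b point at different blocks even 
though their checksum and size agree, while ref_c is deduplicated against 
ref_a.  Without it, b"ba" would silently read back as b"ab".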

Do you want to stake your career and reputation on that?  With a client's or 
employer's data? I sure don't.

"Those who walk on the razor's edge are destined to be cut to ribbons…" Someone 
I used to work with said that, not me.

For my home media server, maybe, but even then I'd hate to lose any of my 
family photos or video due to a hash collision.

I'll play it safe if I dedup.

Mike

---
Michael Sullivan                   
michael.p.sulli...@me.com
http://www.kamiogi.net/
Mobile: +1-662-202-7716
US Phone: +1-561-283-2034
JP Phone: +81-50-5806-6242

On 7 Jan 2011, at 00:05, Edward Ned Harvey wrote:

>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
>> boun...@opensolaris.org] On Behalf Of Peter Taps
>> 
>> Perhaps (Sha256+NoVerification) would work 99.999999% of the time. But
> 
> Append 50 more 9's on there. 
> 99.99999999999999999999999999999999999999999999999999999999%
> 
> See below.
> 
> 
>> I have been told that the checksum value returned by Sha256 is almost
>> guaranteed to be unique. In fact, if Sha256 fails in some case, we have a
>> bigger problem such as memory corruption, etc. Essentially, adding
>> verification to sha256 is an overkill.
> 
> Someone please correct me if I'm wrong.  I assume ZFS dedup matches both the
> blocksize and the checksum, right?  A simple checksum collision (which is
> astronomically unlikely) is still not sufficient to produce corrupted data.
> It's even more unlikely than that.
> 
> Using the above assumption, here's how you calculate the probability of
> corruption if you're not using verification:
> 
> Suppose every single block in your whole pool is precisely the same size
> (which is unrealistic in the real world, but I'm trying to calculate worst
> case.)  Suppose the block is 4K, which is again, unrealistically worst case.
> Suppose your dataset is purely random or sequential ... with no duplicated
> data ... which is unrealistic because if your data is like that, then why in
> the world are you enabling dedupe?  But again, assuming worst case
> scenario...  At this point we'll throw in some evil clowns, spit on a voodoo
> priestess, and curse the heavens for some extra bad luck.
> 
> If you have astronomically infinite quantities of data, then your
> probability of corruption approaches 100%.  With infinite data, eventually
> you're guaranteed to have a collision.  So the probability of corruption is
> directly related to the total amount of data you have, and the new question
> is:  For anything Earthly, how near are you to 0% probability of collision
> in reality?
> 
> Suppose you have 128TB of data.  That is ...  you have 2^35 unique 4k blocks
> of uniformly sized data.  Then the probability you have any collision in
> your whole dataset is (sum(1 thru 2^35))*2^-256 
> Note: sum of integers from 1 to N is  (N*(N+1))/2
> Note: 2^35 * (2^35+1) = 2^35 * 2^35 + 2^35 = 2^70 + 2^35
> Note: (N*(N+1))/2 in this case = 2^69 + 2^34
> So the probability of data corruption in this case, is 2^-187 + 2^-222 ~=
> 5.1E-57 + 1.5E-67
> 
> ~= 5.1E-57
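
For anyone who wants to check that arithmetic, here is a small Python 
sketch.  It assumes 128TB means 128 * 2^40 bytes and keeps the quoted 
N*(N+1)/2 pair count (the usual birthday bound counts N*(N-1)/2 pairs, which 
makes no visible difference at this scale).  The function and constant names 
are made up for illustration.

```python
from fractions import Fraction

BLOCK_SIZE = 4 * 1024        # 4K blocks, as in the quoted worst case
POOL_SIZE = 128 * 1024**4    # 128TB, read as 128 * 2^40 bytes (assumption)
HASH_BITS = 256              # SHA256 output width in bits


def collision_bound(n_blocks, hash_bits=HASH_BITS):
    """Estimate P(any collision) using the quoted N*(N+1)/2 pair count."""
    pairs = n_blocks * (n_blocks + 1) // 2   # always an exact integer
    return Fraction(pairs, 2**hash_bits)     # exact rational, no rounding


n_blocks = POOL_SIZE // BLOCK_SIZE
p = collision_bound(n_blocks)

print(n_blocks == 2**35)       # True
print(f"{float(p):.1e}")       # 5.1e-57
```

The exact value is 2^-187 + 2^-222, and the second term is far too small to 
affect the rounded result.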
> 
> In other words, even in the absolute worst case, cursing the gods, running
> without verification, using data that's specifically formulated to try and
> cause errors, on a dataset that I bet is larger than what you're doing, ...
> 
> Before we go any further ... The total number of bits stored on all the
> storage in the whole planet is a lot smaller than the total number of
> molecules in the planet.
> 
> There are an estimated 8.87 * 10^49 molecules in planet Earth.
> 
> The probability of a collision in your worst-case unrealistic dataset as
> described, is roughly 2 million times smaller than the chance of randomly
> finding a single specific molecule in the whole planet Earth by pure luck
> (1 / 8.87E49 ~= 1.1E-50, versus 5.1E-57).
> 
> 
> 
> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

