> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Peter Taps
>
> Perhaps (Sha256+NoVerification) would work 99.999999% of the time. But
Append 50 more 9's on there:
99.99999999999999999999999999999999999999999999999999999999%
See below.

> I have been told that the checksum value returned by Sha256 is almost
> guaranteed to be unique. In fact, if Sha256 fails in some case, we have a
> bigger problem such as memory corruption, etc. Essentially, adding
> verification to sha256 is an overkill.

Someone please correct me if I'm wrong: I assume ZFS dedup matches both
the block size and the checksum, right? So a simple checksum collision
(which is already astronomically unlikely) is still not sufficient to
produce corrupted data. It's even more unlikely than that.

Using that assumption, here's how you calculate the probability of
corruption if you're not using verification.

Suppose every single block in your whole pool is precisely the same size
(unrealistic in the real world, but I'm trying to calculate the worst
case). Suppose the block size is 4K, which is again an unrealistic worst
case. Suppose your dataset is purely random or sequential, with no
duplicated data at all, which is also unrealistic, because if your data
were like that, why in the world would you enable dedup? But again,
assume the worst-case scenario. At this point we'll throw in some evil
clowns, spit on a voodoo priestess, and curse the heavens for some extra
bad luck.

If you had an infinite quantity of data, your probability of corruption
would approach 100%: with infinite data, eventually you're guaranteed to
have a collision. So the probability of corruption is directly related
to the total amount of data you have, and the new question is: for
anything Earthly, how near are you to 0% probability of collision in
practice?

Suppose you have 128TB of data. That is, you have 2^35 unique 4K blocks
of uniformly sized data. Then the probability of any collision anywhere
in your whole dataset is at most

  (sum(1 thru 2^35)) * 2^-256

Note: the sum of the integers from 1 to N is (N*(N+1))/2
Note: 2^35 * (2^35 + 1) = 2^35 * 2^35 + 2^35 = 2^70 + 2^35
Note: (N*(N+1))/2 in this case = 2^69 + 2^34

So the probability of data corruption in this case is

  2^-187 + 2^-222 ~= 5.1E-57 + 1.5E-67 ~= 5.1E-57

In other words, that is the absolute worst case: cursing the gods,
running without verification, using data specifically formulated to
cause collisions, on a dataset that I bet is larger than what you're
actually running.

Before we go any further: the total number of bits stored on all the
storage on the whole planet is far smaller than the total number of
molecules in the planet. There are an estimated 8.87 * 10^49 molecules
in planet Earth.

The probability of a collision in your worst-case, unrealistic dataset
as described is still more than two million times smaller than the
probability of randomly picking one specific molecule out of the whole
planet Earth by pure luck.
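If you'd rather not trust my exponent juggling, here's a quick
back-of-envelope check in Python. It just encodes the worst-case
assumptions above (a 128TB pool of uniform 4K blocks and a 256-bit
hash) together with the same sum-of-integers overestimate; it's a
sanity check of the arithmetic, not a statement about how ZFS itself
does anything:

  # Back-of-envelope birthday-bound check for the numbers above.
  # Assumed inputs: 128TB pool, uniform 4K blocks, 256-bit hash (SHA-256).
  from fractions import Fraction

  pool_bytes  = 128 * 2**40        # 128TB
  block_bytes = 4 * 2**10          # 4K blocks, all the same size (worst case)
  hash_bits   = 256                # SHA-256 output size

  n_blocks = pool_bytes // block_bytes        # = 2^35 unique blocks
  pairs    = n_blocks * (n_blocks + 1) // 2   # sum(1 thru 2^35) = 2^69 + 2^34

  p_collision = Fraction(pairs, 2**hash_bits) # upper bound on any collision in the pool

  print(n_blocks == 2**35)                    # True
  print(float(p_collision))                   # roughly 5.1e-57

Strictly speaking the birthday bound only needs N*(N-1)/2 pairs; the
sum-of-integers form above rounds up slightly, which is fine since we
only care about an upper bound.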