For anyone who still cares:

I'm calculating the odds of a sha256 collision in an extremely large zpool,
containing 2^35 blocks of data, and no repetitions.

The formula on wikipedia for the birthday problem is:
p(n;d) ~= 1-( (d-1)/d )^( 0.5*n*(n-1) )

In this case, 
n=2^35
d=2^256

The problem is, this formula "does not compute" because n is so large.
Fortunately x = e^ln(x) and so we're able to use this technique to make the
huge exponent instead, a huge multiplication.

(Using the bc mathlib notation, the l(x) function is the natural log of x,
and e(x) is e raised to the power of x)
p(n;d) ~= 1-e(   (  0.5*n*(n-1)*l((d-1)/d)  )   )

Using bc to calculate the answer:
bc -l
 
n=2^35
d=2^256
scale=1024
1-e(   (  0.5*n*(n-1)*l((d-1)/d)  )   )
.0000000000000000000000000000000000000000000000000000000050978941154
I manually truncated here (precision goes out 1024 places).  This is
5.1*10^-57

Note: I had to repeat the calculation many times, setting a larger and
larger scale.  The default scale of 20, and even 64 and 70 and 80 were not
precise enough to produce a convergent answer around the -57th decimal
place.  So I just kept going larger, and in retrospect, anything over 100
would have been fine.  I wrote 1024 above, so who cares.

If you've been paying close attention you'll recognize this is the same
answer I originally calculated, but my original equation was in fact wrong.
It just so happens that my original equation neglects the probability of
collisions from previous events, so it is actually accurate whenever the
probability of previous events is insignificant.  It is merely luck that for
the data size in question, my equation produced something that looked
correct.  It would produce a wrong calculation of probability for a larger
value of n.

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

Reply via email to