On 2009-04-16, Adam Olsen <rha...@gmail.com> wrote: >>>>> The chance of *accidentally* producing a collision, although >>>>> technically possible, is so extraordinarily rare that it's >>>>> completely overshadowed by the risk of a hardware or software >>>>> failure producing an incorrect result. >>>> >>>> Not when you're using them to compare lots of files. >> >>>> Trust me. Been there, done that, got the t-shirt.
I must admit, that is a bit hard to believe. >> Okay, before I tell you about the empirical, real-world >> evidence I have could you please accept that hashes collide >> and that no matter how many samples you use the probability of >> finding two files that do collide is small but not zero. > > I'm afraid you will need to back up your claims with real files. > Although MD5 is a smaller, older hash (128 bits, so you only need > 2**64 files to find collisions), You don't need quite that many to have a significant chance of a collision. With "only" something on the order of 2**61 files, you still have about a 1% chance of a collision. Just for fun here are a few data points showing the probability of at least one collision in a sample of size n using a perfect 128-bit hash. These were calculated using the taylor-series approximation shown at http://en.wikipedia.org/wiki/Birthday_paradox#Calculating_the_probability For "a few million files" (we'll say 4e6), the probability of a collision is so close to 0 that it can't be calculated using double-precision IEEE floats. For a few _billion_ files, the taylor series approximation is still 0 to about 15 significant digits. For a few _trillion_ files, you finally get a "non-zero" probability of about 2e-14. Here are a few more points: n = 1806791225998070; p(n) = 0.00000000 n = 1987470348597877; p(n) = 0.00000001 n = 2186217383457665; p(n) = 0.00000001 n = 2404839121803431; p(n) = 0.00000001 n = 2645323033983774; p(n) = 0.00000001 n = 2909855337382151; p(n) = 0.00000001 n = 3200840871120366; p(n) = 0.00000002 n = 3520924958232403; p(n) = 0.00000002 n = 3873017454055643; p(n) = 0.00000002 n = 4260319199461207; p(n) = 0.00000003 n = 4686351119407328; p(n) = 0.00000003 n = 5154986231348061; p(n) = 0.00000004 n = 5670484854482868; p(n) = 0.00000005 n = 6237533339931155; p(n) = 0.00000006 n = 6861286673924271; p(n) = 0.00000007 n = 7547415341316699; p(n) = 0.00000008 n = 8302156875448370; p(n) = 0.00000010 n = 9132372562993208; p(n) = 0.00000012 n = 10045609819292530; p(n) = 0.00000015 n = 11050170801221784; p(n) = 0.00000018 n = 12155187881343964; p(n) = 0.00000022 n = 13370706669478362; p(n) = 0.00000026 n = 14707777336426200; p(n) = 0.00000032 n = 16178555070068822; p(n) = 0.00000038 n = 17796410577075706; p(n) = 0.00000047 n = 19576051634783280; p(n) = 0.00000056 n = 21533656798261608; p(n) = 0.00000068 n = 23687022478087772; p(n) = 0.00000082 n = 26055724725896552; p(n) = 0.00000100 n = 28661297198486208; p(n) = 0.00000121 n = 31527426918334832; p(n) = 0.00000146 n = 34680169610168320; p(n) = 0.00000177 n = 38148186571185152; p(n) = 0.00000214 n = 41963005228303672; p(n) = 0.00000259 n = 46159305751134040; p(n) = 0.00000313 n = 50775236326247448; p(n) = 0.00000379 n = 55852759958872200; p(n) = 0.00000458 n = 61438035954759424; p(n) = 0.00000555 n = 67581839550235368; p(n) = 0.00000671 n = 74340023505258912; p(n) = 0.00000812 n = 81774025855784816; p(n) = 0.00000983 n = 89951428441363312; p(n) = 0.00001189 n = 98946571285499648; p(n) = 0.00001439 n = 108841228414049616; p(n) = 0.00001741 n = 119725351255454592; p(n) = 0.00002106 n = 131697886381000064; p(n) = 0.00002548 n = 144867675019100096; p(n) = 0.00003084 n = 159354442521010112; p(n) = 0.00003731 n = 175289886773111136; p(n) = 0.00004515 n = 192818875450422272; p(n) = 0.00005463 n = 212100762995464512; p(n) = 0.00006610 n = 233310839295010976; p(n) = 0.00007998 n = 256641923224512096; p(n) = 0.00009678 n = 282306115546963328; p(n) = 0.00011710 n = 310536727101659712; p(n) = 0.00014169 n = 341590399811825728; p(n) = 0.00017144 n = 375749439793008320; p(n) = 0.00020744 n = 413324383772309184; p(n) = 0.00025099 n = 454656822149540160; p(n) = 0.00030369 n = 500122504364494208; p(n) = 0.00036745 n = 550134754800943680; p(n) = 0.00044460 n = 605148230281038080; p(n) = 0.00053794 n = 665663053309141888; p(n) = 0.00065088 n = 732229358640056192; p(n) = 0.00078751 n = 805452294504061824; p(n) = 0.00095280 n = 885997523954468096; p(n) = 0.00115278 n = 974597276349915008; p(n) = 0.00139469 n = 1072057003984906624; p(n) = 0.00168733 n = 1179262704383397376; p(n) = 0.00204131 n = 1297188974821737216; p(n) = 0.00246945 n = 1426907872303911168; p(n) = 0.00298726 n = 1569598659534302464; p(n) = 0.00361345 n = 1726558525487732736; p(n) = 0.00437061 n = 1899214378036506112; p(n) = 0.00528601 n = 2089135815840156928; p(n) = 0.00639252 n = 2298049397424172800; p(n) = 0.00772975 n = 2527854337166590464; p(n) = 0.00934539 n = 2780639770883249664; p(n) = 0.01129680 Here's the Python function I'm using: def bp(n, d): return 1.0 - exp(-n*(n-1.)/(2.*d)) I haven't spent much time studying the numerical issues of the way that the exponent is calculated, so I'm not entirely confident in the results for "small" n values such that p(n) == 0.0. -- Grant Edwards grante Yow! ... the MYSTERIANS are at in here with my CORDUROY visi.com SOAP DISH!! -- http://mail.python.org/mailman/listinfo/python-list