On Tue, Nov 16, 2010 at 9:05 PM, Dennis Gearon <gear...@sbcglobal.net> wrote:
> Read up on WikiPedia, but I believe that no Hash Function is much good above
> 50% of the address space it generates.

50% is way too high - collisions will happen before that. But given that something like MD5 has 128 bits, that's 2^128, or about 3.4e38, possible values, so even a small fraction of that address space will work.

The probabilities follow the "birthday problem":
http://en.wikipedia.org/wiki/Birthday_problem

Using a 128 bit hash, you can hash 26B docs with a hash collision probability of 1e-18 (and yes, that is lower than the probability of something else going wrong).

It also says:
"For comparison, 10^-18 to 10^-15 is the uncorrectable bit error rate of a typical hard disk [2]. In theory, MD5, 128 bits, should stay within that range until about 820 billion documents, even if its possible outputs are many more."

-Yonik
http://www.lucidimagination.com
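
P.S. For anyone who wants to check the 26B and 820B figures, here is a rough sketch using the standard birthday-problem approximation p ~= 1 - exp(-n(n-1)/(2N)). It assumes the hash output is uniformly distributed over N = 2^128 values, which is an idealization of MD5. The function names are made up for illustration and are not part of any library.

```python
import math

HASH_BITS = 128          # MD5 output size in bits
N = 2 ** HASH_BITS       # size of the hash address space (~3.4e38)


def collision_probability(num_docs, space=N):
    """Approximate chance of at least one collision among num_docs hashes.

    Birthday approximation: p ~= 1 - exp(-n(n-1) / (2N)).
    expm1 keeps precision when the exponent is tiny.
    """
    exponent = num_docs * (num_docs - 1) / (2 * space)
    return -math.expm1(-exponent)


def docs_for_probability(p, space=N):
    """Approximate number of docs at which the collision chance reaches p.

    Inverse of the approximation above: n ~= sqrt(2N * ln(1 / (1 - p))).
    """
    return math.sqrt(2 * space * -math.log1p(-p))


if __name__ == "__main__":
    for docs in (26_000_000_000, 820_000_000_000):
        print(docs, collision_probability(docs))
    for p in (1e-18, 1e-15):
        print(p, docs_for_probability(p))
```

With these assumptions, 26 billion docs comes out at roughly 1e-18 and 820 billion at roughly 1e-15, in line with the numbers above.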