Scott A Crosby wrote:
Birthday paradox is sqrt(count) not count/2.  So you'd expect a hit
after 64k inserts.

Arrgg! I knew that. Brain fade. That's embarrassing. Half the bits is not the same as half the number!


Ok, I agree with all you say. Jenkin's hash looks like a very good choice, now that I've Googled it to find out what it is :-)

Using a trick from Quinlan

Sean Quinlan? Venti? Or are you talking about something that Daniel Quinlan has done here? If the former, I still don't see how you get 2^i tokens in the database with an i bit key from a hash. As you said, the Birthday Paradox says you start getting collisions after 2^(i/2) tokens. On the other hand some collisions can be tolerated for our purpose: What difference does it really make if V!agrA hashes to the same thing as apophenia? Venti needed to ensure no collisions at all in very large sets because it is used for archival storage and retrieval, so it makes sense there to use a full 20 byte SHA-1 as key.


We can define what is a reasonable maximum number of tokens (allowing for eventually having multiword tokens) and how many collisions are tolerable, and set the number of bits based on that. But I don't think we would need more than 40 bits of hash.

 -- sidney



Reply via email to