On Tue, 09 Mar 2004 17:13:24 +1300, Sidney Markowitz <[EMAIL PROTECTED]> writes:

> Scott A Crosby wrote:


> > Using a trick from Quinlan
> 
> Sean Quinlan? Venti? Or are you talking about something that Daniel
> Quinlan has done here? If the former, I still don't see how you get
> 2^i tokens in the database with an i bit key from a hash. As you
> said, the Birthday Paradox says you start getting collisions after
> 2^(i/2) tokens. 

His insight was that a key's location in the hash table already
encodes part of its hash, so storing those bits again in the slot
would be redundant.

Implementing it, here's my algorithm. Define three keyed hash
functions:

  H1(x) -> Maps a string to a *page*.
  H2(x) -> Maps a string to an offset on a page.
  H3(x) -> Maps a string to a hash value.

I divide the hash table into 'pages' of 256 slots, each slot holding
one key. That means H1 returns log2(#slots) - 8 bits. Since a key can
ONLY be stored in the page returned by H1, those hash bits are
*implicit* and don't need to be stored in the slot.
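
Concretely, here's a rough sketch in C of that split; the keyed 64-bit
mixer and the three small salts are placeholders I made up, not a
recommendation:

  /* Sketch of the H1/H2/H3 split for a 1M-slot table of 4k pages x
   * 256 slots.  keyed_hash64() is a stand-in mixer (FNV-1a style,
   * salted with the key); any decent keyed 64-bit hash would do. */
  #include <stdint.h>
  #include <stdio.h>

  #define PAGES      4096        /* 1M slots / 256 slots per page */
  #define PAGE_SLOTS 256

  static uint64_t keyed_hash64(const char *s, uint64_t key)
  {
      uint64_t h = key ^ 14695981039346656037ULL;
      while (*s) { h ^= (unsigned char)*s++; h *= 1099511628211ULL; }
      return h;
  }

  static uint32_t H1(const char *s)   /* page index */
  { return (uint32_t)(keyed_hash64(s, 1) % PAGES); }

  static uint32_t H2(const char *s)   /* offset within the page */
  { return (uint32_t)(keyed_hash64(s, 2) % PAGE_SLOTS); }

  static uint32_t H3(const char *s)   /* 28-bit tag stored in the slot */
  { return (uint32_t)(keyed_hash64(s, 3) & 0xFFFFFFF); }

  int main(void)
  {
      const char *token = "viagra";
      printf("page %u, offset %u, tag %07x\n",
             H1(token), H2(token), H3(token));
      return 0;
  }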

On each page, I use H2 to choose the offset at which to start looking
for the key, then probe from there (probably quadratically). If a
probe runs past the end of the page, it wraps around to the beginning.

To disambiguate keys that land on the same page, each slot stores the
output of H3 explicitly; that is what a lookup compares against. Thus
the H3 bits are the only ones that need explicit storage.
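
A rough sketch of that per-page lookup in C; the empty-slot convention
(tag == 0), the field names, and the exact probe step (a triangular
variant of quadratic probing, which visits every slot of a
power-of-two page) are my own assumptions:

  /* Look up a key on one 256-slot page.  Only the 28-bit H3 tag is
   * compared, since landing on this page already accounts for the H1
   * bits.  tag == 0 is taken to mean "empty slot". */
  #include <assert.h>
  #include <stdint.h>
  #include <stddef.h>

  #define PAGE_SLOTS 256

  struct slot {
      uint32_t tag;    /* H3(x), 28 bits used; 0 = empty (assumed)    */
      uint32_t data;   /* counters + last-use time, packing not shown */
  };

  /* offset = H2(x), tag = H3(x); returns NULL if the key is absent. */
  static struct slot *page_lookup(struct slot page[PAGE_SLOTS],
                                  uint32_t offset, uint32_t tag)
  {
      for (uint32_t i = 0; i < PAGE_SLOTS; i++) {
          /* triangular probe: hits every slot of a power-of-two page */
          uint32_t probe = (offset + i * (i + 1) / 2) % PAGE_SLOTS;
          if (page[probe].tag == tag)
              return &page[probe];
          if (page[probe].tag == 0)
              return NULL;               /* ran into an empty slot */
      }
      return NULL;                       /* page is full, key absent */
  }

  int main(void)
  {
      struct slot page[PAGE_SLOTS] = { { 0, 0 } };
      page[17].tag = 0xABCDEF1;          /* pretend H2 = 17, H3 = 0xABCDEF1 */
      assert(page_lookup(page, 17, 0xABCDEF1) == &page[17]);
      assert(page_lookup(page, 17, 0x1234567) == NULL);
      return 0;
  }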

With 1 million keys, H1 returns one of 4k pages (12 bits) and H3
returns 28 bits; add 10 bits each for two counters and 16 bits for a
circular last-use time, and each slot is 64 bits, with an effective
key size of 12+28 = 40 bits thanks to the implicit H1 bits. If the
hash table grows, H1 simply contributes more bits.
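
One possible packing of that 64-bit slot, as a sketch; the field order
and names are my guesses, and only the widths (28 + 10 + 10 + 16) come
from the description above:

  /* 64-bit slot: 28-bit H3 tag | 10-bit count | 10-bit count |
   * 16-bit circular last-use time.  Field order is arbitrary. */
  #include <assert.h>
  #include <stdint.h>

  #define TAG_BITS 28
  #define CNT_BITS 10
  #define LRU_BITS 16

  static uint64_t pack_slot(uint32_t tag, uint32_t spam, uint32_t ham,
                            uint32_t lru)
  {
      return ((uint64_t)(tag  & 0xFFFFFFF) << (2 * CNT_BITS + LRU_BITS))
           | ((uint64_t)(spam & 0x3FF)     << (CNT_BITS + LRU_BITS))
           | ((uint64_t)(ham  & 0x3FF)     << LRU_BITS)
           |  (uint64_t)(lru  & 0xFFFF);
  }

  static uint32_t slot_tag(uint64_t s)
  { return (uint32_t)(s >> (2 * CNT_BITS + LRU_BITS)); }

  static uint32_t slot_spam(uint64_t s)
  { return (uint32_t)(s >> (CNT_BITS + LRU_BITS)) & 0x3FF; }

  static uint32_t slot_ham(uint64_t s)
  { return (uint32_t)(s >> LRU_BITS) & 0x3FF; }

  static uint32_t slot_lru(uint64_t s)
  { return (uint32_t)(s & 0xFFFF); }

  int main(void)
  {
      uint64_t s = pack_slot(0x1234567, 3, 17, 42);
      assert(slot_tag(s) == 0x1234567 && slot_spam(s) == 3 &&
             slot_ham(s) == 17 && slot_lru(s) == 42);
      return 0;
  }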

It's minimalistic, perhaps too much so --- there's no way to rehash
or change the table size once it is built --- but 8 bytes/key is hard
to beat. I'm sure you can see other variants and different design
tradeoffs with this approach.

If you're willing to spend 11 bytes/key, I'd drop the whole 'implicit
bits' and page approach and store a 48-bit hash key, two 12-bit
counts, and a 16-bit LRU time. That way you can resize and rehash at
any time.
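
A flat record for that variant might look something like this sketch;
how the two 12-bit counts share three bytes is my own choice:

  /* Flat 11-byte (88-bit) record: 48-bit token hash, two 12-bit
   * counts, 16-bit circular last-use time.  Byte arrays keep the
   * struct at exactly 11 bytes with no padding. */
  #include <assert.h>
  #include <stdint.h>

  struct record {
      uint8_t key[6];     /* 48-bit hash of the token      */
      uint8_t counts[3];  /* two 12-bit counters, packed   */
      uint8_t lru[2];     /* 16-bit circular last-use time */
  };

  static void set_counts(struct record *r, uint16_t spam, uint16_t ham)
  {
      r->counts[0] = (uint8_t)(spam >> 4);
      r->counts[1] = (uint8_t)(((spam & 0xF) << 4) | (ham >> 8));
      r->counts[2] = (uint8_t)(ham & 0xFF);
  }

  static uint16_t get_spam(const struct record *r)
  { return (uint16_t)((r->counts[0] << 4) | (r->counts[1] >> 4)); }

  static uint16_t get_ham(const struct record *r)
  { return (uint16_t)(((r->counts[1] & 0xF) << 8) | r->counts[2]); }

  int main(void)
  {
      assert(sizeof(struct record) == 11);
      struct record r = { {0}, {0}, {0} };
      set_counts(&r, 123, 4095);
      assert(get_spam(&r) == 123 && get_ham(&r) == 4095);
      return 0;
  }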

> We can define what is a reasonable maximum number of tokens (allowing
> for eventually having multiword tokens) and how many collisions are
> tolerable, and set the number of bits based on that. But I don't think
> we would need more than 40 bits of hash.

Agreed, except I'd go to 48 bits for robustness. 1 million keys is
possible, even likely in some situations; 16 million keys (roughly
sqrt(2^48) = 2^24, where the birthday paradox predicts the first
collision) is much less so.

Scott
