Sleepycat Software wrote: > > > It seems to me that if the keys are huffman encoded before they are > > given to the database (both at insertion time and at lookup time) you'd > > get the best of both worlds: smaller keys and no need to decode when > > comparing keys. You wouldn't be able to pack some of the data into the > > key, but I am not sure how exactly that would be beneficial anyway. > > I agree this is an interesting question. The problem is that > B+trees are sorted structures, and the data structure is often > chosen because of locality of reference, i.e., if you look up > record "aaaa", it is known that you're likely to look up record > "aaab" too. > > The question that I haven't thought through is if it's possible > to Huffman encode data without destroying the sort order, i.e., > if I have "aaaa", "aaab" and "aaac", does the Huffman encoded > representation sort in the exact same order? (Remember, the > sort routine is sorting bytes, not bits.) If the sort order is > changed, we'll have fundamentally changed how Btrees work, and > in a very bad way. > > So, the short answer is that I strongly agree with you iff the > Huffman encoding sorts identically to the original key items. > I don't know enough about Huffman engines to know the answer to > that question. The way I envision it working from the discussion so far, the 8-bit characters will be huffman encoded into a variable length bitstream or anywhere from 1 to 16(?) bits. A sequence of characters will be converted to a sequence of bitstreams, so as long as the bit order is chosen correctly, the grouping of compressed keys will be close to the grouping of uncompressed keys. I don't think sorting will matter at all since the domain has changed. The only problem I see with this method is that prefix matches will be much more complicated since there can potentially be 128 different prefixes that will need to be looked up to get all possible words matching a prefix. (Reasoning: worst case is when a compressed prefix uses only one bit of the last byte and therefore the key space covered by the other 7 bits will need to be iterated over.) Maybe this can be solved using a custom compare function. Cheers -- Andrew Scherpbier <[EMAIL PROTECTED]> Contigo Software <http://www.contigo.com/>
begin:vcard n:Scherpbier;Andrew tel;fax:+1 619-278-2502 tel;work:+1 619-278-2329 x115 x-mozilla-html:FALSE url:http://www.contigo.com/ org:Contigo Software<br><img src="http://www.contigo.com/gifs/logo.gif">;Research & Development version:2.1 email;internet:[EMAIL PROTECTED] title:Vice President R&D note:<a href="http://maps.yahoo.com/py/maps.py?Pyt=Tmap&addr=8334+Clairemont+Mesa+Blvd.&csz=San+Diego%2C+CA+92111&Get+Map=Get+Map">Yahoo Map</a> adr;quoted-printable:;;8334 Clairemont Mesa Blvd.=0D=0ASuite 204;San Diego;CA;92111;USA x-mozilla-cpt:;26848 fn:Andrew Scherpbier end:vcard
