Re: [htdig3-dev] Re: db compression [#948]

Andrew Scherpbier Mon, 26 Jul 1999 11:36:26 -0700

Sleepycat Software wrote:
> 
> > It seems to me that if the keys are huffman encoded before they are
> > given to the database (both at insertion time and at lookup time) you'd
> > get the best of both worlds:  smaller keys and no need to decode when
> > comparing keys.  You wouldn't be able to pack some of the data into the
> > key, but I am not sure how exactly that would be beneficial anyway.
> 
> I agree this is an interesting question.  The problem is that
> B+trees are sorted structures, and the data structure is often
> chosen because of locality of reference, i.e., if you look up
> record "aaaa", it is known that you're likely to look up record
> "aaab" too.
> 
> The question that I haven't thought through is if it's possible
> to Huffman encode data without destroying the sort order, i.e.,
> if I have "aaaa", "aaab" and "aaac", does the Huffman encoded
> representation sort in the exact same order?  (Remember, the
> sort routine is sorting bytes, not bits.)  If the sort order is
> changed, we'll have fundamentally changed how Btrees work, and
> in a very bad way.
> 
> So, the short answer is that I strongly agree with you iff the
> Huffman encoding sorts identically to the original key items.
> I don't know enough about Huffman engines to know the answer to
> that question.

The way I envision it working from the discussion so far, the 8-bit
characters will be huffman encoded into a variable length bitstream or
anywhere from 1 to 16(?) bits.  A sequence of characters will be
converted to a sequence of bitstreams, so as long as the bit order is
chosen correctly, the grouping of compressed keys will be close to the
grouping of uncompressed keys.  I don't think sorting will matter at all
since the domain has changed.

The only problem I see with this method is that prefix matches will be
much more complicated since there can potentially be 128 different
prefixes that will need to be looked up to get all possible words
matching a prefix. (Reasoning: worst case is when a compressed prefix
uses only one bit of the last byte and therefore the key space covered
by the other 7 bits will need to be iterated over.)
Maybe this can be solved using a custom compare function.

Cheers
-- 
Andrew Scherpbier <[EMAIL PROTECTED]>
Contigo Software <http://www.contigo.com/>

begin:vcard 
n:Scherpbier;Andrew
tel;fax:+1 619-278-2502
tel;work:+1 619-278-2329 x115
x-mozilla-html:FALSE
url:http://www.contigo.com/
org:Contigo Software<br><img src="http://www.contigo.com/gifs/logo.gif">;Research & Development
version:2.1
email;internet:[EMAIL PROTECTED]
title:Vice President R&D
note:<a href="http://maps.yahoo.com/py/maps.py?Pyt=Tmap&addr=8334+Clairemont+Mesa+Blvd.&csz=San+Diego%2C+CA+92111&Get+Map=Get+Map">Yahoo Map</a>
adr;quoted-printable:;;8334 Clairemont Mesa Blvd.=0D=0ASuite 204;San Diego;CA;92111;USA
x-mozilla-cpt:;26848
fn:Andrew Scherpbier
end:vcard

Re: [htdig3-dev] Re: db compression [#948]

Reply via email to