On Sat, 24 Jul 1999, Sleepycat Software wrote:

> You're concentrating on space compression and on specific types
> of data.  The prefix-compress feature was originally implemented
> for genome data sets, where keys are often quite long and often
> identical over long prefixes.

True. We're looking at it from a standpoint of a word index, which will
often have shorter keys, but lots of keys will often share a prefix.

> Why is this feature absolutely critical for htdig?

We're trying to generate our word database on-the-fly. There are
essentially two options:
1) Use each unique word in the collection as a key and keep the
document list as the data, including the necessary location information.
2) Use each *word* in the collection (possibly paired with a document ID)
as a key and the location information as the data. This obviously involves
the DB_DUP flag and storing many duplcate keys.

The problem we're facing is that #1 on a dynamically generated index
involves expanding variable-length lists and it's painfully slow with the
current Berkeley DB code.

On the other hand, #2 would involve a *lot* of keys, so we care about
ensuring fast access as well as space required to store all the keys.

For many reasons, we'd prefer option #1. In particular, it involves fewer
keys, making it faster to query. So we're interested in whatever option
works. Right now, we're trying #2. Fixing that would require some sort of
key compression. Fixing #1 requires some knowledge of your BTree format to
know why the expansion isn't as fast as we'd like.

> First, it's possible to do both key and data compression on leaf pages
> without affecting internal pages at all.  The literature that describes
> this calls them "minimum difference btrees", where each key/data pair is
> the minimum modification necessary to transform the previous key/data

If we go with option #2 as I've described it, this would be very
worthwhile since many data entries would be very similar. For example,
it's quite possible for frequent words to occur very close together,
yielding very compressible locations.

> Second, it would be interesting to do Huffman encoding of key/data pairs.
> This would be a straight time/space trade-off, and could be done entirely

Something similar may pay off here as well. We obviously know a great deal
about the likely format of both keys and data and could write a custom
compressor that would yield good results. We already do this for the data
entries since we know some of the fields are often zero.

-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/



------------------------------------
To unsubscribe from the htdig3-dev mailing list, send a message to
[EMAIL PROTECTED] containing the single word "unsubscribe" in
the SUBJECT of the message.

Reply via email to