>> There is a default prefix compression routine:
>> If your application-specified prefix compression function does not
>> perform as well as the default one, this would be the expected outcome.
>
> I was using the default prefix function.
If you specified the default prefix function as your prefix function, and
got different results than specifying no prefix function, I'm interested
in tracking that down; that sounds like a problem.
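For reference, a btree prefix callback is just a function that returns how
many leading bytes of the second key are needed to sort it after the first
key. The sketch below shows that behavior (illustrative only, not the exact
code we ship), with a comment showing how it would be installed through the
2.x DB_INFO interface:

    #include <string.h>
    #include <db.h>

    /*
     * Return how many leading bytes of key "b" are needed to sort it
     * after key "a"; this is roughly what the default routine does.
     */
    size_t
    word_prefix(const DBT *a, const DBT *b)
    {
            size_t cnt, len;
            const unsigned char *p1, *p2;

            cnt = 1;
            p1 = a->data;
            p2 = b->data;
            len = a->size < b->size ? a->size : b->size;
            for (; len > 0; --len, ++p1, ++p2, ++cnt)
                    if (*p1 != *p2)
                            return (cnt);

            /* The shorter key is a prefix of the longer one. */
            return (a->size < b->size ? a->size + 1 : b->size);
    }

    /*
     * Installed via the DB_INFO structure passed to db_open(), e.g.:
     *
     *      DB_INFO dbinfo;
     *
     *      memset(&dbinfo, 0, sizeof(dbinfo));
     *      dbinfo.bt_prefix = word_prefix;
     */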
>> Why is this feature absolutely critical for htdig?
>
> It is because the approach currently chosen implies creating an entry
> for every word occurrence in every document. That is, if we have 100
> documents containing 100 words each, we will create 10,000 entries in the
> btree. Each key is the word + document number. Each existing word will
> therefore be the prefix of a large number of keys. Tests show that it
> takes 500 MB to store 11 million entries with the current db
> implementation. If we assume that each document contains 100 words on
> average, this means that indexing 100,000 documents will take roughly
> 500 MB. It's too much.
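In other words, every posting key would look something like the sketch
below (a purely hypothetical encoding; the real htdig format may differ),
with the word bytes forming the shared prefix of all keys for that word:

    #include <string.h>

    /*
     * Hypothetical posting key: the word followed by a fixed-width,
     * big-endian document number, so all keys for one word sort
     * adjacently and share the word as a common prefix.
     */
    size_t
    make_posting_key(unsigned char *buf, const char *word, unsigned long docid)
    {
            size_t wlen = strlen(word);

            memcpy(buf, word, wlen);
            buf[wlen + 0] = (unsigned char)(docid >> 24);
            buf[wlen + 1] = (unsigned char)(docid >> 16);
            buf[wlen + 2] = (unsigned char)(docid >> 8);
            buf[wlen + 3] = (unsigned char)docid;
            return (wlen + 4);
    }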
Why is it too much? How many sites index 100,000 documents? My
bet is that it's not many, and last time I checked, 500MB of
disk cost under $7 US. How many sites are going to benefit from
this work?
>> We're happy to provide snapshots of our source tree; we don't currently
>> export CVS access, although I could probably be talked into doing that
>> in September (we use SCCS internally, but will probably be switching to
>> CVS in late August).
>
> This is good news :-) Could you tell me where to download the latest
> snapshot?
www.sleepycat.com
The latest download is 2.7.5.
> I'll have to discuss the leaf page compression you suggested with Geoff.
> If we take this approach it may be convenient to just compress the pages
> (as a feature of the mpool, for instance). Although this is a more
> general solution to the space/time problem, I can't imagine how hard it
> would be for the buffer pool to manage pages that have a different size
> on disk (compressed) than in memory (uncompressed).
That's hard. Writing a buffer pool to handle variable-sized pages is
a nasty little problem. I remember one paper that's probably worth
finding: I think it was SOSP 1993, and the work was done at DEC SRC.
The title was something about compression in a log-structured filesystem.
(If you can't find it, let me know; I can probably dig up my copy.)
Anyway, I haven't done an implementation, so I can't say for sure, but
I seriously doubt this approach will work well.
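Just to make the problem concrete: even a trivial zlib experiment (a
sketch, not anything we've implemented) shows that a fixed-size in-memory
page turns into a data-dependent number of bytes on disk, which is exactly
what the buffer pool and the on-disk layout would have to cope with:

    /* Build with: cc -o pagecomp pagecomp.c -lz */
    #include <stdio.h>
    #include <string.h>
    #include <zlib.h>

    #define PAGESIZE        8192

    int
    main(void)
    {
            /* Output buffer sized generously for zlib's worst case. */
            unsigned char page[PAGESIZE], out[PAGESIZE + PAGESIZE / 10 + 64];
            uLongf outlen = sizeof(out);
            int i;

            /* Fake page contents: repetitive text compresses well. */
            for (i = 0; i + 8 <= PAGESIZE; i += 8)
                    memcpy(page + i, "wordsetX", 8);

            if (compress(out, &outlen, page, PAGESIZE) != Z_OK)
                    return (1);
            printf("%d bytes in memory -> %lu bytes on disk\n",
                PAGESIZE, (unsigned long)outlen);
            return (0);
    }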
My inclination is to do the Huffman encoding -- that's going to be
relatively straightforward and is useful regardless of the other work.
Plus, it should be straightforward to calculate how much compression
you'll get: select an encoding, then use English character frequencies
to estimate the savings.
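As a back-of-the-envelope check, something like the sketch below builds a
Huffman code over rough textbook English letter frequencies (not measured
from real htdig data) and prints the expected bits per character against
the 8 bits used today; on those figures the word text itself shrinks to
very roughly half its size, though the per-entry btree overhead is of
course untouched:

    #include <stdio.h>

    #define NSYM    26

    int
    main(void)
    {
            /* Approximate English letter frequencies, a-z, in percent. */
            static double w[NSYM] = {
                    8.2, 1.5, 2.8, 4.3, 12.7, 2.2, 2.0, 6.1, 7.0, 0.15,
                    0.8, 4.0, 2.4, 6.7, 7.5, 1.9, 0.1, 6.0, 6.3, 9.1,
                    2.8, 1.0, 2.4, 0.15, 2.0, 0.07
            };
            double bits, m, total;
            int a, b, i, n;

            /* Normalize the weights to probabilities. */
            for (total = 0.0, i = 0; i < NSYM; i++)
                    total += w[i];
            for (i = 0; i < NSYM; i++)
                    w[i] /= total;

            /*
             * Standard Huffman construction: repeatedly merge the two
             * lightest subtrees.  Each merge adds one bit to every
             * symbol beneath it, so the expected code length is the
             * sum of the merged weights.
             */
            for (bits = 0.0, n = NSYM; n > 1; n--) {
                    a = 0; b = 1;
                    if (w[b] < w[a]) { a = 1; b = 0; }
                    for (i = 2; i < n; i++)
                            if (w[i] < w[a]) { b = a; a = i; }
                            else if (w[i] < w[b]) b = i;
                    m = w[a] + w[b];
                    bits += m;
                    w[a] = m;               /* keep the merged node... */
                    w[b] = w[n - 1];        /* ...and drop the last slot. */
            }
            printf("expected code length: %.2f bits/char (vs. 8)\n", bits);
            printf("estimated savings: %.0f%%\n", 100.0 * (1.0 - bits / 8.0));
            return (0);
    }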
Regards,
--keith
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Keith Bostic
Sleepycat Software Inc. [EMAIL PROTECTED]
394 E. Riding Dr. +1-978-287-4781
Carlisle, MA 01741-1601 http://www.sleepycat.com