Hi,
I finaly found out what happens with prefix compression.
When a leaf page of the btree is split because it has no room for
a new key, an entry is added in the parent page to reference the new
leaf page. The key of the parent page entry may be reduced by prefix
compression (assuming the last key of the initial leaf page and the last
key of the new page have something in common).
In short, only keys stored in the internal btree pages are
compressed. Since the number of internal pages in a btree is really
small compared to the number of leaf pages, the expected gain is small.
What is more suprising is that more internal pages are used
when prefix compression is used. I've inserted 5 000 000 keys (50 000
alphabetical words, each of them inserted with a suffix number going from
0 to 99, in ascii). With prefix compression I have 548 internal pages,
without prefix compression I only have 546 internal pages.
My conclusion is that prefix compression is completely useless
the way it's implemented. It would save a lot of space if the leaf
nodes only contained the suffix of the keys. I.e.
keys: passage, password, passway
internal page key : pass
leaf node keys : word
age
way
I understand, however, that this requires a lot of modifications
in the db code and we cannot expect it to work in short delays. Since
this feature is absolutely critical for htdig, I'm ready to spend all
the time needed to implement it. I'm not familiar with your contribution
mechanism, however. Can I have access to a CVS tree ? Can I expect a short
delay feedback when the patch is ready a passes all the regression tests ?
Here are the stats of the database built:
#
# Without prefix function 100 documents
#
make DISABLE_PREFIX=-P prefix
make[1]: Entering directory `/home/loic/local/ports/htdig3/bench'
Reading from words.all ... pushed 59467 words (100 documents)
0x53162 Btree magic number.
6 Btree version number.
Flags: DB_DUP
2 Minimum keys per-page.
4096 Underlying tree page size.
4 Number of levels in the tree.
5946700 Number of keys in the tree.
546 Number of tree internal pages.
78095 Number of tree leaf pages.
0 Number of tree duplicate pages.
0 Number of tree overflow pages.
0 Number of pages on the free list.
126556 Number of bytes free in tree internal pages (94% ff).
131M Number of bytes free in tree leaf pages (59% ff).
0 Number of bytes free in tree duplicate pages (0% ff).
0 Number of bytes free in tree overflow pages (0% ff).
#
# With prefix function 100 documents
#
make prefix
make[1]: Entering directory `/home/loic/local/ports/htdig3/bench'
Reading from words.all ... pushed 59467 words (100 documents)
0x53162 Btree magic number.
6 Btree version number.
Flags: DB_DUP
2 Minimum keys per-page.
4096 Underlying tree page size.
4 Number of levels in the tree.
5946700 Number of keys in the tree.
548 Number of tree internal pages.
78095 Number of tree leaf pages.
0 Number of tree duplicate pages.
0 Number of tree overflow pages.
0 Number of pages on the free list.
138888 Number of bytes free in tree internal pages (94% ff).
131M Number of bytes free in tree leaf pages (59% ff).
0 Number of bytes free in tree duplicate pages (0% ff).
0 Number of bytes free in tree overflow pages (0% ff).
P.S. Another concern is the huge amount of free space (40%) in leaf pages,
but this is another subject :-)
--
Loic Dachary
ECILA
100 av. du Gal Leclerc
93500 Pantin - France
Tel: 33 1 56 96 09 80, Fax: 33 1 56 96 09 61
e-mail: [EMAIL PROTECTED] URL: http://www.senga.org/
------------------------------------
To unsubscribe from the htdig3-dev mailing list, send a message to
[EMAIL PROTECTED] containing the single word "unsubscribe" in
the SUBJECT of the message.