I think I have found the optimal usage pattern for db. Each
key is 4 bytes long and contains an integer. There is no prefix routine,
and a custom comparison routine compares the keys as integers. The data
is the same as the key.
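
To illustrate why a custom comparison routine matters for such keys, here
is a minimal sketch in Python. The real routine would be a C callback
handed to db; the function name and the choice of a little-endian signed
32-bit layout are assumptions made for the example, not taken from the
actual code.

```python
import struct

def compare_int_keys(a: bytes, b: bytes) -> int:
    """Compare two 4-byte keys as integers; return <0, 0 or >0."""
    # Assumption: keys are little-endian signed 32-bit integers.
    (x,) = struct.unpack("<i", a)
    (y,) = struct.unpack("<i", b)
    return (x > y) - (x < y)

def make_key(n: int) -> bytes:
    """Pack an integer into the assumed 4-byte key layout."""
    return struct.pack("<i", n)

k1, k256 = make_key(1), make_key(256)

# Integer comparison orders 1 before 256.
print(compare_int_keys(k1, k256))

# A plain byte-wise comparison orders them the other way round,
# because the low-order byte comes first in a little-endian key.
print(k256 < k1)
```

With byte-wise ordering, numerically adjacent keys can end up far apart
in the tree, which is presumably why an integer comparison is used here.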

0x53162 Btree magic number.
6       Btree version number.
Flags:
2       Minimum keys per-page.
4096    Underlying tree page size.
4       Number of levels in the tree.
59M     Number of keys in the tree.
1316    Number of tree internal pages.
294392  Number of tree leaf pages.
0       Number of tree duplicate pages.
0       Number of tree overflow pages.
0       Number of pages on the free list.
33406   Number of bytes free in tree internal pages (99% ff).
8835440 Number of bytes free in tree leaf pages (99% ff).
0       Number of bytes free in tree duplicate pages (0% ff).
0       Number of bytes free in tree overflow pages (0% ff).

        The file is 1.2 GB. As you can see, it contains about 59 million
entries and only 1% of the space is lost. I did this to test what we can
expect if entries are compressed (as suggested by Keith, using Huffman
coding with static frequencies).
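
The figures above can be cross-checked with a little arithmetic. This is
only a sketch: it takes the fill factor as the plain ratio of used bytes
to page bytes, which may not be exactly how the statistics tool computes
it, and it treats "59M" as 59,000,000 although that number is rounded.

```python
PAGE_SIZE = 4096          # "Underlying tree page size"
LEAF_PAGES = 294392       # "Number of tree leaf pages"
INTERNAL_PAGES = 1316     # "Number of tree internal pages"
LEAF_FREE = 8835440       # bytes free in leaf pages
INTERNAL_FREE = 33406     # bytes free in internal pages
KEYS = 59_000_000         # "59M", rounded in the report

def fill_factor(pages: int, free_bytes: int) -> float:
    """Fraction of page bytes in use (simple ratio, an assumption)."""
    return 1 - free_bytes / (pages * PAGE_SIZE)

leaf_fill = fill_factor(LEAF_PAGES, LEAF_FREE)
internal_fill = fill_factor(INTERNAL_PAGES, INTERNAL_FREE)

# Size of all tree pages; the real file has a little more (metadata).
tree_bytes = (LEAF_PAGES + INTERNAL_PAGES) * PAGE_SIZE

# Leaf bytes consumed per key/data pair, item overhead included.
leaf_bytes_per_pair = LEAF_PAGES * PAGE_SIZE / KEYS

print(f"leaf fill:      {leaf_fill:.2%}")
print(f"internal fill:  {internal_fill:.2%}")
print(f"tree size:      {tree_bytes / 1e9:.2f} GB")
print(f"bytes per pair: {leaf_bytes_per_pair:.1f}")
```

Note that each pair takes about 20 bytes of leaf space for 8 bytes of
actual key and data, so the 99% fill measures how full the pages are,
not how much of the file is payload.
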
        I would say that this is encouraging. There is still a mystery
though (at least for me :-). Why do I get a 99% fill with this usage
pattern but only 60% when using keys and data whose sizes vary between
6 bytes and 35 bytes, with an average size of 8 bytes?
        Once we compress entries, the keys and the data will no longer
have a fixed size. Will we lose 40% of the space then?
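
To make the point about variable sizes concrete, here is a small sketch
of Huffman coding with a static frequency table. The alphabet and the
frequencies are invented for the example and do not come from any real
word list; the helper names are hypothetical too.

```python
import heapq

# Hypothetical static frequency table, fixed in advance.
STATIC_FREQ = {"e": 12, "t": 9, "a": 8, "o": 7, "q": 1, "z": 1}

def code_lengths(freq):
    """Return {symbol: Huffman code length in bits} for a frequency table."""
    # Heap entries are (weight, tie_breaker, {symbol: depth}); the unique
    # tie_breaker keeps the dicts from ever being compared.
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol one level deeper.
        merged = {s: d + 1 for s, d in {**left, **right}.items()}
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

LENGTHS = code_lengths(STATIC_FREQ)

def encoded_bits(word: str) -> int:
    """Number of bits needed to encode a word with the static code."""
    return sum(LENGTHS[ch] for ch in word)

for word in ("tea", "zoo"):
    bits = encoded_bits(word)
    print(word, bits, "bits,", (bits + 7) // 8, "byte(s)")
```

Two words of the same length come out with different encoded sizes, so
compressed entries will indeed be variable-size items in the tree.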

        In conclusion, assuming that a document contains 100 words on
average and that we can come close to this usage pattern using
static Huffman compression, we would be able to index more than 500,000
documents in a 1.2 GB word file.
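
The estimate works out as follows, as a back-of-the-envelope sketch. It
assumes one tree entry per word of a document and takes "59M" as
59,000,000, which is a rounded figure.

```python
ENTRIES = 59_000_000      # keys in the tree ("59M", rounded)
WORDS_PER_DOC = 100       # assumed average document size in words

# Assumption: each word of a document costs exactly one entry.
documents = ENTRIES // WORDS_PER_DOC

print(documents)
```

That gives roughly 590,000 documents, hence "more than 500,000".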

-- 
                Loic Dachary

                ECILA
                100 av. du Gal Leclerc
                93500 Pantin - France
                Tel: 33 1 56 96 09 80, Fax: 33 1 56 96 09 61
                e-mail: [EMAIL PROTECTED] URL: http://www.senga.org/


------------------------------------
To unsubscribe from the htdig3-dev mailing list, send a message to
[EMAIL PROTECTED] containing the single word "unsubscribe" in
the SUBJECT of the message.
