> OK, before we start hacking the Berkeley DB code to continue with this
> format, we should decide if it's the right way to go.
From Keith Bostic's mail, I'd say that we should give up this
approach. We could compress the leaf pages of db, but I've been through
this with reiserfs and the implications are not trivial. Another factor
argues in favor of fewer keys: db seems to consistently leave 40% free
space in leaf pages once we have more than 5 million keys. Here is the
result for an 11-million-key db:
#
# With prefix function 200 documents
#
make prefix
make[1]: Entering directory `/home/loic/local/ports/htdig3/bench'
Reading from words.all ... pushed 59467 words (200 documents)
no prefix : 54 bytes per word
0x53162 Btree magic number.
6 Btree version number.
Flags: DB_DUP
2 Minimum keys per-page.
4096 Underlying tree page size.
4 Number of levels in the tree.
11M Number of keys in the tree.
1208 Number of tree internal pages.
156652 Number of tree leaf pages.
0 Number of tree duplicate pages.
0 Number of tree overflow pages.
0 Number of pages on the free list.
698290 Number of bytes free in tree internal pages (86% ff).
258M Number of bytes free in tree leaf pages (60% ff).
0 Number of bytes free in tree duplicate pages (0% ff).
0 Number of bytes free in tree overflow pages (0% ff).
I can't figure out why this happens. I will have to find out.
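For the record, the 40% figure is just arithmetic on the db_stat output
above (the 258M free figure is rounded, so the result is approximate):

/* Rough check of the leaf fill factor from the db_stat output above.
 * The 258M free figure is rounded, so the result is approximate. */
#include <stdio.h>

int main(void)
{
    const double page_size  = 4096.0;              /* underlying tree page size */
    const double leaf_pages = 156652.0;            /* number of tree leaf pages */
    const double bytes_free = 258.0 * 1024 * 1024; /* bytes free in leaf pages  */

    double total = leaf_pages * page_size;

    printf("leaf bytes total: %.0f\n", total);
    printf("leaf bytes free : %.0f (%.0f%% of total)\n",
           bytes_free, 100.0 * bytes_free / total);
    printf("fill factor     : %.0f%%\n", 100.0 * (1.0 - bytes_free / total));
    return 0;
}

which gives roughly 42% free / 58% full, in line with the 60% ff that
db_stat prints for the leaf pages.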
Even if we decide to create only one key entry per word, 2 million
URLs can contain over 10 million different words.
> Sure enough, there was: section 5.7 pp. 257-260 (2nd ed.). Basically, we'd
> be using a modified B-Tree that keeps around blocks of free space for
> expansion. They report that it seems to perform well, with less than 5%
> waste and only 1-2 disk accesses per write. I'm guessing that the eventual
> performance would be better than my current approach at large numbers of
> words because the number of keys would be much smaller.
Too bad I didn't bring the book with me. Tim A. H. Bell wrote a piece
of software that is worth looking at (Tim is *not* the Bell of the MG
book; he is a student of A. Moffat). I'll send you a copy of his code.
Unfortunately, because of his university's regulations, the copyright
status of this code is unclear and it cannot be used as is. This may
change in the near future since Tim is hassling them to go GPL.
I'll continue benchmarking to clarify what we can expect from db (space
and performance) for various usage patterns. I'm sure this will help us
find a scalable solution.
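A bytes-per-word figure like the one above boils down to the db file
size over the number of words pushed. A trivial helper along these
lines (the file name and word count are placeholders):

/* Bytes per word: size of the db file over the number of words pushed.
 * The file name and word count below are placeholders. */
#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
    const char *db_file = "words.db";   /* placeholder db file name         */
    const long  n_words = 59467;        /* words pushed, from the run above */
    struct stat st;

    if (stat(db_file, &st) != 0) {
        perror(db_file);
        return 1;
    }
    printf("%s: %ld bytes, %.1f bytes per word\n",
           db_file, (long)st.st_size, (double)st.st_size / n_words);
    return 0;
}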
> This would be worth a try since we could do standard index compression on
> it. On the other hand, it would require rolling our own code for this,
> which may or may not be useful. However, I think we should try this before
> we muck around in the Berkeley code.
I'm skeptical. Whatever we try, I'm *sure* we must use the db code as
a basis; implementing something similar from scratch is a very difficult
task. Although Tim's code has interesting ideas, it is not scalable,
for the following reasons:
 1) he assumes that a list of postings can always be read completely
    into memory (as htdig does, btw :-); see the cursor sketch below
    for how db lets us stream them instead
 2) he assumes that the list of words and the list of document names
    (URLs for us) are small enough to fit in memory
 3) it is not thread safe and has no transactions.
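To be concrete about point 1: with db and DB_DUP we do not need the
whole posting list in memory, a cursor can walk the duplicates stored
under one word key one entry at a time. A minimal sketch of the pattern
(written from memory against the DB->cursor / c_get interface, so check
the exact signatures against your db.h; walk_postings and the posting
encoding are just placeholders):

/* Walk the postings stored as DB_DUP duplicates under one word key,
 * one entry at a time, instead of reading the whole list in memory.
 * Sketch only: cursor signatures differ slightly between db releases. */
#include <string.h>
#include <db.h>

int walk_postings(DB *db, const char *word)
{
    DBC *cursor;
    DBT key, data;
    int ret;

    memset(&key, 0, sizeof key);
    memset(&data, 0, sizeof data);
    key.data = (void *)word;
    key.size = strlen(word);

    if ((ret = db->cursor(db, NULL, &cursor, 0)) != 0)
        return ret;

    /* Position on the first duplicate for this word... */
    ret = cursor->c_get(cursor, &key, &data, DB_SET);
    while (ret == 0) {
        /* ...process one posting here (data.data, data.size)... */
        ret = cursor->c_get(cursor, &key, &data, DB_NEXT_DUP);
    }

    cursor->c_close(cursor);
    return ret == DB_NOTFOUND ? 0 : ret;
}

The same cursor pattern would let us merge or compress postings
incrementally instead of materializing the whole list, which is exactly
where Tim's approach stops scaling.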
Cheers,
--
Loic Dachary
ECILA
100 av. du Gal Leclerc
93500 Pantin - France
Tel: 33 1 56 96 09 80, Fax: 33 1 56 96 09 61
e-mail: [EMAIL PROTECTED] URL: http://www.senga.org/