Hi,
That's really cool that Sleepycat wakes up ;-) It saves a lot
of time. I've been thinking all day about the structure of the word
database, and I'm pretty much convinced that you got it right from the
beginning. There must be a node for every word occurrence, and they must
be sorted. Storing postings in strings saves more space but makes things
much more complex for updating and scalability.
My tests show that storing an entry for every word leads to a
word index that is 100% of the indexed data, we already know that. If
we compress it (using full page compression or individual entry
compression) the reward is 50% space saving. And we are in the
reasonable/standard range for a dynamic index. It saves more space
than prefix compression (I mean prefix compression on leaf nodes) and
has none of its drawbacks (locking).
In order to efficiently use such an index I suggest that we have
the following structure for an entry:
key:
word\001\
<packed document id>\
<packed tag identifier (what you call flag)>\
<packed location>
data:
nothing (or word weight, if you insist)
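To make the entry layout concrete, here is a minimal sketch of building such a key in C. The exact packed format is not specified in this mail, so I am assuming a simple <length byte><big-endian bytes> encoding for the numbers; pack_num and make_key are hypothetical names.

```c
#include <stdint.h>
#include <string.h>

/* Assumed packing: one length byte, then the value's bytes,
   most significant first (no leading zero bytes). */
static size_t pack_num(uint8_t *out, uint32_t v)
{
    uint8_t tmp[4];
    size_t n = 0;
    do { tmp[n++] = v & 0xff; v >>= 8; } while (v);
    out[0] = (uint8_t)n;
    for (size_t i = 0; i < n; i++)   /* emit most significant byte first */
        out[1 + i] = tmp[n - 1 - i];
    return 1 + n;
}

/* Build the key: word, \001 separator, then packed
   document id, tag identifier, and location. */
static size_t make_key(uint8_t *out, const char *word,
                       uint32_t doc, uint32_t tag, uint32_t loc)
{
    size_t off = strlen(word);
    memcpy(out, word, off);
    out[off++] = '\001';
    off += pack_num(out + off, doc);
    off += pack_num(out + off, tag);
    off += pack_num(out + off, loc);
    return off;
}
```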
The db_compare function will first compare word, if equal compare
document id, if equal compare tag identifier, if equal compare location. And
we have a multilevel sort. And we come back to my idea of hierarchical
organization of the key. It's all there; we only have to make the db_compare
function general enough to read parameters that define how the numbers after
the word are organized. If we need a highly efficient db_compare, one could
be provided through a shared library.
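The multilevel sort could look like the sketch below, assuming each number is packed as a length byte followed by big-endian value bytes. This is only the comparison logic; the Berkeley DB bt_compare callback wrapper is omitted since its exact signature depends on the DB version.

```c
#include <stdint.h>

/* Decode one assumed <length byte><big-endian bytes> field. */
static uint32_t unpack_num(const uint8_t **p)
{
    const uint8_t *s = *p;
    uint32_t v = 0;
    uint8_t n = *s++;
    while (n--)
        v = (v << 8) | *s++;
    *p = s;
    return v;
}

/* Multilevel compare: word first, then document id,
   then tag identifier, then location. */
static int key_compare(const uint8_t *a, const uint8_t *b)
{
    /* compare the word part up to the \001 separator;
       \001 sorts below any letter, so shorter words come first */
    while (*a != '\001' || *b != '\001') {
        if (*a != *b)
            return (*a < *b) ? -1 : 1;
        a++; b++;
    }
    a++; b++;
    for (int field = 0; field < 3; field++) {  /* doc id, tag, location */
        uint32_t va = unpack_num(&a), vb = unpack_num(&b);
        if (va != vb)
            return (va < vb) ? -1 : 1;
    }
    return 0;
}
```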
Unrolling the loop to the traditional word -> postings entry
organization, we indeed have something that is very similar. Each leaf
page of the db tree has a list of key/data entries (postings). The only
disturbing thing is that we repeat the word in each posting (that's
what bothered me initially). But page compression or entry compression
reduces this overhead to a reasonable figure. If we assume that we can
have an index that is 50% of the size of the data and that the optimal
rate would be 30%, we lose 20%. But we win a dynamic index. Of that,
0.6% is for the internal pages of the BTree. I have not calculated how
the remaining 19.4% is split between administrative data within a page
and word redundancy.
I'll start experimenting with compression tomorrow because
it's the last point that could bring us bad surprises. Regarding balancing
and page ordering, I'm sure implementing them won't be a problem. We will
just have to clearly specify that no index update may occur during
balancing.
Regarding boolean query resolution, walking lists of word occurrences
becomes easy and efficient. When resolving a query we can open as many
cursors as there are distinct words in the query and walk them in parallel,
solving the query in time/space that is *not* always proportional to
the most frequent word. Getting the next 10 matches will be as fast as
getting the first 10 matches: by providing only the last document id returned
by the previous query, we set the cursor of each word using a partial match
and we're ready. Speaking of that, I'll have to check with Keith that cursors
are not too expensive (they get updated when the index is modified; I should
ask what the overhead is). I also understand that this will require reworking
the current query resolution code, since the parsing is mixed with the
resolution. I'm sure it can be done efficiently.
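The parallel-cursor walk for an AND query boils down to a merge of sorted posting lists. A minimal sketch of the two-word case, using sorted arrays as stand-ins for the cursors (an assumption of this sketch); with real cursors, a set-range seek on the larger id would let us skip ahead instead of stepping one entry at a time:

```c
#include <stddef.h>
#include <stdint.h>

/* Intersect two sorted document-id lists: advance whichever
   side is behind, emit ids present in both. O(na + nb) worst
   case, and never worse than the shorter list times a seek. */
static size_t intersect(const uint32_t *a, size_t na,
                        const uint32_t *b, size_t nb,
                        uint32_t *out)
{
    size_t i = 0, j = 0, k = 0;
    while (i < na && j < nb) {
        if (a[i] < b[j])
            i++;                       /* a is behind: advance a */
        else if (b[j] < a[i])
            j++;                       /* b is behind: advance b */
        else {
            out[k++] = a[i];           /* match: document has both words */
            i++; j++;
        }
    }
    return k;
}
```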
Let me know if you like the key organization, please.
Have fun,
--
Loic Dachary
ECILA
100 av. du Gal Leclerc
93500 Pantin - France
Tel: 33 1 56 96 09 80, Fax: 33 1 56 96 09 61
e-mail: [EMAIL PROTECTED] URL: http://www.senga.org/