At 5:16 PM -0400 6/7/01, Marcio Marchini wrote:
>       For instance, is it keyed by word and the associated value is a list
>of document pointers ? Do you use numbers to identify these doc.
>pointers ? Do you do any compression of the list, like just storing
>the delta value from the previous number/pointer, and then using
>variable-bit encoding to represent these deltas ? etc.

Unfortunately, Berkeley DB (the B-Tree in particular) is a bit 
tricky to use for this. Obviously you'd like to keep the value as a 
list of location pointers (word position and doc ID) and compress 
them essentially as you mention. But if you're building the database 
on the fly, you take a horrible speed hit, because every new document 
means replacing the existing list for each word it contains.
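
For reference, the delta-plus-variable-length scheme from the question can be sketched roughly as below. This is only an illustration, not the htword/mifluz format: it uses byte-aligned 7-bit groups rather than a true variable-bit code, it handles doc IDs only (no word positions), and the function names are made up for the example.

```python
def encode_varint(n):
    """Encode a non-negative integer as 7-bit groups, low-order group first.

    The high bit of each byte is set when more bytes follow.
    """
    if n < 0:
        raise ValueError("varint encoding needs a non-negative integer")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more groups follow
        else:
            out.append(byte)         # last group
            return bytes(out)


def encode_postings(doc_ids):
    """Sort the doc IDs and store each one as the gap from its predecessor."""
    out = bytearray()
    prev = 0
    for doc_id in sorted(doc_ids):
        out += encode_varint(doc_id - prev)
        prev = doc_id
    return bytes(out)


def decode_postings(data):
    """Rebuild the sorted doc ID list from the encoded gaps."""
    doc_ids = []
    current = 0  # running sum of gaps, i.e. the last doc ID seen
    value = 0    # gap being assembled
    shift = 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            current += value
            doc_ids.append(current)
            value = 0
            shift = 0
    return doc_ids
```

Small gaps take one byte each, which is where the saving comes from. It also shows the problem: appending a doc ID to an existing word means reading, extending and rewriting that word's whole encoded value.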

So the htword/mifluz code uses a more unconventional approach: store 
all the words as separate keys (ugh!) but compress the keys based on 
their shared prefix. So if we have a node "htdig" and then 
"htdig.org" under it, the latter can be stored as something like 
"#.org".
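
A rough sketch of the prefix idea, assuming the keys are visited in sorted order: each key is stored as the number of characters it shares with the previous key plus the remaining suffix. This is generic front coding written for illustration, not the actual htword/mifluz on-disk layout, and the function names are hypothetical.

```python
def common_prefix_len(a, b):
    """Return how many leading characters a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def front_encode(sorted_keys):
    """Turn sorted keys into (shared_prefix_length, suffix) pairs."""
    entries = []
    prev = ""
    for key in sorted_keys:
        shared = common_prefix_len(prev, key)
        entries.append((shared, key[shared:]))
        prev = key
    return entries


def front_decode(entries):
    """Rebuild the original keys from (shared_prefix_length, suffix) pairs."""
    keys = []
    prev = ""
    for shared, suffix in entries:
        key = prev[:shared] + suffix
        keys.append(key)
        prev = key
    return keys
```

With the keys "htdig", "htdig.org", "html", the second entry comes out as (5, ".org"), which plays the role of the "#.org" above.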

There are some additional tricks thrown in to deal with keeping the 
level of branching manageable, etc.

-- 
--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/

