According to Neal Richter:
> On Tue, 22 Oct 2002, Geoff Hutchison wrote:
> > On Tuesday, October 22, 2002, at 02:50 PM, Neal Richter wrote:
> > >
> > > It looks to me like the db.words.db is using only a 'key' value, and
> > > has a blank 'value' for each and every key!
> >
> > Nope.  Remember that "value" as it currently stands is the anchor--if
> > any.  So if your documents don't have anchors defined, the value will
> > never be entered.  The key needs to be unique, so it has a structure:
> >
> > word // DocID // flags // location  -> anchor
>
> 	Could you give an example where the anchor won't be empty?  I tried a
> couple things and couldn't get one.
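To make Geoff's compound key concrete, here is a hypothetical serializer. The WordKey name and the exact field widths are my own assumptions for illustration, not htdig's actual on-disk layout; the point is that big-endian integer encoding makes Berkeley DB's default lexicographic key ordering keep all postings for a word adjacent, sorted by DocID and then location.

```cpp
#include <cstdint>
#include <string>

// Sketch of the compound key word // DocID // flags // location.
// Field widths are assumptions for illustration only.
struct WordKey {
    std::string word;
    uint32_t    doc_id;
    uint8_t     flags;
    uint32_t    location;
};

// Serialize so that lexicographic byte comparison orders keys by
// word, then DocID, then location (big-endian integer encoding).
std::string packKey(const WordKey &k) {
    std::string out = k.word;
    out.push_back('\0');                       // terminator keeps words distinct
    for (int shift = 24; shift >= 0; shift -= 8)
        out.push_back(char((k.doc_id >> shift) & 0xff));
    out.push_back(char(k.flags));
    for (int shift = 24; shift >= 0; shift -= 8)
        out.push_back(char((k.location >> shift) & 0xff));
    return out;
}
```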
"anchor" won't be empty if "word" occurs after a <a name="foo..."> tag.
The db.docdb record for DocID has a list of all anchor names encountered,
so "anchor" in the db.words.db record should be an index into that list.

> > > A more efficient solution to make the index smaller would be this:
> > > key = 'people\02', value = '00\00\00\00\c9\cb\ce\00'
> >
> > OK, remember what I said a while ago about the data structure here.
> > (a) Yes, it's "brain dead" for a traditional inverted index--but a
> > traditional inverted index is done in 2 steps: index, then sort/invert.
> > You have different problems if you're creating a "live" database in
> > one step.
> > (b) B-Tree databases get pretty convoluted if you start growing the
> > size of the value record as you index: i.e. I add a word and then
> > start tacking on new DocIDs...  This happens because the database
> > essentially becomes fragmented in the same way that files can get
> > fragmented over a hard drive.
>
> 	Now I understand what you meant ;-).  I was confused since my
> conceptual model was using the value for word-num and flags.
>
> 	Would you explain your previous idea in more detail?

How much of this database fragmentation is due to the fact that there
are records of different lengths, and how much is due to updating a
given record from one length to a larger one?  E.g., instead of having
a whole bunch of entries like this...

	word // DocID // flags // location  -> anchor

...what if we had entries like this...

	word // DocID  -> flags/location/anchor flags/location/anchor ...

...but instead of making database updates each time another word is
parsed (as is done now in 3.2, if I'm not mistaken), how about if htdig
stored this information in memory, as it did in 3.1, and then just
dumped out all the records like the above after the whole document is
parsed?  That way, none of the records ever has to be updated and
lengthened.  They're just written once.
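The "dump once per document" idea could look roughly like this. The class and method names are invented for illustration; real code would serialize each (word, DocID) entry into a single Berkeley DB put, so no record is ever grown in place.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Sketch of buffering all occurrences of each word while a document is
// parsed, then emitting one key -> packed-value record per word when
// the document is done.  Names are hypothetical, not htdig's own.
struct Occurrence {
    uint8_t  flags;
    uint32_t location;
};

class WordAccumulator {
public:
    void addWord(const std::string &word, uint8_t flags, uint32_t location) {
        buffer_[word].push_back({flags, location});
    }

    // Called once per document: hand back each word's full occurrence
    // list so the caller can write it as a single database record.
    std::map<std::string, std::vector<Occurrence>> flush() {
        auto out = std::move(buffer_);
        buffer_.clear();   // moved-from map reset for the next document
        return out;
    }

private:
    std::map<std::string, std::vector<Occurrence>> buffer_;
};
```

Each record is written exactly once per dig, which sidesteps the in-place value growth Geoff describes in point (b).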
In an update dig, the DocID for a given URL changes anyway if the
document is reparsed, so we'd end up deleting all the ones with the old
DocID and adding some with the new one.  Would this lead to too much
fragmentation due to the different record lengths?

Am I oversimplifying things, or would something like this help with
htdig's performance while indexing?

> > So we take the above and think and performance test...
> > * While the Berkeley DB allows you to have duplicate keys, performance
> > suffers some.  IIRC this happens b/c you get non-uniform trees.
> > * You want to store as much as possible in the value field, *but* need
> > to have a unique key.
>
> 	I think we can avoid duplicate keys with this scenario:
>
> ['people' is in a document 9 times]
>
> [Row 1]
> key = people\02\00\00\00\00\c9\00    value == \00\c9\00\cb\00\ce\00\d1
>
> [Row 2]
> key = people\02\00\00\00\00\d4\00    value == \00\d4\00\df\00\ee\00\00
>
> 	I'm not sure where it gets implemented, but a row isn't written until
> there are X references to it or until we change documents.  This would
> involve a cache of rows in memory, which is probably in the classes
> somewhere.
>
> 	This avoids updating previously written rows and keeps the value size
> to a known quantity.

I think even optimizations like this become easier if we don't dump out
any of the db.words.db records until a document is fully parsed, and
then dump them all out at once.  Am I wrong?  I know that 3.2 is
supposed to allow indexing a live database on the fly, and still have
it be searchable, but that doesn't mean the DB needs to be updated a
word at a time.  Doing it a document at a time should make sense, just
as db.docdb is updated.

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)
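Neal's fixed-size-row cache might be sketched like this, assuming a MAX_PER_ROW of 4 locations as the "X references" threshold and invented type names. The first location in a row goes into the key, which keeps keys unique (avoiding the duplicate-key B-tree penalty) while bounding the value size.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical threshold: flush a row after this many locations.
constexpr size_t MAX_PER_ROW = 4;

struct Row {
    std::string word;
    uint32_t    doc_id;
    uint32_t    first_location;        // disambiguates rows for one word/doc
    std::vector<uint32_t> locations;   // at most MAX_PER_ROW entries
};

class RowCache {
public:
    void add(const std::string &word, uint32_t doc_id, uint32_t location) {
        // Flush the pending row if the word/document changed or it is full.
        if (!pending_.locations.empty() &&
            (pending_.word != word || pending_.doc_id != doc_id ||
             pending_.locations.size() >= MAX_PER_ROW))
            emit();
        if (pending_.locations.empty())
            pending_ = Row{word, doc_id, location, {}};
        pending_.locations.push_back(location);
    }

    // End of document: flush the remainder and return all completed rows.
    std::vector<Row> finish() {
        if (!pending_.locations.empty())
            emit();
        return std::move(rows_);
    }

private:
    void emit() {
        rows_.push_back(pending_);
        pending_.locations.clear();
    }
    Row pending_{};
    std::vector<Row> rows_;
};
```

With 9 occurrences of 'people' this yields three rows of 4, 4, and 1 locations; the exact packing in Neal's example differs, but the "never rewrite a row" property is the same.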
