On Tuesday, October 22, 2002, at 02:50 PM, Neal Richter wrote:
Nope. Remember that "value" as it currently stands is the anchor--if any. So if your documents don't have anchors defined, the value will never be entered. The key needs to be unique, so it has a structure:It looks to me like the db.words.db is using only a 'key' value, and has a blank 'value' for each and every key!
word // DocID // flags // location -> anchor
OK, remember what I said a while ago about the data structure here.A more efficient solution to make the index smaller would be this: key = 'people\02', value = '00\00\00\00\c9\cb\ce\00'
(a) Yes, it's "brain dead" for an traditional inverted index--but a traditional inverted index is done in 2 steps: index, then sort/invert. You have different problems if you're creating a "live" database in one step.
(b) B-Tree databases get pretty convoluted if you start growing the size of the value record as you index: i.e. I add a word and then start tacking on new DocIDs... This happens because the database essentially becomes fragmented in the same way that files can get fragmented over a hard drive.
So we take the above and think and performance test...
* While the Berkeley DB allows you to have duplicate keys, performance suffers some. IIRC this happens b/c you get non-uniform trees.
* You want to store as much as possible in the value field, *but* need to have a unique key.
OK, you could probably shove "flags" into the value, like so--but I'm not sure if that'll make a huge difference:
word // DocID // location -> flags // anchor
I'd be glad to go into more depth into why the current database structure has been set up this way. There is probably a better structure, like I outlined a few weeks ago. But you simply can't build an on-line inverted word index in one fell swoop and use a traditional design. (Google doesn't do it, the current 3.2 code has never attempted to do it...)
If you'd like, I can probably pull up some of the original message threads.
-Geoff
-------------------------------------------------------
This sf.net emial is sponsored by: Influence the future of Java(TM) technology. Join the Java Community Process(SM) (JCP(SM)) program now. http://ad.doubleclick.net/clk;4699841;7576301;v?http://www.sun.com/javavote
_______________________________________________
htdig-dev mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/htdig-dev
