According to Neal Richter:
> On Tue, 22 Oct 2002, Geoff Hutchison wrote:
> > On Tuesday, October 22, 2002, at 02:50 PM, Neal Richter wrote:
> > >
> > > It looks to me like the db.words.db is using only a 'key' value, and
> > > has a blank 'value' for each and every key!
> >
> > Nope.  Remember that "value" as it currently stands is the anchor--if
> > any.  So if your documents don't have anchors defined, the value will
> > never be entered.  The key needs to be unique, so it has a structure:
> >
> > word // DocID // flags // location  -> anchor
>
> 	Could you give an example where the anchor won't be empty?  I tried a
> couple things and couldn't get one.
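To make Geoff's compound key concrete, here is a hypothetical serializer. The WordKey name and the exact field widths are my own assumptions for illustration, not htdig's actual on-disk layout; the point is that big-endian integer encoding makes Berkeley DB's default lexicographic key ordering keep all postings for a word adjacent, sorted by DocID and then location.

```cpp
#include <cstdint>
#include <string>

// Sketch of the compound key word // DocID // flags // location.
// Field widths are assumptions for illustration only.
struct WordKey {
    std::string word;
    uint32_t    doc_id;
    uint8_t     flags;
    uint32_t    location;
};

// Serialize so that lexicographic byte comparison orders keys by
// word, then DocID, then location (big-endian integer encoding).
std::string packKey(const WordKey &k) {
    std::string out = k.word;
    out.push_back('\0');                       // terminator keeps words distinct
    for (int shift = 24; shift >= 0; shift -= 8)
        out.push_back(char((k.doc_id >> shift) & 0xff));
    out.push_back(char(k.flags));
    for (int shift = 24; shift >= 0; shift -= 8)
        out.push_back(char((k.location >> shift) & 0xff));
    return out;
}
```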
"anchor" won't be empty if "word" occurs after a <a name="foo..."> tag.
The db.docdb record for DocID has a list of all anchor names encountered,
so "anchor" in the db.words.db record should be an index into that list.

> > > A more efficient solution to make the index smaller would be this:
> > > key = 'people\02', value = '00\00\00\00\c9\cb\ce\00'
> >
> > OK, remember what I said a while ago about the data structure here.
> > (a) Yes, it's "brain dead" for a traditional inverted index--but a
> > traditional inverted index is done in 2 steps: index, then sort/invert.
> > You have different problems if you're creating a "live" database in
> > one step.
> > (b) B-Tree databases get pretty convoluted if you start growing the
> > size of the value record as you index: i.e. I add a word and then
> > start tacking on new DocIDs...  This happens because the database
> > essentially becomes fragmented in the same way that files can get
> > fragmented over a hard drive.
>
> 	Now I understand what you meant ;-).  I was confused since my
> conceptual model was using the value for word-num and flags.
>
> 	Would you explain your previous idea in more detail?

How much of this database fragmentation is due to the fact that there
are records of different lengths, and how much is due to updating a
given record from one length to a larger one?  E.g., instead of having
a whole bunch of entries like this...

	word // DocID // flags // location  -> anchor

...what if we had entries like this...

	word // DocID  -> flags/location/anchor flags/location/anchor ...

...but instead of making database updates each time another word is
parsed (as is done now in 3.2, if I'm not mistaken), how about if htdig
stored this information in memory, as it did in 3.1, and then just
dumped out all the records like the above after the whole document is
parsed?  That way, none of the records ever has to be updated and
lengthened.  They're just written once.
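The "dump once per document" idea could look roughly like this. The class and method names are invented for illustration; real code would serialize each (word, DocID) entry into a single Berkeley DB put, so no record is ever grown in place.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Sketch of buffering all occurrences of each word while a document is
// parsed, then emitting one key -> packed-value record per word when
// the document is done.  Names are hypothetical, not htdig's own.
struct Occurrence {
    uint8_t  flags;
    uint32_t location;
};

class WordAccumulator {
public:
    void addWord(const std::string &word, uint8_t flags, uint32_t location) {
        buffer_[word].push_back({flags, location});
    }

    // Called once per document: hand back each word's full occurrence
    // list so the caller can write it as a single database record.
    std::map<std::string, std::vector<Occurrence>> flush() {
        auto out = std::move(buffer_);
        buffer_.clear();   // moved-from map reset for the next document
        return out;
    }

private:
    std::map<std::string, std::vector<Occurrence>> buffer_;
};
```

Each record is written exactly once per dig, which sidesteps the in-place value growth Geoff describes in point (b).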
In an update dig, the DocID for a given URL changes anyway if the
document is reparsed, so we'd end up deleting all the ones with the old
DocID and adding some with the new one.  Would this lead to too much
fragmentation due to the different record lengths?

Am I oversimplifying things, or would something like this help with
htdig's performance while indexing?

> > So we take the above and think and performance test...
> > * While the Berkeley DB allows you to have duplicate keys, performance
> > suffers some.  IIRC this happens b/c you get non-uniform trees.
> > * You want to store as much as possible in the value field, *but* need
> > to have a unique key.
>
> 	I think we can avoid duplicate keys with this scenario:
>
> ['people' is in a document 9 times]
>
> [Row 1]
> key = people\02\00\00\00\00\c9\00    value == \00\c9\00\cb\00\ce\00\d1
>
> [Row 2]
> key = people\02\00\00\00\00\d4\00    value == \00\d4\00\df\00\ee\00\00
>
> 	I'm not sure where it gets implemented, but a row isn't written until
> there are X references to it or until we change documents.  This would
> involve a cache of rows in memory, which is probably in the classes
> somewhere.
>
> 	This avoids updating previously written rows and keeps the value size
> to a known quantity.

I think even optimizations like this become easier if we don't dump out
any of the db.words.db records until a document is fully parsed, and
then dump them all out at once.  Am I wrong?  I know that 3.2 is
supposed to allow indexing a live database on the fly, and still have
it be searchable, but that doesn't mean the DB needs to be updated a
word at a time.  Doing it a document at a time should make sense, just
as db.docdb is updated.

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)
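Neal's fixed-size-row cache might be sketched like this, assuming a MAX_PER_ROW of 4 locations as the "X references" threshold and invented type names. The first location in a row goes into the key, which keeps keys unique (avoiding the duplicate-key B-tree penalty) while bounding the value size.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical threshold: flush a row after this many locations.
constexpr size_t MAX_PER_ROW = 4;

struct Row {
    std::string word;
    uint32_t    doc_id;
    uint32_t    first_location;        // disambiguates rows for one word/doc
    std::vector<uint32_t> locations;   // at most MAX_PER_ROW entries
};

class RowCache {
public:
    void add(const std::string &word, uint32_t doc_id, uint32_t location) {
        // Flush the pending row if the word/document changed or it is full.
        if (!pending_.locations.empty() &&
            (pending_.word != word || pending_.doc_id != doc_id ||
             pending_.locations.size() >= MAX_PER_ROW))
            emit();
        if (pending_.locations.empty())
            pending_ = Row{word, doc_id, location, {}};
        pending_.locations.push_back(location);
    }

    // End of document: flush the remainder and return all completed rows.
    std::vector<Row> finish() {
        if (!pending_.locations.empty())
            emit();
        return std::move(rows_);
    }

private:
    void emit() {
        rows_.push_back(pending_);
        pending_.locations.clear();
    }
    Row pending_{};
    std::vector<Row> rows_;
};
```

With 9 occurrences of 'people' this yields three rows of 4, 4, and 1 locations; the exact packing in Neal's example differs, but the "never rewrite a row" property is the same.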
