Re: [htdig-dev] db.words.db using only key and empty value?

Geoff Hutchison Mon, 04 Nov 2002 07:01:23 -0800

Sorry, I've been really busy and haven't had much time to comment on this.

On Saturday, November 2, 2002, at 08:29 PM, Gilles Detillieux wrote:

How much of this database fragmentation would be due to the fact that
there are records of different lengths, and how much would be due to
updating a given record from one length to a larger length.

It's the latter.

this information in memory as it did in 3.1, and then just dumped
out all the records like above after the whole document is parsed.
That way, none of the records ever have to be updated and lengthened.

No, words are only sent out once the document is parsed. It's not writing "along the way" if you will. But remember that while the document itself has the vast majority of the words, we still would have to update the records when we're indexing another document and happen to hit a link back to the first one. :-(

I implemented this type of caching very quickly using the STL with
slight modifications to the WordDB object. Mifluz contains a WordDBCache
object, but 3.2 doesn't use it, and it's excessively complicated in
my opinion.

Basically, it does two things. One, it caches up to a certain amount of data--and this is already used by 3.2 by default. Two, the mifluz code will write out small temporary files along the way. As I wrote the mifluz-merge code, it collapsed the temporary files every 10 documents (and I've checked performance by varying that number). There is a decided performance improvement because the data can be stuffed back into the main DB with better ordering. There's also better performance on the initial writes because you're adding to a temporary file with a smaller lookup penalty.

Do we REALLY need this function to goto the complexity of calling 'Unpack'
and comparing the keys? Why not treat the entire key bitstream after the
word-string as a binary compare and return?

That depends on how the packing is done. Udi Manber (among others probably) outlined various strategies for comparing compressed strings w/o unpacking. In this case, I think as long as you have a one-to-one mapping and you're consistently doing comparisons for *every* key, you're fine. The ordering will be different on the compressed strings, but as long as everything is unique, I can't think of a problem.

-Geoff

-------------------------------------------------------
This SF.net email is sponsored by: ApacheCon, November 18-21 in
Las Vegas (supported by COMDEX), the only Apache event to be
fully supported by the ASF. http://www.apachecon.com
_______________________________________________
htdig-dev mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/htdig-dev

Re: [htdig-dev] db.words.db using only key and empty value?

Reply via email to