On Saturday, November 2, 2002, at 08:29 PM, Gilles Detillieux wrote:
How much of this database fragmentation would be due to the fact that there are records of different lengths, and how much would be due to updating a given record from one length to a larger length.
It's the latter.
No, words are only sent out once the document is parsed. It's not writing "along the way" if you will. But remember that while the document itself has the vast majority of the words, we still would have to update the records when we're indexing another document and happen to hit a link back to the first one. :-(this information in memory as it did in 3.1, and then just dumped out all the records like above after the whole document is parsed. That way, none of the records ever have to be updated and lengthened.
I implemented this type of caching very quickly using the STL withBasically, it does two things. One, it caches up to a certain amount of data--and this is already used by 3.2 by default. Two, the mifluz code will write out small temporary files along the way. As I wrote the mifluz-merge code, it collapsed the temporary files every 10 documents (and I've checked performance by varying that number). There is a decided performance improvement because the data can be stuffed back into the main DB with better ordering. There's also better performance on the initial writes because you're adding to a temporary file with a smaller lookup penalty.
slight modifications to the WordDB object. Mifluz contains a WordDBCache
object, but 3.2 doesn't use it, and it's excessively complicated in
my opinion.
Do we REALLY need this function to goto the complexity of calling 'Unpack'That depends on how the packing is done. Udi Manber (among others probably) outlined various strategies for comparing compressed strings w/o unpacking. In this case, I think as long as you have a one-to-one mapping and you're consistently doing comparisons for *every* key, you're fine. The ordering will be different on the compressed strings, but as long as everything is unique, I can't think of a problem.
and comparing the keys? Why not treat the entire key bitstream after the
word-string as a binary compare and return?
-Geoff
-------------------------------------------------------
This SF.net email is sponsored by: ApacheCon, November 18-21 in
Las Vegas (supported by COMDEX), the only Apache event to be
fully supported by the ASF. http://www.apachecon.com
_______________________________________________
htdig-dev mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/htdig-dev
