RE: [htdig] HTMerge memory problem

Sean Downey Thu, 03 Oct 2002 07:12:44 -0700

Hi Geoff, Gilles

have ye had a chance to look at this problem since?
nobody could fix it here :-(

thanks
Sean

-----Original Message-----
From: Geoff Hutchison 
Sent: Tuesday, July 02, 2002 5:55 PM
To: Sean Downey
Cc: Gilles Detillieux
Subject: RE: [htdig] HTMerge memory problem

> Is it a problem that could be explained and is it confined to a few code
> files??

It's definitely confined to one file: httools/htmerge.cc.

Nothing else will need to change, only the code there.

In the code, the "merge" prefix refers to the database being merged into
the other. So mergeWordDB would be the word database being merged into
wordDB.

I'll do my best to explain. Basically, the current htmerge code grabs a
List of all URLs in both databases and figures out duplicates. Then it
constructs the "merged" list of URLs. This eats some memory, but it's not
quite as bad as the next bit.
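
As a rough illustration of that URL step, here is a minimal sketch using standard containers rather than the ht://Dig document database. The names DocRecord, MergePlan and planMerge are hypothetical, and the rule that the newer copy wins is an assumption made for the example; htmerge.cc may decide differently.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for a document record; not an ht://Dig type.
struct DocRecord {
    std::string url;
    int docID;
    long modified;  // assumed tie-breaker: a larger value means newer
};

// For every URL present in both databases, exactly one side loses.
struct MergePlan {
    std::set<int> ignoreInMerge;     // DocIDs in the merged-in DB to skip
    std::set<int> obsoleteInTarget;  // DocIDs in the target DB to drop
};

// Compare the two URL lists and record the losing DocID for each
// duplicate. Assumption: the newer copy wins, and ties keep the target.
MergePlan planMerge(const std::vector<DocRecord>& target,
                    const std::vector<DocRecord>& merge) {
    std::map<std::string, const DocRecord*> byURL;
    for (const DocRecord& d : target) byURL[d.url] = &d;

    MergePlan plan;
    for (const DocRecord& m : merge) {
        auto it = byURL.find(m.url);
        if (it == byURL.end()) continue;  // URL only in the merged-in DB
        const DocRecord& t = *it->second;
        if (m.modified > t.modified)
            plan.obsoleteInTarget.insert(t.docID);
        else
            plan.ignoreInMerge.insert(m.docID);
    }
    return plan;
}
```

The two sets are what the word-merging step needs later: one says which entries to skip while copying, the other which entries to delete afterwards.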

The big memory hog starts with:
    // OK, after merging the doc DBs, we do the same for the words

then you'll see this, which is what's really bad:
(actually just noticed the comment before this says "URLs" when it should
say "words")
    // Start the merging by going through all the URLs that are in
    // the database to be merged

    words = mergeWordDB.WordRefs();

So then the code loops through and checks the DocID for each word--if it
belongs to a duplicate document that we should ignore, it skips it.
Otherwise, it adds the word to the other database (with a new DocID).

Finally, 
    words = wordDB.WordRefs();

Now it loops through the target DB (i.e. the one that received
everything) and deletes words that are in duplicate documents--i.e. they
were made obsolete by the mergeWordDB.
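
To make the memory cost concrete, the two passes can be sketched with standard containers. This is not the htmerge.cc code: WordStore, allRefs, eagerMerge and docIDOffset are hypothetical stand-ins, and shifting DocIDs by a fixed offset is only an assumption about how new DocIDs get assigned. The point is that each allRefs() call copies every record into memory before its loop starts.

```cpp
#include <set>
#include <string>
#include <utility>
#include <vector>

using WordRef = std::pair<std::string, int>;  // (word, DocID)

// Hypothetical in-memory stand-in for a word database.
struct WordStore {
    std::set<WordRef> refs;
    // Returns a full copy of every record: this is the expensive part.
    std::vector<WordRef> allRefs() const {
        return std::vector<WordRef>(refs.begin(), refs.end());
    }
};

void eagerMerge(WordStore& wordDB, const WordStore& mergeWordDB,
                const std::set<int>& ignoreInMerge,
                const std::set<int>& obsoleteInTarget,
                int docIDOffset) {
    // Pass 1: copy wanted entries from the merged-in DB into the target,
    // giving each a new DocID (assumed here: old DocID plus an offset).
    std::vector<WordRef> words = mergeWordDB.allRefs();
    for (const WordRef& w : words) {
        if (ignoreInMerge.count(w.second)) continue;
        wordDB.refs.insert(WordRef(w.first, w.second + docIDOffset));
    }

    // Pass 2: list the whole target again, then delete entries whose
    // document was made obsolete by the merged-in DB.
    words = wordDB.allRefs();
    for (const WordRef& w : words) {
        if (obsoleteInTarget.count(w.second)) wordDB.refs.erase(w);
    }
}
```

Peak memory here is roughly the size of the larger database's full word list, on top of the databases themselves.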

OK, documentation for htword/mifluz can be found at:

http://www.gnu.org/software/mifluz/doc.en.html
It's actually for a newer version of mifluz than is currently used by
ht://Dig. That version would be 0.14. But most of the API is
similar. Obviously see the headers in htword/ for the exact details. :-)

For the loop before deletion, you'll want to use the WordList::Cursor
methods to loop--you'll need to set up a callback, as the previous patch
did. The callback function would add the words from the mergeWordDB to
wordDB.
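
The callback approach for that first loop might look roughly like the sketch below. It does not use the real WordList::Cursor interface, whose exact signatures should be taken from the headers in htword/; ToyWordDB, walk, MergeContext and addIfWanted are hypothetical names, and the DocID offset is an assumption. What it illustrates is the shape of the change: the source database hands records to a callback one at a time, so no full list is built first.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <utility>

using WordRef = std::pair<std::string, int>;  // (word, DocID)

// Hypothetical stand-in for a word database with a callback-driven walk.
class ToyWordDB {
public:
    typedef void (*Callback)(const WordRef& ref, void* data);

    void insert(const WordRef& ref) { refs_.insert(ref); }
    bool contains(const WordRef& ref) const { return refs_.count(ref) != 0; }
    std::size_t size() const { return refs_.size(); }

    // Visit each record in turn, passing it and the caller's data pointer
    // to the callback. Nothing is copied into a temporary list.
    void walk(Callback cb, void* data) const {
        for (const WordRef& r : refs_) cb(r, data);
    }

private:
    std::set<WordRef> refs_;
};

// State the callback needs, passed through the void* data pointer.
struct MergeContext {
    ToyWordDB* target;
    const std::set<int>* ignoreInMerge;
    int docIDOffset;  // assumed way of assigning new DocIDs
};

// Callback: skip entries of ignored duplicates, add the rest to the target.
void addIfWanted(const WordRef& ref, void* data) {
    MergeContext* ctx = static_cast<MergeContext*>(data);
    if (ctx->ignoreInMerge->count(ref.second)) return;
    ctx->target->insert(WordRef(ref.first, ref.second + ctx->docIDOffset));
}
```

With this shape the merge loop reduces to one call such as mergeDB.walk(addIfWanted, &ctx), and memory use no longer scales with the number of word records.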

For the next loop, you'll want to use the WalkDelete method from the
WordList object to delete the words (rather than constructing a full list
in memory all at once!).
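
The idea of deleting while walking can be sketched the same way. This is not the WalkDelete method itself, whose real signature is in the htword/ headers; ToyWordIndex, walkDelete and dropObsolete are hypothetical, and the predicate-based interface is an assumption chosen for the example.

```cpp
#include <cstddef>
#include <functional>
#include <set>
#include <string>
#include <utility>

using WordRef = std::pair<std::string, int>;  // (word, DocID)

// Hypothetical stand-in for a word database that can delete during a walk.
class ToyWordIndex {
public:
    void insert(const WordRef& ref) { refs_.insert(ref); }
    bool contains(const WordRef& ref) const { return refs_.count(ref) != 0; }
    std::size_t size() const { return refs_.size(); }

    // Visit records one at a time and erase those the predicate selects.
    // Returns how many records were removed.
    std::size_t walkDelete(
            const std::function<bool(const WordRef&)>& shouldDelete) {
        std::size_t removed = 0;
        for (auto it = refs_.begin(); it != refs_.end(); ) {
            if (shouldDelete(*it)) {
                it = refs_.erase(it);  // erase returns the next iterator
                ++removed;
            } else {
                ++it;
            }
        }
        return removed;
    }

private:
    std::set<WordRef> refs_;
};

// Drop every entry whose document was superseded by the merged-in DB.
std::size_t dropObsolete(ToyWordIndex& wordDB,
                         const std::set<int>& obsoleteInTarget) {
    return wordDB.walkDelete([&](const WordRef& r) {
        return obsoleteInTarget.count(r.second) != 0;
    });
}
```

Only the set of obsolete DocIDs has to stay in memory, which is normally far smaller than the full word list.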

Hopefully this makes some sense. I'll be around for about another 2-3
hours.

-Geoff
