Hi,
    I've been using htdig (3.2.0b2) without any problems for some time.
Problems are appearing now that I'm adding more sites to the index and
merging them together by means of the htmerge -m option.

It works, but it uses a lot of memory (with my setup the process grew
beyond 150 MB). So I tried to track down the memory usage, and I found
that two peaks were related to these two calls:

    words = mergeWordDB.WordRefs();
and
    words = wordDB.WordRefs();

found in htmerge/db.cc.
I think these methods load the whole content of the word database into
memory as a list (words). The list is then walked sequentially, performing
some actions on its elements, and then released.

Looking at the code in htword/WordList.cc, I found that a callback
interface is available by means of the class WordSearchDescription, which
makes it possible to run any action while looping over the database
elements without loading them all into memory.
So I put together the attached patch against htdig-3.2.0b2. In my few tests
it seems to work, cuts memory requirements to about 50%, and does not
increase execution time.
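
To make the difference between the two styles concrete, here is a minimal
sketch. It does not use the real htdig classes: ToyWordDB, WordEntry,
allRefs() and walk() are made-up names for illustration only, and the toy
database keeps its entries in a vector rather than on disk.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for one entry of the word database.
struct WordEntry {
    std::string word;
    int docID;
};

// Hypothetical stand-in for the word database (not the htdig API).
class ToyWordDB {
public:
    explicit ToyWordDB(std::vector<WordEntry> entries)
        : entries_(std::move(entries)) {}

    // List style: hands back a copy of every entry at once, so the
    // caller's memory use grows with the size of the database.
    std::vector<WordEntry> allRefs() const { return entries_; }

    // Callback style: visits one entry at a time. The callback returns
    // false to stop the walk early.
    void walk(const std::function<bool(const WordEntry&)>& visit) const {
        assert(visit);  // an empty std::function cannot be called
        for (const WordEntry& e : entries_) {
            if (!visit(e)) break;
        }
    }

private:
    std::vector<WordEntry> entries_;
};

// Counts the entries of one document by first materializing the full list.
std::size_t countWithList(const ToyWordDB& db, int docID) {
    std::vector<WordEntry> words = db.allRefs();  // whole database copied
    std::size_t n = 0;
    for (const WordEntry& w : words) {
        if (w.docID == docID) ++n;
    }
    return n;
}

// Counts the same entries through the callback, without the extra list.
std::size_t countWithCallback(const ToyWordDB& db, int docID) {
    std::size_t n = 0;
    db.walk([&](const WordEntry& w) {
        if (w.docID == docID) ++n;
        return true;  // keep walking
    });
    return n;
}
```

Both functions return the same count. The only difference is that the
callback version never holds a second copy of all the entries.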

Am I missing anything?


By the way, while studying the code I found what I think is a bug.
htmerge/db.cc contains the following code:

    String docIDKey;
    ...
    docIDKey = word->DocID();
    if (merge_dup_ids.Exists(docIDKey))
    ...

The DocID() method returns an int, but the String class does not have an
overloaded = operator that converts an int to a string (which I suppose is
the intended operation), so I think it uses the constructor
String::String(int init)
which allocates an empty string. Let me know if I'm wrong.
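
The effect I suspect can be reproduced with a small sketch. MiniString
below is a made-up class, not the htdig String: I am only assuming that it
has a non-explicit int constructor that reserves storage and no int
assignment operator, as described above. keyByAssignment and
keyByFormatting are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical miniature of a string class with an int "capacity" constructor.
class MiniString {
public:
    MiniString() {}
    // Non-explicit on purpose: reserves room but stores no characters.
    MiniString(int init) {
        assert(init >= 0);  // a capacity hint must not be negative
        data_.reserve(static_cast<std::size_t>(init));
    }
    MiniString(const char* s) : data_(s) {}

    std::size_t length() const { return data_.size(); }
    const std::string& get() const { return data_; }

private:
    std::string data_;
};

// Mirrors the suspicious pattern: the int is implicitly converted with
// MiniString(int), then copy-assigned, so the key ends up empty.
MiniString keyByAssignment(int docID) {
    MiniString key;
    key = docID;
    return key;
}

// What was probably intended: the decimal text of the ID.
MiniString keyByFormatting(int docID) {
    return MiniString(std::to_string(docID).c_str());
}
```

Under these assumptions keyByAssignment(42) has length 0, while
keyByFormatting(42) holds "42". If the real class behaves the same way,
every document ID would produce the same empty key in the Exists() lookup.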

Best Regards,
    Lorenzo Campedelli

htmerge.patch
