Hello everybody,
while doing some tests with a repeating dig-merge script,
I found what I think is a bug in the merge phase:

If you try to merge a database together with a copy of itself,
i.e. something like:

        htmerge -c db -m db.copy

you obtain a much bigger db.words.db.
While investigating this problem, I found that the check
for old duplicates is performed the wrong way around: if
the new entry has the same date as the one in the database,
then instead of discarding the new one, the existing one is
removed from the database and the new one is added.

The code is the following in htmerge/db.cc

if ( old_ref->DocTime() > ref->DocTime() )
    {
    // Cool, the ref we're merging is too old, just ignore it
Now, the ">" should become a ">=".

This fixes the merge of documents having duplicate entries:
I usually perform an incremental dig of a single site and then
merge it into a global database for many sites, so duplicate
URLs are rather frequent.

An open question is why merging the same database the old
way gives me a much bigger (up to 10 times) db.words.db ...
Any ideas?

Attached you can find the patch (htmerge1.patch), but as it is
based on a previous patch of mine to the same file (htmerge.patch,
the one which greatly reduced memory usage in the merge phase),
I will attach the latter as well. Both are for htdig-3.2.0b2.

Best regards,
    Lorenzo

htmerge1.patch

htmerge.patch
