Hello everybody,
while doing some tests with a repeated dig-merge script,
I found what I think is a bug in the merge phase:
if you try to merge a database with a copy of itself,
i.e. something like:
htmerge -c db -m db.copy
you end up with a much bigger db.words.db.
While investigating this problem, I found that the check
for old duplicates is performed the wrong way: if
the new entry has the same date as the one already in the
database, instead of discarding the new one, the existing
one is removed from the database and the new one is added.
The offending code is in htmerge/db.cc:

if ( old_ref->DocTime() > ref->DocTime() )
{
    // Cool, the ref we're merging is too old, just ignore it

The ">" should become a ">=", so that an incoming entry with
the same date as the existing one is ignored as well.
This fixes the merging of documents that have duplicate entries:
I usually perform an incremental dig of a single site and then
merge it into a global database covering many sites, so duplicate
URLs are quite frequent.
An open question is why merging the same database the old
way gives me a db.words.db up to 10 times bigger ...
Any ideas?
Attached you can find the patch (htmerge1.patch). Since it is
based on a previous patch of mine to the same file (htmerge.patch,
the one that greatly reduced memory usage in the merge phase),
I am attaching that one as well. Both are against htdig-3.2.0b2.
Best regards,
Lorenzo
htmerge1.patch
htmerge.patch
------------------------------------
To unsubscribe from the htdig3-dev mailing list, send a message to
[EMAIL PROTECTED]
You will receive a message to confirm this.