Thanks for the response, Andrzej.

> Mergesegs also performs dedup. If you compare the list of URLs in the index built from the original input segments (but AFTER dedup) with the list of URLs in the index built from the merged segment, are they different?

I should have specified: I didn't run indexing after merging. I just ran
bin/nutch mergesegs -dir mydb/segments (no -i or -ds options)
and then immediately did a segread on the new merged segment.
The list of URLs is different -- mostly missing URLs, but also
some "new" URLs.

I find the addition of new URLs in the merged segments especially
puzzling. Where do they come from? Is segread lying to me about
what's in the original segments?

I checked the segread output on the deleted URLs and I don't
find anything strange in their status.

I have a feeling that the mergesegs dedup is what is causing the
problem, since when I commented out that code, the list of URLs
is the same before and after merging. It's possible that I have
some sort of corruption in the original segments that is causing
unpredictable behavior in the mergesegs dedup code.

> Could you perhaps provide a minimal fetchlist and the exact steps you took, to illustrate and reproduce the problem?

I don't have a minimal fetchlist right now. I'll see if I can get one
together. I wouldn't be surprised if the problem only occurred after
getting a significant number of pages.

Thanks,
Howie

