Thanks for the response, Andrzej.
Mergesegs also performs dedup. If you compare the list of URLs in the index built from the original input segments (but AFTER dedup) with the list in the index built from the merged segment, are they different?
I should have specified: I didn't run the indexer after merging. I just ran bin/nutch mergesegs -dir mydb/segments (no -i or -ds options) and then immediately ran segread on the new merged segment. The list of URLs is different -- mostly missing URLs, but also some "new" URLs. I find the appearance of new URLs in the merged segment especially puzzling. Where do they come from? Is segread lying to me about what's in the original segments? I checked the segread output for the missing URLs and didn't find anything unusual in their status. I suspect the dedup step in mergesegs is what's causing the problem, since when I commented out that code, the list of URLs is the same before and after merging. It's also possible that some corruption in my original segments is causing unpredictable behavior in the mergesegs dedup code.
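For reference, here is roughly how I compared the URL lists before and after merging. Treat it as a sketch rather than something to paste in: the segread flags and the dump output format are from memory and may differ in your Nutch version, the grep is just a crude way to pull URLs out of the dump, and <merged-segment> is a placeholder for whatever directory mergesegs created.

    # 1. Collect the URLs from every original segment (before merging).
    #    Assumes segread -dump prints each entry, including its URL, on stdout.
    for seg in mydb/segments/*; do
        bin/nutch segread -dump "$seg"
    done | grep -o 'http://[^ ]*' | sort -u > urls-before.txt

    # 2. Merge in place, with no indexing (-i) and no deletion of the
    #    original segments (-ds).
    bin/nutch mergesegs -dir mydb/segments

    # 3. Collect the URLs from the new merged segment the same way.
    bin/nutch segread -dump mydb/segments/<merged-segment> \
        | grep -o 'http://[^ ]*' | sort -u > urls-after.txt

    # 4. Compare: lines starting with '<' are URLs that disappeared in the
    #    merge, lines starting with '>' are URLs that only appear afterwards.
    diff urls-before.txt urls-after.txt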
Could you perhaps provide a minimal fetchlist + exact steps you took, to illustrate and reproduce the problem?
I don't have a minimal fetchlist right now, but I'll see if I can put one together. I wouldn't be surprised if the problem only shows up after fetching a significant number of pages.

Thanks,
Howie
