I'm on Nutch 0.7, and I recently noticed that after merging segments, a lot of URLs I expected to be there had disappeared. I ran segread -dumpsort on the original segments and on the merged segment and found that I had lost 30% of my URLs.
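A before/after comparison like this can be sketched in a few lines of Python. This is only an illustration and not part of Nutch: it assumes the URLs from each dump have already been extracted into plain sets of strings, and the function name compare_url_sets and the sample URLs are made up for the example.

```python
def compare_url_sets(original, merged):
    """Compare the URLs found before and after a segment merge.

    Both arguments are sets of URL strings (hypothetical inputs, e.g.
    extracted from a dump of the original and merged segments).
    Returns (lost, added, lost_fraction).
    """
    lost = original - merged    # present before the merge, missing after
    added = merged - original   # present only after the merge
    lost_fraction = len(lost) / len(original) if original else 0.0
    return lost, added, lost_fraction


if __name__ == "__main__":
    # Made-up sample data standing in for the real URL lists.
    before = {"http://a.example/", "http://b.example/",
              "http://c.example/", "http://d.example/"}
    after = {"http://a.example/", "http://b.example/", "http://e.example/"}

    lost, added, fraction = compare_url_sets(before, after)
    print(f"lost {len(lost)} of {len(before)} URLs ({fraction:.1%})")
    print(f"URLs only in the merged set: {sorted(added)}")
```

A non-empty second set is what a diff would show as URLs that appear only in the merged segment.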
Doing a diff on the URL files, I found that some URLs were even resurrected (they didn't show up in the original segments, but showed up in the merged segment). I checked the logs and there was one small corrupted segment (not enough to account for the lost URLs), but mergesegs just seemed to ignore it and carry on.
I commented out the code in SegmentMergeTool.java that deals with deleting duplicates, and the problem went away: I now get the same set of URLs before and after merging.
My plan for now is to keep this deletion code commented out locally and run bin/nutch dedup on the merged index, but I was wondering if anyone else has seen this problem in either 0.7 or 0.8. Any ideas on why it might be happening?
Thanks!
Howie
