I'm on Nutch 0.7, and I recently noticed that after
merging segments, a lot of URLs that I expected to be
there had disappeared. I ran segread -dumpsort on the
original segments and on the merged segment and found
that I had lost 30% of my URLs.

Doing a diff on the URL files, I found that some URLs were
even resurrected (they didn't show up in the original
segments, but showed up in the merged segment).
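
For anyone who wants to repeat the comparison, here is a minimal sketch of the set arithmetic behind that diff. It assumes the URLs have already been extracted from each dump into plain text files, one URL per line. The helper names (read_urls, compare_url_sets) are made up for illustration and are not part of Nutch.

```python
def read_urls(path):
    """Return the set of non-empty lines in a text file (assumed: one URL per line)."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}


def compare_url_sets(original, merged):
    """Compare URLs from the original segments against the merged segment.

    Returns a tuple (lost, resurrected, lost_fraction):
      lost          - URLs in the original segments but missing after the merge
      resurrected   - URLs in the merged segment that were not in the originals
      lost_fraction - len(lost) divided by the number of original URLs
    """
    original, merged = set(original), set(merged)
    lost = original - merged
    resurrected = merged - original
    lost_fraction = len(lost) / len(original) if original else 0.0
    return lost, resurrected, lost_fraction
```

Calling compare_url_sets(read_urls(a), read_urls(b)) on the two URL files gives the lost and resurrected sets directly, which is easier to read than raw diff output when the lists are large.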

I checked the logs and there was one small corrupted segment
(not enough to account for the lost URLs), but mergesegs
seemed to just ignore it and carry on.

I commented out the code in SegmentMergeTool.java that
deletes duplicates, and the problem went away: I now get
the same set of URLs before and after merging.

My plan for now is to locally comment out this deletion
code, and use bin/nutch dedup on the merged index, but
I was wondering if anyone else has seen this problem in either
0.7 or 0.8. Any ideas on why it might be happening?

Thanks!
Howie

