Howie Wang wrote:
I'm on Nutch 0.7, and I just noticed recently that after
merging segments, a lot of URLs that I thought should be
there disappeared. I did a segread -dumpsort on the
original segments and on the merged segment and found
that I had lost 30% of my URLs.

Doing a diff on the URL files, I found that some URLs were
even resurrected (they didn't show up in the original
segments, but showed up in the merged segment).

I checked the logs and there was one small corrupted segment
(not enough to account for the lost URLs), but mergesegs just
seemed to ignore it and go on.

I commented out the code in SegmentMergeTool.java that
had to do with deleting duplicates, and the problem went
away. I get the same set of URLs before and after merging.

My plan for now is to locally comment out this deletion
code, and use bin/nutch dedup on the merged index, but
I was wondering if anyone else has seen this problem in either
0.7 or 0.8. Any ideas on why it might be happening?

Mergesegs also performs dedup. If you build an index from the original input segments and run dedup on it, and then build an index from the merged segment, do the two indexes contain different lists of URLs?
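
One way to make that comparison concrete is sketched below. This is only a rough illustration, not part of Nutch: it assumes the URLs from each index have already been extracted to plain text files, one URL per line, and the function and file names are hypothetical.

```python
import sys
from typing import Iterable, Set, Tuple


def load_urls(lines: Iterable[str]) -> Set[str]:
    """Collect non-empty, whitespace-stripped URLs from an iterable of lines."""
    return {line.strip() for line in lines if line.strip()}


def compare_url_sets(
    before: Iterable[str], after: Iterable[str]
) -> Tuple[Set[str], Set[str]]:
    """Return (lost, added).

    lost  = URLs present in `before` but missing from `after`
    added = URLs present in `after` but missing from `before`
    """
    before_urls = load_urls(before)
    after_urls = load_urls(after)
    return before_urls - after_urls, after_urls - before_urls


# Hypothetical usage:
#   python compare_urls.py original_after_dedup.txt merged.txt
if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1]) as f_before, open(sys.argv[2]) as f_after:
        lost, added = compare_url_sets(f_before, f_after)
    print(f"lost: {len(lost)}, added: {len(added)}")
    for url in sorted(lost):
        print("-", url)
    for url in sorted(added):
        print("+", url)
```

If both sets come back empty, the merged segment and the deduped originals agree, and the missing URLs are probably just duplicates removed during the merge.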

Could you perhaps provide a minimal fetchlist + exact steps you took, to illustrate and reproduce the problem?

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

