Sorry, I've only briefly looked at Nutch, so you should ask on that mailing list. Lucene doesn't do deduping.
-Yonik
Now hiring -- http://tinyurl.com/7m67g

On 10/14/05, Michael Ji <[EMAIL PROTECTED]> wrote:
>
> hi Yonik:
>
> Does that mean when two documents have the same MD5 content
> in two different segments, IndexMerger.java will keep
> both of them?
>
> When I look at the code of IndexSegment.java, it
> handles MD5 deduping by keeping the one with the higher
> document ID.
>
