Doug Cutting wrote:
Shailesh Kochhar wrote:
I'm not very familiar with the Nutch API, though I know there's an MD5
signature-based deduping method in place and a Signature class to
extend for offline duplicate detection. I was wondering if anyone had
tried search-time deduping and what would be good places to try and
implement it.
Nutch already does search-time deduping. By default it limits things to
two hits per host, but you can dedup by other fields and with other
per-dup counts. This is available through NutchBean:
http://lucene.apache.org/nutch/apidocs/org/apache/nutch/searcher/NutchBean.html#search(org.apache.nutch.searcher.Query,%20int,%20int,%20java.lang.String)
and through the OpenSearch servlet.
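
To illustrate the idea of "at most N hits per value of one dedup field", here is a rough, self-contained sketch in plain Java. It is not Nutch's implementation and does not call the Nutch API: the Hit class, the dedup method and its parameter names are hypothetical, loosely modelled on the (Query, int, int, String) shape of the search method linked above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

class SearchTimeDedup {
    /** Hypothetical stand-in for a search hit; not a Nutch class. */
    static final class Hit {
        final String url;
        final String host;

        Hit(String url, String host) {
            this.url = url;
            this.host = host;
        }
    }

    /**
     * Walks hits in ranked order and keeps at most maxHitsPerDup hits
     * for each distinct value of the dedup key, stopping once numHits
     * hits have been kept.
     */
    static <K> List<Hit> dedup(List<Hit> ranked, Function<Hit, K> dedupKey,
                               int maxHitsPerDup, int numHits) {
        Map<K, Integer> seen = new HashMap<>();
        List<Hit> kept = new ArrayList<>();
        for (Hit hit : ranked) {
            if (kept.size() >= numHits) {
                break;
            }
            // Count every hit seen for this key, kept or not.
            int count = seen.merge(dedupKey.apply(hit), 1, Integer::sum);
            if (count <= maxHitsPerDup) {
                kept.add(hit);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Hit> ranked = new ArrayList<>();
        ranked.add(new Hit("http://a.example/1", "a.example"));
        ranked.add(new Hit("http://a.example/2", "a.example"));
        ranked.add(new Hit("http://a.example/3", "a.example"));
        ranked.add(new Hit("http://b.example/1", "b.example"));

        // Two hits per host, mirroring the default described above.
        for (Hit hit : dedup(ranked, h -> h.host, 2, 10)) {
            System.out.println(hit.url);
        }
    }
}
```

Passing a different key function (for example one returning a content signature) changes what counts as a duplicate, which is the sense in which the dedup field is pluggable in this sketch.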
If I understand this correctly, you can only dedup by one field. This
would mean that if you were to implement and use content-based
deduplication, you'd have to give up limiting the number of hits per host.
Is this correct, or did I miss something?
- Shailesh
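
To make the concern above concrete: if only one dedup field can be given per search, one conceivable workaround is to post-filter the ranked hits on the client side against two limits at once. The sketch below is only that, a sketch under assumptions. It is not a Nutch feature; the Hit class, its signature field and the filter method are hypothetical names invented for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MultiFieldDedup {
    /** Hypothetical hit carrying both a host and a content signature. */
    static final class Hit {
        final String url;
        final String host;
        final String signature;

        Hit(String url, String host, String signature) {
            this.url = url;
            this.host = host;
            this.signature = signature;
        }
    }

    /**
     * Keeps a hit only if both its host and its content signature are
     * still under their respective limits. Only kept hits are counted.
     */
    static List<Hit> filter(List<Hit> ranked, int maxPerHost, int maxPerSignature) {
        Map<String, Integer> hostCounts = new HashMap<>();
        Map<String, Integer> signatureCounts = new HashMap<>();
        List<Hit> kept = new ArrayList<>();
        for (Hit hit : ranked) {
            int hostCount = hostCounts.getOrDefault(hit.host, 0);
            int signatureCount = signatureCounts.getOrDefault(hit.signature, 0);
            if (hostCount >= maxPerHost || signatureCount >= maxPerSignature) {
                continue; // over one of the two limits, skip this hit
            }
            hostCounts.put(hit.host, hostCount + 1);
            signatureCounts.put(hit.signature, signatureCount + 1);
            kept.add(hit);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Hit> ranked = new ArrayList<>();
        ranked.add(new Hit("http://a.example/1", "a.example", "sig1"));
        ranked.add(new Hit("http://b.example/1", "b.example", "sig1"));
        ranked.add(new Hit("http://a.example/2", "a.example", "sig2"));

        // Two hits per host, one hit per content signature.
        for (Hit hit : filter(ranked, 2, 1)) {
            System.out.println(hit.url);
        }
    }
}
```

A filter like this would have to over-fetch from the searcher, since hits dropped on the client side reduce the number of results left to display.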