Nutch, SolrCloud and deduplication

Rafał Kuć Wed, 04 Dec 2013 06:09:45 -0800

Hello!

We are running Nutch with a SolrCloud cluster, and we index our data
into SolrCloud. The problem is that we have duplicates and we would
like to get rid of them. To put it simply, we would like to remove
documents with the same content (their URL addresses can differ
slightly).

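To make the goal concrete, here is a minimal Python sketch of what we
mean by duplicates. It is only an illustration: the `url` and `content`
field names and both helper functions are hypothetical, not part of
Nutch or Solr, and it only catches documents whose text is identical
after whitespace and case normalization.

```python
import hashlib
import re


def content_signature(text):
    """Return a hash of the text after lowercasing and collapsing whitespace."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def drop_duplicates(docs):
    """Keep the first document seen for each content signature."""
    seen = set()
    unique = []
    for doc in docs:
        sig = content_signature(doc["content"])  # "content" is an assumed field name
        if sig not in seen:
            seen.add(sig)
            unique.append(doc)
    return unique


# Hypothetical documents: the first two differ only in URL and formatting.
docs = [
    {"url": "http://example.com/page", "content": "Hello  world"},
    {"url": "http://example.com/page?ref=1", "content": "hello world"},
    {"url": "http://example.com/other", "content": "Something else"},
]
unique = drop_duplicates(docs)
```

Here the second document would be dropped because its normalized
content matches the first, even though the URL is different.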

I know that there is Solr-based deduplication that would work here
(http://wiki.apache.org/solr/Deduplication), but the problem is that
it doesn't work with SolrCloud.

Maybe there is another way that would allow us to remove duplicates
based on the content of the documents?

Any help would be appreciated.

-- 
Regards,
 Rafał Kuć
 Sematext :: http://sematext.com/ :: Solr - Lucene - ElasticSearch
