Hello! We are running Nutch with a SolrCloud cluster and index our data into SolrCloud. The problem is that we end up with duplicates and would like to get rid of them. To put it simply, we would like to remove documents that have the same content (they can have slightly different URLs).
I know that there is Solr-based deduplication that would work here (http://wiki.apache.org/solr/Deduplication), but the problem is that it doesn't work with SolrCloud. Is there another way to remove duplicates based on the content of the documents? Any help would be appreciated.

--
Regards,
Rafał Kuć
Sematext :: http://sematext.com/ :: Solr - Lucene - ElasticSearch

