Nutch 1.8 is close to being released and has better built-in deduplication
than before. It now works independently of the indexing backend (see
NUTCH-656), so it will work with SolrCloud. Another method is to collapse on
the document's indexed digest, but that requires sharding by digest and
distributed stats to make up for the unbalanced shards; see SOLR-1632 for a
working patch.
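
To illustrate the general idea of digest-based deduplication, here is a rough
Python sketch. It is not the Nutch or Solr implementation: the function names,
the whitespace normalization, the choice of MD5, and the "keep the shortest
URL" tie-break are all assumptions made for the example.

```python
import hashlib
from collections import defaultdict


def content_digest(text):
    """Return a hex digest of the document content.

    Assumption: whitespace is collapsed first so that trivially different
    copies of the same text produce the same digest.
    """
    normalized = " ".join(text.split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def find_duplicates(docs):
    """Group (url, content) pairs by digest and pick one URL per group.

    Returns (keep, drop): URLs to retain and URLs to delete from the index.
    Hypothetical tie-break: keep the shortest URL, then the lexically first.
    """
    groups = defaultdict(list)
    for url, content in docs:
        groups[content_digest(content)].append(url)

    keep, drop = [], []
    for urls in groups.values():
        ordered = sorted(urls, key=lambda u: (len(u), u))
        keep.append(ordered[0])
        drop.extend(ordered[1:])
    return keep, drop


docs = [
    ("http://example.com/a", "Hello  world"),
    ("http://example.com/a?x=1", "Hello world"),
    ("http://example.com/b", "Other page"),
]
keep, drop = find_duplicates(docs)
```

Here the two URLs with the same content fall into one digest group, so only
one of them is kept and the other is returned in drop for deletion.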

-----Original message-----
> From:Rafał Kuć <[email protected]>
> Sent: Wednesday 4th December 2013 15:09
> To: [email protected]
> Subject: Nutch, SolrCloud and deduplication
> 
> Hello!
> 
> We are running a Nutch and SolrCloud cluster - we index our data to
> SolrCloud. The problem is that we have duplicates and we would like to
> get rid of those. To put it simply, we would like to remove documents
> with the same content (they can have slightly different URL
> addresses).
> 
> I know that there is Solr based deduplication that would work here
> (http://wiki.apache.org/solr/Deduplication), but the problem is that
> it doesn't work with SolrCloud.
> 
> Maybe there is another way that would allow us to remove duplicates
> based on the content of the documents?
> 
> Any help would be appreciated.
> 
> -- 
> Regards,
>  Rafał Kuć
>  Sematext :: http://sematext.com/ :: Solr - Lucene - ElasticSearch
> 
> 
