Re: dealing with duplicates

Otis Gospodnetic Fri, 31 Jul 2009 23:33:16 -0700

Joe,

Maybe we can take a step back first.  Would it be better if your index were 
cleaner and didn't contain flagged duplicates in the first place?  If so, have 
you tried the approach described at http://wiki.apache.org/solr/Deduplication ?
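
The general idea on that page is to compute a signature from selected fields 
at index time, so that documents with the same signature never pile up in the 
index.  A rough Python sketch of the idea only, not Solr's actual 
implementation: the function names, the field list and the choice of MD5 are 
assumptions made for illustration.

```python
import hashlib


def signature(doc, fields):
    """Hash the chosen fields into one hex digest (MD5 is an illustrative choice)."""
    # Join with a separator unlikely to occur in the text, so that
    # ("ab", "c") and ("a", "bc") do not produce the same signature.
    joined = "\x1f".join(str(doc.get(f, "")) for f in fields)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def drop_duplicates(docs, fields):
    """Keep only the first document seen for each signature."""
    seen = set()
    kept = []
    for doc in docs:
        sig = signature(doc, fields)
        if sig not in seen:
            seen.add(sig)
            kept.append(doc)
    return kept
```

An exact hash like this only catches documents whose chosen fields are 
identical.  Since your cluster members can differ a lot in their text, you 
would probably need a fuzzier signature or your own clustering key as the 
signature field.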


 Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR



----- Original Message ----
> From: Joe Calderon <calderon....@gmail.com>
> To: solr-user@lucene.apache.org
> Sent: Friday, July 31, 2009 5:06:48 PM
> Subject: dealing with duplicates
> 
> hello all, i have a collection of a few million documents, many of
> them duplicates. they have been clustered with a simple algorithm. i
> have a field called 'duplicate' which is 0 or 1, and text fields
> called 'description', 'tags' and 'meta'. documents are clustered on
> different criteria, and the text i search against could be very
> different among members of a cluster.
> 
> i'm currently using a dismax handler to search across the text fields
> with different boosts, and a filter query to restrict results to
> masters (duplicate:0)
> 
> my question is then, how do i best query for documents which are
> masters OR match text but are not included in the matched set of
> masters?
> 
> does this make sense?
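
For reference, the setup described in the quoted message (dismax over the 
text fields plus a filter query on masters) could be expressed as request 
parameters roughly like this.  A sketch only: the boost values and the helper 
name are made up for illustration, while defType, qf and fq are standard Solr 
request parameters.

```python
from urllib.parse import urlencode


def build_master_query(terms):
    """Build a query string for a dismax search restricted to masters."""
    params = [
        ("q", terms),
        ("defType", "dismax"),
        # Hypothetical boosts; the original message does not give the values.
        ("qf", "description^2.0 tags^1.5 meta^1.0"),
        # Filter query: only documents flagged as masters.
        ("fq", "duplicate:0"),
    ]
    return urlencode(params)
```

Calling build_master_query("red shoes") gives a URL-encoded string that can 
be appended to the select handler URL.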
