Joe,

Maybe we can take a step back first. Would it be better if your index was cleaner and didn't have flagged duplicates in the first place? If so, have you tried using http://wiki.apache.org/solr/Deduplication ?
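
To make the idea concrete, here is a minimal sketch of what signature-based deduplication amounts to: hash a chosen set of fields and keep one document per hash. This is an illustration only, not Solr's implementation. The helper names `signature` and `dedupe`, the light normalization, and the sample documents are assumptions made for the example; the field names come from your message. Note that this sketch keeps the first document seen per signature, and that an exact hash only catches near-identical text, so it may not cover clusters whose members differ a lot.

```python
import hashlib


def signature(doc, fields=("description", "tags", "meta")):
    """Hash the chosen fields into one hex digest (hypothetical helper)."""
    h = hashlib.md5()
    for name in fields:
        # Lowercase and collapse whitespace so trivial differences hash alike.
        value = " ".join(str(doc.get(name, "")).lower().split())
        h.update(name.encode("utf-8"))
        h.update(b"\x00")
        h.update(value.encode("utf-8"))
        h.update(b"\x00")
    return h.hexdigest()


def dedupe(docs):
    """Keep the first document seen for each signature."""
    seen = {}
    for doc in docs:
        seen.setdefault(signature(doc), doc)
    return list(seen.values())


# Made-up sample documents, for illustration only.
docs = [
    {"id": 1, "description": "Red  bike", "tags": "sale", "meta": ""},
    {"id": 2, "description": "red bike", "tags": "sale", "meta": ""},
    {"id": 3, "description": "Blue bike", "tags": "sale", "meta": ""},
]
unique = dedupe(docs)
print([d["id"] for d in unique])
```

With something like this applied before or during indexing, the index would hold only one document per signature and the `duplicate` flag would no longer be needed at query time.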
Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR

----- Original Message ----
> From: Joe Calderon <calderon....@gmail.com>
> To: solr-user@lucene.apache.org
> Sent: Friday, July 31, 2009 5:06:48 PM
> Subject: dealing with duplicates
>
> hello all, i have a collection of a few million documents; i have many
> duplicates in this collection. they have been clustered with a simple
> algorithm, i have a field called 'duplicate' which is 0 or 1 and a
> fields called 'description, tags, meta', documents are clustered on
> different criteria and the text i search against could be very
> different among members of a cluster.
>
> im currently using a dismax handler to search across the text fields
> with different boosts, and a filter query to restrict to masters
> (duplicate: 0)
>
> my question is then, how do i best query for documents which are
> masters OR match text but are not included in the matched set of
> masters?
>
> does this make sense?
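
For reference, the setup described in the quoted message (a dismax search across the text fields with boosts, plus a filter query restricting results to masters) might look roughly like the request parameters below. This is a sketch under assumptions: the boost values, the `/select` path, and the helper name `build_params` are hypothetical, and only the field names and the `duplicate:0` filter come from the message.

```python
from urllib.parse import urlencode


def build_params(user_query):
    """Build request parameters for a dismax search limited to masters."""
    return {
        "defType": "dismax",
        "q": user_query,
        # Boost values are placeholders; the message does not state them.
        "qf": "description^2.0 tags^1.5 meta^1.0",
        # Filter query restricting results to masters.
        "fq": "duplicate:0",
        "wt": "json",
    }


query_string = urlencode(build_params("red bike"))
print("/select?" + query_string)
```

This covers only the "masters that match" half of the question; matching non-masters whose cluster master did not match would need some cluster identifier in the index, which the message does not mention.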