dealing with duplicates

Joe Calderon Fri, 31 Jul 2009 14:07:18 -0700

hello all, i have a collection of a few million documents, many of which
are duplicates. they have been clustered with a simple algorithm: a field
called 'duplicate' is 0 for a cluster's master and 1 for its duplicates.
the text fields are 'description', 'tags' and 'meta'. documents are
clustered on different criteria, and the text i search against can be
very different among members of a cluster.


i'm currently using a dismax handler to search across the text fields
with different boosts, and a filter query to restrict results to masters
(duplicate:0)
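
for reference, the current setup could be sketched roughly as below. this is only an illustration of the request parameters: the boost values and the helper name build_master_query are made up for the example, and it assumes the standard Solr parameters defType, q, qf and fq rather than any particular handler configuration.

```python
from urllib.parse import urlencode


def build_master_query(text, boosts=None):
    """Build a Solr query string that searches the text fields with dismax
    and restricts results to cluster masters via a filter query.

    The default boosts are hypothetical; only the field names come from
    the description above.
    """
    if boosts is None:
        boosts = {"description": 2.0, "tags": 1.5, "meta": 1.0}

    # dismax expects query fields as "field^boost" pairs separated by spaces
    qf = " ".join(f"{field}^{boost}" for field, boost in boosts.items())

    params = {
        "defType": "dismax",   # use the dismax query parser
        "q": text,             # the user's search text
        "qf": qf,              # boosted text fields
        "fq": "duplicate:0",   # filter query: masters only
    }
    return urlencode(params)


if __name__ == "__main__":
    # append the result to the select URL of your own Solr instance
    print(build_master_query("red shoes"))
```

because fq is applied as a filter and does not affect scoring, any document with duplicate:1 is dropped here even when its own text matches, which is the limitation the question below is about.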

my question is then, how do i best query for documents which are
masters OR match text but are not included in the matched set of
masters?

does this make sense?
