Re: Customizing Solr Dedupe

Jack Krupansky Wed, 01 Apr 2015 04:07:02 -0700

Solr dedupe is based on the concept of a signature: a set of fields and rules
that reduce a document to a single discrete value. Solr then checks whether
that signature already exists as a document key, which can be looked up
quickly in the index. That's the conceptual basis. It is not based on any
kind of field-by-field comparison against all existing documents.
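
To make the idea concrete, here is a minimal Python sketch of signature-based
dedupe. It is an illustration only, not Solr's code: the names signature and
SignatureIndex are made up for this example, a plain dict stands in for the
index, and an MD5 hash over the raw field values stands in for whatever
signature class is configured. The point is that each incoming document costs
one key lookup, no matter how many documents are already stored.

```python
import hashlib


def signature(doc, fields):
    """Reduce the chosen fields of a document to one fixed-size hex digest.

    Assumption for illustration: an exact MD5 over the raw field values.
    A fuzzy signature would normalize the text before hashing.
    """
    h = hashlib.md5()
    for name in fields:
        # Separate names and values so ("ab", "c") and ("a", "bc") differ.
        h.update(name.encode("utf-8"))
        h.update(b"\0")
        h.update(str(doc.get(name, "")).encode("utf-8"))
        h.update(b"\0")
    return h.hexdigest()


class SignatureIndex:
    """Toy stand-in for an index whose document key is the signature."""

    def __init__(self, fields):
        self.fields = fields
        self.by_signature = {}

    def add(self, doc):
        """Store doc and return True, or return False if its signature exists."""
        sig = signature(doc, self.fields)
        if sig in self.by_signature:
            return False  # duplicate: found by one key lookup, no comparison
        self.by_signature[sig] = doc
        return True
```

Two documents count as duplicates here only when they reduce to the same
signature. A rule such as "90% of the lines match" compares one document
against another, which a single key lookup cannot express.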


-- Jack Krupansky

On Wed, Apr 1, 2015 at 6:35 AM, thakkar.aayush <thakkar.aay...@gmail.com>
wrote:

> I'm facing a challenge using de-duplication of Solr documents.
>
> De-duplication is done using TextProfileSignature with the following
> parameters:
> <str name="fields">field1, field2, field3</str>
> <str name="quantRate">0.5</str>
> <str name="minTokenLen">3</str>
>
> Here field3 is normal text with a few lines of data.
> field1 and field2 can each contain up to 5 or 6 words of data.
>
> I want to de-duplicate when the data in field1 and field2 is exactly the
> same and 90% of the lines in field3 match those in another document.
>
> Is there any way to achieve this?
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Customzing-Solr-Dedupe-tp4196879.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
