[jira] Commented: (SOLR-799) Add support for hash based exact/near duplicate document handling

Hoss Man (JIRA) Thu, 09 Oct 2008 16:11:38 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12638426#action_12638426
 ]


Hoss Man commented on SOLR-799:
-------------------------------

(disclaimer: haven't looked at the patch)

bq. Though in some implementations (like #2, which may be the default), 
detecting that duplicate and handling it are truly coupled... forcing a 
decoupling would not be a good thing in that case.

I don't follow your reasoning.  All the use cases I've seen mentioned seem like 
they could/would decouple very nicely...

1. Prevent new insert -- SignatureUpdateProcessor generates a signature and 
adds it as a field; AbortIfExistingUpdateProcessor aborts the update if a doc 
exists with a specific field in common with the doc to be added.
2. Remove old (i.e. same as an update works now) -- SignatureUpdateProcessor as 
mentioned before, and signature field is used as the uniqueKey field.
3. Note the duplicate on the existing document in a "duplicates" field -- 
SignatureUpdateProcessor as mentioned before; AnnotateDuplicatesProcessor 
checks for existing docs with a specific field in common with the doc to be 
added and executes additional operations to "update" those docs, as well as 
the doc to be added.
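
To make the split in use case 1 concrete, here is a rough sketch of detection and handling as two independent chain stages. It deliberately does not use Solr's real update processor API: the Processor interface, the in-memory List standing in for the index, the "signature" field name, and the choice of MD5 are all assumptions made for illustration, and only the two processor names come from the comment above.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DedupChainSketch {

    /** Hypothetical stand-in for one stage of an update processor chain. */
    interface Processor {
        /** Returns false to abort the add, true to pass the doc along. */
        boolean process(Map<String, String> doc, List<Map<String, String>> index);
    }

    /** Detection only: hash the chosen fields and store the result as a field. */
    static class SignatureProcessor implements Processor {
        private final String[] sourceFields;

        SignatureProcessor(String... sourceFields) {
            this.sourceFields = sourceFields;
        }

        public boolean process(Map<String, String> doc, List<Map<String, String>> index) {
            StringBuilder sb = new StringBuilder();
            for (String f : sourceFields) {
                sb.append(doc.getOrDefault(f, "")).append('\n');
            }
            doc.put("signature", md5Hex(sb.toString())); // assumed field name
            return true; // never aborts; it only annotates the doc
        }
    }

    /** Handling only: abort if an indexed doc shares the given field value. */
    static class AbortIfExistingProcessor implements Processor {
        private final String field;

        AbortIfExistingProcessor(String field) {
            this.field = field;
        }

        public boolean process(Map<String, String> doc, List<Map<String, String>> index) {
            String value = doc.get(field);
            for (Map<String, String> existing : index) {
                if (value != null && value.equals(existing.get(field))) {
                    return false;
                }
            }
            return true;
        }
    }

    /** Hex-encoded MD5 of a string (MD5 is an arbitrary choice for the sketch). */
    static String md5Hex(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(s.getBytes(StandardCharsets.UTF_8))) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Runs the chain in order; the doc is indexed only if no stage aborts. */
    static boolean add(Map<String, String> doc, List<Map<String, String>> index,
                       Processor... chain) {
        for (Processor p : chain) {
            if (!p.process(doc, index)) {
                return false;
            }
        }
        index.add(doc);
        return true;
    }

    public static void main(String[] args) {
        List<Map<String, String>> index = new ArrayList<>();
        Processor sig = new SignatureProcessor("body");
        Processor abort = new AbortIfExistingProcessor("signature");

        Map<String, String> first = new HashMap<>();
        first.put("id", "1");
        first.put("body", "hello world");
        Map<String, String> dup = new HashMap<>();
        dup.put("id", "2");
        dup.put("body", "hello world");

        System.out.println(add(first, index, sig, abort)); // accepted
        System.out.println(add(dup, index, sig, abort));   // aborted as a duplicate
    }
}
```

Because the second stage only looks at a named field, it knows nothing about how the signature was produced. Under the same assumptions, use case 2 would drop the second stage and key the index on the signature field, and use case 3 would swap in a different second stage.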


> Add support for hash based exact/near duplicate document handling
> -----------------------------------------------------------------
>
>                 Key: SOLR-799
>                 URL: https://issues.apache.org/jira/browse/SOLR-799
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Mark Miller
>            Priority: Minor
>         Attachments: SOLR-799.patch
>
>
> Hash based duplicate document detection is efficient and allows for blocking 
> as well as field collapsing. Let's put it into Solr. 
> http://wiki.apache.org/solr/Deduplication

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
