[jira] Commented: (SOLR-799) Add support for hash based exact/near duplicate document handling

Grant Ingersoll (JIRA) Wed, 08 Oct 2008 08:38:36 -0700

    [ https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12637965#action_12637965 ]


Grant Ingersoll commented on SOLR-799:
--------------------------------------

Haven't looked at the patch, but I agree that it is wise to separate the 
detection of duplication from the handling of found duplicates.  The default 
can be to remove all as in the patch, but it should be easy to override.  
Scenarios I can see being useful:
1. Prevent the new insert.
2. Remove the old document (i.e. the same way an update works now).
3. Note the duplicate on the existing document in a "duplicates" field.  This 
obviously requires either deleting and re-adding the doc, or better Lucene 
support for appending/updating fields, maybe via the column-stride payloads (if 
that ever happens).  No need for this anytime soon.
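
A minimal sketch of that separation, assuming a toy in-memory index keyed by 
document id.  The names here (DedupSketch, Doc, DuplicateHandler, 
findDuplicate, and the three handler constants) are hypothetical and are not 
taken from the patch or from Solr's API; the signature is just an opaque 
string standing in for whatever hash the detection step computes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DedupSketch {

    /** Hypothetical in-memory document: an id, a content hash, and noted duplicates. */
    static final class Doc {
        final String id;
        final String signature;
        final List<String> duplicates = new ArrayList<>();

        Doc(String id, String signature) {
            this.id = id;
            this.signature = signature;
        }
    }

    /** Handling step: decides what happens once detection has found a match. */
    interface DuplicateHandler {
        void handle(Map<String, Doc> index, Doc existing, Doc incoming);
    }

    // Scenario 1: prevent the new insert; the index is left untouched.
    static final DuplicateHandler REJECT_NEW = (index, existing, incoming) -> { };

    // Scenario 2: remove the old document and add the new one.
    static final DuplicateHandler REPLACE_OLD = (index, existing, incoming) -> {
        index.remove(existing.id);
        index.put(incoming.id, incoming);
    };

    // Scenario 3: keep the existing document and record the duplicate's id on it.
    static final DuplicateHandler NOTE_ON_EXISTING = (index, existing, incoming) ->
            existing.duplicates.add(incoming.id);

    /** Detection step: returns the indexed doc with the same signature, or null. */
    static Doc findDuplicate(Map<String, Doc> index, Doc incoming) {
        for (Doc d : index.values()) {
            if (d.signature.equals(incoming.signature)) {
                return d;
            }
        }
        return null;
    }

    /** Adds a document; the handler is only consulted when a duplicate is detected. */
    static void add(Map<String, Doc> index, Doc incoming, DuplicateHandler handler) {
        Doc existing = findDuplicate(index, incoming);
        if (existing == null) {
            index.put(incoming.id, incoming);
        } else {
            handler.handle(index, existing, incoming);
        }
    }

    public static void main(String[] args) {
        Map<String, Doc> index = new LinkedHashMap<>();
        add(index, new Doc("a", "hash-1"), NOTE_ON_EXISTING);
        add(index, new Doc("b", "hash-1"), NOTE_ON_EXISTING); // same signature as "a"
        System.out.println(index.keySet() + " duplicates of a: " + index.get("a").duplicates);
    }
}
```

Because add() only calls the handler after detection has run, swapping the 
default for another policy would not touch the hashing or lookup code.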


> Add support for hash based exact/near duplicate document handling
> -----------------------------------------------------------------
>
>                 Key: SOLR-799
>                 URL: https://issues.apache.org/jira/browse/SOLR-799
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Mark Miller
>            Priority: Minor
>         Attachments: SOLR-799.patch
>
>
> Hash based duplicate document detection is efficient and allows for blocking 
> as well as field collapsing. Let's put it into Solr. 
> http://wiki.apache.org/solr/Deduplication

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
