[ 
https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12637649#action_12637649
 ] 

Andrzej Bialecki  commented on SOLR-799:
----------------------------------------

Interesting development in light of NUTCH-442 :) Some comments:

* in MD5Signature I suggest using the code from 
org.apache.hadoop.io.MD5Hash.toString() instead of BigInteger.

* TextProfileSignature should contain a remark that it's copied from Nutch, 
since AFAIK the algorithm that it implements is currently used only in Nutch.

* in Nutch the concept of a page Signature is only a part of the deduplication 
process. The other part is the algorithm to decide which copy to keep and which 
one to discard. In your patch the latest update always removes all other 
documents with the same signature. IMHO this decision should be isolated into a 
DuplicateDeletePolicy class that gets all duplicates and can decide (based on 
arbitrary criteria) which one to keep, with a default implementation that 
simply keeps the latest document.
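
To make the last point concrete, here is a rough sketch of how such a policy 
could be separated from signature computation. Only the name 
DuplicateDeletePolicy comes from the comment above; the Doc class, the 
selectSurvivor() and idsToDelete() methods, and the indexedAt field are 
hypothetical and are not taken from the SOLR-799 patch or from Nutch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these types exist in the SOLR-799 patch.
final class DedupSketch {

    /** Minimal stand-in for an indexed document (assumed fields). */
    static final class Doc {
        final String id;
        final String signature;
        final long indexedAt; // assumed: time of the update, e.g. epoch millis

        Doc(String id, String signature, long indexedAt) {
            this.id = id;
            this.signature = signature;
            this.indexedAt = indexedAt;
        }
    }

    /** Receives all documents sharing one signature and picks the one to keep. */
    interface DuplicateDeletePolicy {
        Doc selectSurvivor(List<Doc> duplicates);
    }

    /** Default behaviour suggested above: the latest document wins. */
    static final class KeepLatestPolicy implements DuplicateDeletePolicy {
        @Override
        public Doc selectSurvivor(List<Doc> duplicates) {
            Doc latest = duplicates.get(0);
            for (Doc d : duplicates) {
                if (d.indexedAt > latest.indexedAt) {
                    latest = d;
                }
            }
            return latest;
        }
    }

    /** Returns the ids of every duplicate the policy did not choose to keep. */
    static List<String> idsToDelete(List<Doc> duplicates, DuplicateDeletePolicy policy) {
        List<String> toDelete = new ArrayList<>();
        if (duplicates.isEmpty()) {
            return toDelete;
        }
        Doc keep = policy.selectSurvivor(duplicates);
        for (Doc d : duplicates) {
            if (d != keep) {
                toDelete.add(d.id);
            }
        }
        return toDelete;
    }
}
```

Because the interface has a single method, a different rule (for example, 
keeping the oldest copy) could be supplied as a lambda without touching the 
code that computes signatures or issues the deletes.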

> Add support for hash based exact/near duplicate document handling
> -----------------------------------------------------------------
>
>                 Key: SOLR-799
>                 URL: https://issues.apache.org/jira/browse/SOLR-799
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Mark Miller
>            Priority: Minor
>         Attachments: SOLR-799.patch
>
>
> Hash-based duplicate document detection is efficient and allows for blocking 
> as well as field collapsing. Let's put it into Solr. 
> http://wiki.apache.org/solr/Deduplication

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
