[ https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12639456#action_12639456 ]
Otis Gospodnetic commented on SOLR-799:
---------------------------------------

Haven't looked at the patch yet. Have looked at the Deduplication wiki page (and realize the stuff I'll write below is briefly mentioned there). Have skimmed the above comments.

I want to bring up a use case that seems to have been mentioned already, but only in passing. The focus of the previous comments seems to be on index-time duplicate detection. Another huge use case is search-time near-duplicate detection. Sometimes it's about straightforward field collapsing (collapsing adjacent docs with identical values in some field), but sometimes it's more complicated. For example, sometimes multiple fields need to be compared. Sometimes they have to be identical for collapsing to happen. Sometimes they only need to be "similar". How similarity is calculated is very application-dependent. I believe this similarity computation has to be completely open/extensible/overridable, allowing one to write a custom search component, examine returned hits, and compare them using app-specific similarity.

Ideally one would have the option not to save the document/field at index-time (for examination at search-time), since that prevents one from experimenting and dynamically changing the similarity computation.

Here is one example. Imagine a field called "IDs" that can have 1 or more tokens in it, and imagine docs with the following "IDs" get returned:

1) id:aaa
2) id:bbb
3) id:ccc ddd
4) id:aaa bbb
5) id:eee ddd
6) id:aaa

A custom similarity may look at all of the above (e.g. a page's worth of hits) and decide that:

1) and 4) are similar
2) and 4) are also similar
3) and 5) are similar
1) and 4) and 6) are similar

Another custom similarity may say that only 1) and 6) are similar because they are identical.

My point is really that we have to leave it up to the application to provide the similarity implementation, just like we make it possible for the app to provide a custom Lucene Similarity.
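
The example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not part of the SOLR-799 patch and not a Solr API: the names HITS, shares_token, identical and similar_pairs are hypothetical, and "similar" is assumed to mean "shares at least one token" for the first similarity and "has exactly the same tokens" for the second.

```python
from itertools import combinations
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical page of hits: hit number -> tokens of the "IDs" field,
# copied from the example in the comment.
HITS: Dict[int, Set[str]] = {
    1: {"aaa"},
    2: {"bbb"},
    3: {"ccc", "ddd"},
    4: {"aaa", "bbb"},
    5: {"eee", "ddd"},
    6: {"aaa"},
}

# An application-supplied similarity: takes two token sets, returns True
# if the two hits should be treated as near-duplicates.
Similarity = Callable[[Set[str], Set[str]], bool]


def shares_token(a: Set[str], b: Set[str]) -> bool:
    """Assumed similarity #1: hits are similar if they share any token."""
    return bool(a & b)


def identical(a: Set[str], b: Set[str]) -> bool:
    """Assumed similarity #2: hits are similar only if their tokens match exactly."""
    return a == b


def similar_pairs(hits: Dict[int, Set[str]],
                  is_similar: Similarity) -> List[Tuple[int, int]]:
    """Compare every pair of hits on the page with the pluggable similarity."""
    return [
        (i, j)
        for i, j in combinations(sorted(hits), 2)
        if is_similar(hits[i], hits[j])
    ]


if __name__ == "__main__":
    print(similar_pairs(HITS, shares_token))
    print(similar_pairs(HITS, identical))
```

Because the similarity is just a function passed in at search time, swapping shares_token for identical changes the grouping without reindexing anything, which is the kind of experimentation argued for above.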

Is the goal of this issue to make this possible?

> Add support for hash based exact/near duplicate document handling
> -----------------------------------------------------------------
>
>                 Key: SOLR-799
>                 URL: https://issues.apache.org/jira/browse/SOLR-799
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Mark Miller
>            Priority: Minor
>         Attachments: SOLR-799.patch
>
>
> Hash based duplicate document detection is efficient and allows for blocking
> as well as field collapsing. Lets put it into solr.
> http://wiki.apache.org/solr/Deduplication

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.