[jira] Commented: (SOLR-799) Add support for hash based exact/near duplicate document handling

Otis Gospodnetic (JIRA) Tue, 14 Oct 2008 08:42:36 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12639456#action_12639456
 ]


Otis Gospodnetic commented on SOLR-799:
---------------------------------------

Haven't looked at the patch yet.
Have looked at the Deduplication wiki page (and realize the stuff I'll write 
below is briefly mentioned there).
Have skimmed the above comments.

I want to bring up a use case that seems to have been mentioned already, but 
only in passing.  The focus of the previous comments seems to be on index-time 
duplicate detection.  Another huge use case is search-time near-duplicate 
detection.  Sometimes it's about straightforward field collapsing (collapsing 
adjacent docs with identical values in some field), but sometimes it's more 
complicated.

For example, sometimes multiple fields need to be compared.  Sometimes they 
have to be identical for collapsing to happen.  Sometimes they only need to be 
"similar".  How similarity is calculated is very application-dependent.  I 
believe this similarity computation has to be completely 
open/extensible/overridable, allowing one to write a custom search component, 
examine returned hits and compare them using app-specific similarity....

Ideally one would have the option not to save the document/field at index-time 
(for examination at search-time), since that prevents one from experimenting 
and dynamically changing the similarity computation.

Here is one example.  Imagine a field called "IDs" that can have 1 or more 
tokens in it and imagine docs with the following "IDs" get returned:

1) id:aaa
2) id:bbb
3) id:ccc ddd
4) id:aaa bbb
5) id:eee ddd
6) id:aaa

A custom similarity may look at all of the above (e.g. a page's worth of hits) 
and decide that:
1) and 4) are similar
2) and 4) are also similar
3) and 5) are similar
1) and 4) and 6) are similar

Another custom similarity may say that only 1) and 6) are similar because they 
are identical.
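
To make the example above concrete, here is a minimal sketch in plain Python 
(not Solr code) of two interchangeable similarity functions applied to one 
page of hits.  The names HITS, shares_token, identical and similar_pairs are 
hypothetical and used only for illustration; a real implementation would 
live in a custom search component.

```python
from itertools import combinations
from typing import Callable, Dict, FrozenSet, List, Tuple

# Hypothetical page of hits: hit number -> tokens of its "IDs" field,
# mirroring the six example docs listed above.
HITS: Dict[int, FrozenSet[str]] = {
    1: frozenset({"aaa"}),
    2: frozenset({"bbb"}),
    3: frozenset({"ccc", "ddd"}),
    4: frozenset({"aaa", "bbb"}),
    5: frozenset({"eee", "ddd"}),
    6: frozenset({"aaa"}),
}

# An app-supplied similarity is just a predicate over two token sets.
Similarity = Callable[[FrozenSet[str], FrozenSet[str]], bool]


def shares_token(a: FrozenSet[str], b: FrozenSet[str]) -> bool:
    """Loose similarity: the two docs have at least one ID token in common."""
    return not a.isdisjoint(b)


def identical(a: FrozenSet[str], b: FrozenSet[str]) -> bool:
    """Strict similarity: the two docs have exactly the same ID tokens."""
    return a == b


def similar_pairs(
    hits: Dict[int, FrozenSet[str]], is_similar: Similarity
) -> List[Tuple[int, int]]:
    """Compare every pair of hits on the page with the supplied similarity."""
    return [
        (i, j)
        for i, j in combinations(sorted(hits), 2)
        if is_similar(hits[i], hits[j])
    ]


if __name__ == "__main__":
    print(similar_pairs(HITS, shares_token))  # [(1, 4), (1, 6), (2, 4), (3, 5), (4, 6)]
    print(similar_pairs(HITS, identical))     # [(1, 6)]
```

Swapping the predicate changes which hits collapse without touching the 
index, which is the kind of search-time flexibility described above.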

My point is really that we have to leave it up to the application to provide 
the similarity implementation, just like we make it possible for the app to 
provide a custom Lucene Similarity.

Is the goal of this issue to make this possible?


> Add support for hash based exact/near duplicate document handling
> -----------------------------------------------------------------
>
>                 Key: SOLR-799
>                 URL: https://issues.apache.org/jira/browse/SOLR-799
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Mark Miller
>            Priority: Minor
>         Attachments: SOLR-799.patch
>
>
> Hash based duplicate document detection is efficient and allows for blocking 
> as well as field collapsing. Lets put it into solr. 
> http://wiki.apache.org/solr/Deduplication

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
