[
https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12639456#action_12639456
]
Otis Gospodnetic commented on SOLR-799:
---------------------------------------
Haven't looked at the patch yet.
Have looked at the Deduplication wiki page (and realize the stuff I'll write
below is briefly mentioned there).
Have skimmed the above comments.
I want to bring up the use case that seems to have been mentioned already, but
only in passing. The focus of the previous comments seems to be on index-time
duplication detection. Another huge use case is search-time near-duplicate
detection. Sometimes it's about straightforward field collapsing (collapsing
adjacent docs with identical values in some field), but sometimes it's more
complicated.
For example, sometimes multiple fields need to be compared. Sometimes they
have to be identical for collapsing to happen. Sometimes they only need to be
"similar". How similarity is calculated is very application-dependent. I
believe this similarity computation has to be completely
open/extensible/overridable, allowing one to write a custom search component,
examine returned hits and compare them using app-specific similarity.
Ideally one would have the option not to save the document/field at index time
(for examination at search time), since doing so prevents one from
experimenting with, and dynamically changing, the similarity computation.
Here is one example. Imagine a field called "IDs" that can have 1 or more
tokens in it and imagine docs with the following "IDs" get returned:
1) id:aaa
2) id:bbb
3) id:ccc ddd
4) id:aaa bbb
5) id:eee ddd
6) id:aaa
A custom similarity may look at all of the above (e.g. a page's worth of hits)
and decide that:
1) and 4) are similar
2) and 4) are also similar
3) and 5) are similar
1) and 4) and 6) are similar
Another custom similarity may say that only 1) and 6) are similar because they
are identical.
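To make the example above concrete, here is a minimal Python sketch of a
pluggable similarity applied to one page of hits. It is not taken from the
patch and is not a Solr API: the names `shares_token`, `identical`, and
`similar_pairs` are hypothetical, and treating each "IDs" value as a set of
whitespace-separated tokens is an assumption made for illustration.

```python
from itertools import combinations
from typing import Callable, FrozenSet, List, Tuple

Tokens = FrozenSet[str]
# Hypothetical plug-in point: the application supplies this function.
SimilarityFn = Callable[[Tokens, Tokens], bool]


def shares_token(a: Tokens, b: Tokens) -> bool:
    """One possible similarity: two hits are similar if they share any token."""
    return bool(a & b)


def identical(a: Tokens, b: Tokens) -> bool:
    """A stricter similarity: two hits are similar only if their tokens match exactly."""
    return a == b


def similar_pairs(hits: List[Tokens], is_similar: SimilarityFn) -> List[Tuple[int, int]]:
    """Compare every pair of hits on the page; return 1-based positions of similar pairs."""
    numbered = enumerate(hits, start=1)
    return [(i, j) for (i, a), (j, b) in combinations(numbered, 2) if is_similar(a, b)]


# The six "IDs" values from the example, in the order they were returned.
HITS = [frozenset(s.split()) for s in
        ["aaa", "bbb", "ccc ddd", "aaa bbb", "eee ddd", "aaa"]]

print(similar_pairs(HITS, shares_token))
print(similar_pairs(HITS, identical))
```

With `shares_token` the similar pairs are (1, 4), (1, 6), (2, 4), (3, 5) and
(4, 6), which corresponds to the first grouping above; with `identical` only
(1, 6) remains. Swapping the function changes the result without touching the
index, which is the kind of search-time experimentation I have in mind.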
My point is really that we have to leave it up to the application to provide
the similarity implementation, just as we make it possible for the app to provide
a custom Lucene Similarity.
Is the goal of this issue to make this possible?
> Add support for hash based exact/near duplicate document handling
> -----------------------------------------------------------------
>
> Key: SOLR-799
> URL: https://issues.apache.org/jira/browse/SOLR-799
> Project: Solr
> Issue Type: New Feature
> Components: update
> Reporter: Mark Miller
> Priority: Minor
> Attachments: SOLR-799.patch
>
>
> Hash based duplicate document detection is efficient and allows for blocking
> as well as field collapsing. Let's put it into Solr.
> http://wiki.apache.org/solr/Deduplication