[ 
https://issues.apache.org/jira/browse/SOLR-236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12525761
 ] 

Brian Mertens commented on SOLR-236:
------------------------------------

Imagine a case where a Solr database contains news stories from many newspapers 
and some wire services.

A single wire story will typically be picked up and reprinted in many different 
papers, ranging from national papers like the NYTimes, to small town papers. My 
database will have all of them, and possibly also the original from the wire 
service. Each paper will choose their own headline, and will edit the story 
differently for length to fill a hole on the printed page, so they cannot be 
trivially detected as duplicates, but to my users, they basically are.

I need to detect and group together these "duplicates" when displaying search 
results.

So let's say every story has had an integer hash value calculated from the 
first X words of the lead paragraph, and that value is indexed and stored (e.g. 
as "similarity_hash"), as a way to detect duplicate stories.
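
As a rough illustration of the kind of hash I mean, here is a minimal Python 
sketch. The normalization (lowercasing, dropping punctuation), the default of 
20 words, and the use of CRC32 are all assumptions made for the example, not 
something Solr provides; any stable integer hash would do.

```python
import re
import zlib


def similarity_hash(lead_paragraph: str, num_words: int = 20) -> int:
    """Return a stable integer hash of the first `num_words` words.

    Hypothetical helper: the word count and normalization are
    illustrative choices, not part of Solr or the SOLR-236 patch.
    """
    # Lowercase and keep only alphanumeric runs, so punctuation and
    # capitalization differences between papers do not change the hash.
    words = re.findall(r"[a-z0-9]+", lead_paragraph.lower())
    key = " ".join(words[:num_words])
    # CRC32 is deterministic across runs, unlike Python's built-in hash().
    return zlib.crc32(key.encode("utf-8"))
```

Two versions of a story whose leads agree on the first X words get the same 
value, even if one paper trimmed the rest. A paper that rewrites the lead 
itself would not be caught by this scheme.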

I would want to Field Collapse my results on that hash value, so that all 
occurrences of the same story are lumped together.

Also, my users would much prefer the most "authoritative" version of the story 
to be displayed as the primary result, with a count and link to the collapsed 
results. Authoritativeness could be coded as simply as 1) Wire Service, 2) 
National Paper, 3) Regional Paper, 4) Small Town Paper, which could be indexed 
and stored as an integer "authority". (For finer-grained authority we could 
store each newspaper's circulation numbers.)

Then I could display to users:
"Dog Bites Man" 
New York Times, _link to see 77 other duplicates_

So, finally getting to the point, would it be possible to make this feature 
work such that it field collapses results on one field ("similarity_hash") and 
selects the one to return based on another field ("authority" or 
"circulation")? (While still allowing the results to be sorted by a third 
field, e.g. date or relevance.)

Perhaps by a new parameter?
 collapse.authority=[field] // indexed field used for selecting which result 
from each collapsed group to return, default being... ?
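
To make the requested behavior concrete, here is a small Python sketch of the 
semantics I have in mind. It is not how the patch works internally; the 
function name and the result layout are made up for illustration, and it 
assumes a lower "authority" number means a more authoritative source, as in 
the 1) to 4) coding above.

```python
def collapse(results, collapse_field="similarity_hash",
             authority_field="authority"):
    """Collapse already-sorted results into one entry per group.

    `results` is a list of dicts, already sorted by the third field
    (date or relevance). Hypothetical sketch of the desired semantics,
    not the SOLR-236 implementation.
    """
    groups = {}  # dicts keep insertion order, so the sort order survives
    for doc in results:
        groups.setdefault(doc[collapse_field], []).append(doc)

    collapsed = []
    for docs in groups.values():
        # min() returns the first minimal item, so ties on authority
        # fall back to the original sort order.
        best = min(docs, key=lambda d: d[authority_field])
        collapsed.append({"doc": best, "duplicates": len(docs) - 1})
    return collapsed
```

Each group appears at the position of its best-ranked member in the original 
sort, but the document shown is the most authoritative one, along with the 
count needed for the "see N other duplicates" link. For a "circulation" field, 
where higher is better, the selection would use max() instead.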

If this sounds familiar, it is somewhat similar to what Google News is doing:
  http://www.pcworld.com/article/id,136680/article.html

Final question: Do you think Field Collapse could work nicely with SOLR-303 
Federated Search, or is that a bridge too far?

> Field collapsing
> ----------------
>
>                 Key: SOLR-236
>                 URL: https://issues.apache.org/jira/browse/SOLR-236
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>    Affects Versions: 1.3
>            Reporter: Emmanuel Keller
>         Attachments: field_collapsing_1.1.0.patch, 
> SOLR-236-FieldCollapsing.patch, SOLR-236-FieldCollapsing.patch, 
> SOLR-236-FieldCollapsing.patch
>
>
> This patch includes a new feature called "Field collapsing".
> "Used in order to collapse a group of results with similar value for a given 
> field to a single entry in the result set. Site collapsing is a special case 
> of this, where all results for a given web site is collapsed into one or two 
> entries in the result set, typically with an associated "more documents from 
> this site" link. See also Duplicate detection."
> http://www.fastsearch.com/glossary.aspx?m=48&amid=299
> The implementation adds 3 new query parameters (SolrParams):
> "collapse.field" to choose the field used to group results
> "collapse.type" normal (default value) or adjacent
> "collapse.max" to select how many continuous results are allowed before 
> collapsing
> TODO (in progress):
> - More documentation (on source code)
> - Test cases
> Two patches:
> - "field_collapsing.patch" for current development version
> - "field_collapsing_1.1.0.patch" for Solr-1.1.0
> P.S.: Feedback and misspelling correction are welcome ;-)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
