[ 
https://issues.apache.org/jira/browse/SOLR-236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771155#action_12771155
 ] 

Martijn van Groningen commented on SOLR-236:
--------------------------------------------

It certainly has been going on for a long time :-)
Talking about the last miles, there are a few things I have in mind for field 
collapsing:
* Change the response format. Currently, when I look at the response, even I 
sometimes get confused about the information returned. The response should be 
more structured. Something like this:
{code:xml}
<lst name="collapse_counts">
    <str name="field">venue</str>
    <lst name="results">
        <lst name="233238"> <!-- id of most relevant document of the group -->
            <str name="fieldValue">melkweg</str>
            <int name="collapseCount">2</int>
            <!-- and other CollapseCollector specific collapse information -->
        </lst>
        ...
    </lst>
</lst>
{code}
Currently, when doing adjacent field collapsing, the _collapse_counts_ gives 
unusable results. The _collapse_counts_ uses the field value as key, which is 
not unique for adjacent collapsing, as shown in this example: 
{code:xml}
<lst name="collapse_counts">
 <int name="hard">1</int>
 <int name="hard">1</int>
 <int name="electronics">1</int>
 <int name="memory">2</int>
 <int name="monitor">1</int>
</lst>
{code}
* Add the notion of a CollapseMatcher, that decides whether document field 
values are equal or not and thus whether they are allowed to be collapsed. This 
opens the road for more exotic features like fuzzy field collapsing and 
collapsing on more than one field. Also this allows users of the patch to 
easily implement their own matching rules.
* Distributed field collapsing. Although I have some ideas on how to get 
started, from my perspective it is not going to be performant, because somehow 
the field collapse state has to be shared between shards in order to do proper 
field collapsing. This state can potentially be a lot of data, depending on 
the specific search and corpus.
* And maybe add a collapse collector that collects statistics about the most 
common field value per collapsed group. 
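
To make the CollapseMatcher idea above a bit more concrete, here is a minimal 
sketch. Only the name CollapseMatcher comes from the list above; the method 
signature, the AdjacentGrouping helper and the two example matchers are 
assumptions for illustration and are not part of any attached patch. It groups 
adjacent field values and shows how swapping the matcher changes which 
documents would be collapsed together.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical interface: decides whether a candidate document's field value
// belongs to the group whose head document has groupValue.
interface CollapseMatcher {
    boolean matches(String groupValue, String candidateValue);
}

// Illustrative helper only, not the patch's collapsing code.
final class AdjacentGrouping {
    private AdjacentGrouping() {}

    // Strict equality, roughly what plain field collapsing does.
    static final CollapseMatcher EXACT = (a, b) -> a.equals(b);

    // A looser rule, standing in for "more exotic" matching.
    static final CollapseMatcher IGNORE_CASE = (a, b) -> a.equalsIgnoreCase(b);

    // Walks the values in result order and returns the size of each group of
    // adjacent values that the matcher considers equal to the group's head.
    static List<Integer> groupSizes(List<String> fieldValues, CollapseMatcher matcher) {
        List<Integer> sizes = new ArrayList<>();
        String head = null;
        for (String value : fieldValues) {
            if (head != null && matcher.matches(head, value)) {
                int last = sizes.size() - 1;
                sizes.set(last, sizes.get(last) + 1); // extend the current group
            } else {
                head = value;                          // start a new group
                sizes.add(1);
            }
        }
        return sizes;
    }
}
```

For the values "Melkweg", "melkweg", "paradiso", "melkweg" the EXACT matcher 
yields four groups of one document each, while IGNORE_CASE merges the first two 
into one group. Note that the last "melkweg" still forms its own group, which 
is exactly why the field value alone is not a unique key for adjacent 
collapsing.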

I think that this is more or less the roadmap from my side for field collapsing 
at the moment, but feel free to elaborate on this.
Btw, I have recently written a 
[blog|http://blog.jteam.nl/2009/10/20/result-grouping-field-collapsing-with-solr/]
 about field collapsing in general, which might be handy for anyone who is 
implementing field collapsing. 

> Field collapsing
> ----------------
>
>                 Key: SOLR-236
>                 URL: https://issues.apache.org/jira/browse/SOLR-236
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>    Affects Versions: 1.3
>            Reporter: Emmanuel Keller
>             Fix For: 1.5
>
>         Attachments: collapsing-patch-to-1.3.0-dieter.patch, 
> collapsing-patch-to-1.3.0-ivan.patch, collapsing-patch-to-1.3.0-ivan_2.patch, 
> collapsing-patch-to-1.3.0-ivan_3.patch, field-collapse-3.patch, 
> field-collapse-4-with-solrj.patch, field-collapse-5.patch, 
> field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, 
> field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, 
> field-collapse-5.patch, field-collapse-5.patch, 
> field-collapse-solr-236-2.patch, field-collapse-solr-236.patch, 
> field-collapsing-extended-592129.patch, field_collapsing_1.1.0.patch, 
> field_collapsing_1.3.patch, field_collapsing_dsteigerwald.diff, 
> field_collapsing_dsteigerwald.diff, field_collapsing_dsteigerwald.diff, 
> SOLR-236-FieldCollapsing.patch, SOLR-236-FieldCollapsing.patch, 
> SOLR-236-FieldCollapsing.patch, solr-236.patch, SOLR-236_collapsing.patch, 
> SOLR-236_collapsing.patch
>
>
> This patch include a new feature called "Field collapsing".
> "Used in order to collapse a group of results with similar value for a given 
> field to a single entry in the result set. Site collapsing is a special case 
> of this, where all results for a given web site is collapsed into one or two 
> entries in the result set, typically with an associated "more documents from 
> this site" link. See also Duplicate detection."
> http://www.fastsearch.com/glossary.aspx?m=48&amid=299
> The implementation add 3 new query parameters (SolrParams):
> "collapse.field" to choose the field used to group results
> "collapse.type" normal (default value) or adjacent
> "collapse.max" to select how many continuous results are allowed before 
> collapsing
> TODO (in progress):
> - More documentation (on source code)
> - Test cases
> Two patches:
> - "field_collapsing.patch" for current development version
> - "field_collapsing_1.1.0.patch" for Solr-1.1.0
> P.S.: Feedback and misspelling correction are welcome ;-)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
