[jira] Updated: (SOLR-236) Field collapsing

Martijn van Groningen (JIRA) Thu, 10 Sep 2009 16:26:32 -0700

     [ 
https://issues.apache.org/jira/browse/SOLR-236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Martijn van Groningen updated SOLR-236:
---------------------------------------

    Attachment: field-collapse-5.patch

I have updated the field collapse patch with the following:
1. Added the return collapsed documents feature. When the parameter 
_collapse.includeCollapsedDocs_ is specified with the value true, the collapsed 
documents are returned per distinct field value. When this feature is enabled, 
a collapsedDocs element is added to the field collapse part of the response. It 
looks like this:
{code:xml}
<lst name="collapsedDocs">
  <result name="Amsterdam" numFound="2" start="0">
    <doc>
      <str name="id">262701</str>
      <str name="title">Bitterzoet, 100% Halal, Appletree Records &amp; Deux d'Amsterdam presents</str>
    </doc>
    <doc>
      <str name="id">327511</str>
      <str name="title">Salsa Danscafé</str>
    </doc>
  </result>
</lst>
{code}
It is also possible to return only specific fields with the 
_collapse.includeCollapsedDocs.fl_ parameter. It expects field names delimited 
by commas, just like the normal fl parameter. 

This feature can dramatically impact performance, because a group can 
potentially contain many documents, all of which have to be retrieved from the 
index and transported over the wire. So it is certainly wise to use it in 
combination with the fl parameter. 
2. Added SolrJ support for the collapsed documents feature. 
3. Added the performance improvements that Abdul suggested.
4. The debug information is now *not* returned by default. When the parameter 
_collapse.debug_ with value true is specified, then the debug information is 
returned.
5. When field collapsing is done on a field that is multivalued or tokenized, 
an exception is thrown. I have chosen to do this because collapsing on 
such fields leads to unexpected results. For example, when a field is tokenized, 
only the last token of the field can be retrieved from the fieldcache (the 
fieldcache is used for retrieving the fields from the index in a cached manner 
for grouping documents into groups of distinct field values). This results in 
collapsing only on the last token of a field value instead of the complete 
field value. Multivalued fields have similar behaviour; in addition, for 
multivalued fields the Lucene FieldCache throws an exception when there are more 
tokens for a field than documents. Personally I think that throwing an exception 
is better than having unexpected results; at least it is then clear that 
something related to field collapsing is wrong.
6. When doing a normal field collapse and not sorting on score the Solr caching 
mechanism is used. Unfortunately this was previously not the case.
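
As an illustration of items 1 and 4, a request using these parameters and a 
reader for the collapsedDocs element could be sketched in Python as below. This 
is only a sketch under assumptions: the endpoint URL, the query "music" and the 
collapse field "city" are invented for the example, the helper names are 
hypothetical, and the sample response is a trimmed copy of the fragment shown 
in item 1. No request is actually sent.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Hypothetical Solr endpoint (assumption); adjust to your own installation.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def build_collapse_url(query, collapse_field, collapsed_fl=None, debug=False):
    """Build a select URL that asks for the collapsed documents per group."""
    params = [
        ("q", query),
        ("collapse.field", collapse_field),
        ("collapse.includeCollapsedDocs", "true"),
    ]
    if collapsed_fl:
        # Comma-delimited field names, just like the normal fl parameter.
        params.append(("collapse.includeCollapsedDocs.fl", ",".join(collapsed_fl)))
    if debug:
        # Debug information is only returned when explicitly requested.
        params.append(("collapse.debug", "true"))
    return SOLR_SELECT_URL + "?" + urlencode(params)


def parse_collapsed_docs(xml_text):
    """Map each distinct field value to the list of its collapsed documents.

    Assumes xml_text is the collapsedDocs element itself, not a full response.
    """
    root = ET.fromstring(xml_text.strip())
    groups = {}
    for result in root.findall("result"):
        docs = []
        for doc in result.findall("doc"):
            docs.append({field.get("name"): field.text for field in doc})
        groups[result.get("name")] = docs
    return groups


# Trimmed copy of the response fragment from item 1.
SAMPLE = """
<lst name="collapsedDocs">
  <result name="Amsterdam" numFound="2" start="0">
    <doc>
      <str name="id">262701</str>
    </doc>
    <doc>
      <str name="id">327511</str>
      <str name="title">Salsa Danscafé</str>
    </doc>
  </result>
</lst>
"""

if __name__ == "__main__":
    print(build_collapse_url("music", "city", ["id", "title"]))
    for value, docs in parse_collapsed_docs(SAMPLE).items():
        print(value, [doc["id"] for doc in docs])
```

Restricting the collapsed documents to a few fields, as in the call above, 
keeps the amount of data that has to be read from the index and sent over the 
wire small.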

@Paul
When doing non-adjacent collapsing (aka normal collapsing) and sorting on score, 
the Solr caches are not being used. The current patch uses the Solr caches when 
doing a search without scoring, but the most common case is of course field 
collapsing with sorting on score. The caches are bypassed in that case because 
the non-adjacent field collapse algorithm requires the score of all results, 
which is collected with a Lucene collector, and the search method on the 
SolrIndexSearcher that takes a collector does not have caching capabilities. In 
the next patch I will fix this problem, so that a normal field collapse search 
uses the Solr caches as it should. The adjacent collapsing algorithm *does* use 
the Solr caches, but the algorithm is much slower than non-adjacent collapsing.
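
To make the difference between the two collapse types concrete, here is a 
simplified Python model. It is not the code from the patch, only a sketch of 
the idea: it assumes the documents are already in result order, and the 
document dicts, the "city" field and the function names are invented for the 
example. Normal collapsing looks at the whole result list, while adjacent 
collapsing only collapses neighbouring results with the same field value.

```python
def collapse_normal(docs, field, max_per_group=1):
    """Keep at most max_per_group docs per distinct value over the whole list."""
    seen = {}
    kept = []
    for doc in docs:
        value = doc[field]
        seen[value] = seen.get(value, 0) + 1
        if seen[value] <= max_per_group:
            kept.append(doc)
    return kept


def collapse_adjacent(docs, field, max_per_run=1):
    """Keep at most max_per_run docs from each run of neighbouring equal values."""
    kept = []
    previous = object()  # sentinel that never equals a real field value
    run_length = 0
    for doc in docs:
        value = doc[field]
        run_length = run_length + 1 if value == previous else 1
        previous = value
        if run_length <= max_per_run:
            kept.append(doc)
    return kept


# Invented documents, already ordered as a result list would be.
RESULTS = [
    {"id": "A", "city": "Amsterdam"},
    {"id": "B", "city": "Amsterdam"},
    {"id": "C", "city": "Utrecht"},
    {"id": "D", "city": "Amsterdam"},
]

if __name__ == "__main__":
    print([d["id"] for d in collapse_normal(RESULTS, "city")])
    print([d["id"] for d in collapse_adjacent(RESULTS, "city")])
```

In this model document D survives adjacent collapsing because it is not next 
to the other Amsterdam results, while normal collapsing drops it.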

> Field collapsing
> ----------------
>
>                 Key: SOLR-236
>                 URL: https://issues.apache.org/jira/browse/SOLR-236
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>    Affects Versions: 1.3
>            Reporter: Emmanuel Keller
>             Fix For: 1.5
>
>         Attachments: collapsing-patch-to-1.3.0-dieter.patch, 
> collapsing-patch-to-1.3.0-ivan.patch, collapsing-patch-to-1.3.0-ivan_2.patch, 
> collapsing-patch-to-1.3.0-ivan_3.patch, field-collapse-3.patch, 
> field-collapse-4-with-solrj.patch, field-collapse-5.patch, 
> field-collapse-5.patch, field-collapse-solr-236-2.patch, 
> field-collapse-solr-236.patch, field-collapsing-extended-592129.patch, 
> field_collapsing_1.1.0.patch, field_collapsing_1.3.patch, 
> field_collapsing_dsteigerwald.diff, field_collapsing_dsteigerwald.diff, 
> field_collapsing_dsteigerwald.diff, SOLR-236-FieldCollapsing.patch, 
> SOLR-236-FieldCollapsing.patch, SOLR-236-FieldCollapsing.patch, 
> solr-236.patch, SOLR-236_collapsing.patch, SOLR-236_collapsing.patch
>
>
> This patch includes a new feature called "Field collapsing".
> "Used in order to collapse a group of results with similar value for a given 
> field to a single entry in the result set. Site collapsing is a special case 
> of this, where all results for a given web site is collapsed into one or two 
> entries in the result set, typically with an associated "more documents from 
> this site" link. See also Duplicate detection."
> http://www.fastsearch.com/glossary.aspx?m=48&amid=299
> The implementation adds 3 new query parameters (SolrParams):
> "collapse.field" to choose the field used to group results
> "collapse.type" normal (default value) or adjacent
> "collapse.max" to select how many continuous results are allowed before 
> collapsing
> TODO (in progress):
> - More documentation (on source code)
> - Test cases
> Two patches:
> - "field_collapsing.patch" for current development version
> - "field_collapsing_1.1.0.patch" for Solr-1.1.0
> P.S.: Feedback and misspelling correction are welcome ;-)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
