[ https://issues.apache.org/jira/browse/SOLR-236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Martijn van Groningen updated SOLR-236:
---------------------------------------

    Attachment: field-collapse-5.patch

I have updated the field collapse patch with the following:

1. Added the return collapsed documents feature. When the parameter _collapse.includeCollapsedDocs_ is specified with the value true, the collapsed documents are returned per distinct field value. When this feature is enabled, a collapsedDocs element is added to the field collapse part of the response. It looks like this:
{code:xml}
<lst name="collapsedDocs">
  <result name="Amsterdam" numFound="2" start="0">
    <doc>
      <str name="id">262701</str>
      <str name="title">Bitterzoet, 100% Halal, Appletree Records &amp; Deux d'Amsterdam presents</str>
    </doc>
    <doc>
      <str name="id">327511</str>
      <str name="title">Salsa Danscafé</str>
    </doc>
  </result>
</lst>
{code}
It is also possible to return only specific fields with the _collapse.includeCollapsedDocs.fl_ parameter. It expects field names delimited by commas, just like the normal fl parameter. This feature can dramatically impact performance, because a group can potentially contain many documents, which all have to be retrieved from the index and transported over the wire. So it is certainly wise to use it in combination with the fl parameter.

2. Added Solrj support for the collapsed documents feature.

3. Added the performance improvements that Abdul suggested.

4. The debug information is now *not* returned by default. When the parameter _collapse.debug_ is specified with the value true, the debug information is returned.

5. When field collapsing is done on a field that is multivalued or tokenized, an exception is thrown. I have chosen to do this because collapsing on such fields leads to unexpected results. For example, when a field is tokenized, only the last token of the field can be retrieved from the fieldcache (the fieldcache is used for retrieving the fields from the index in a cached manner, for grouping documents into groups of distinct field values). This results in collapsing only on the last token of a field value instead of on the complete field value. Multivalued fields behave similarly, and in addition the Lucene FieldCache throws an exception when there are more tokens for a field than documents. Personally I think that throwing an exception is better than having unexpected results; at least it is clear that something field collapse related is wrong.

6. When doing a normal field collapse and not sorting on score, the Solr caching mechanism is used. Unfortunately this was previously not the case.

@Paul
When doing non-adjacent collapsing (aka normal collapsing) the Solr caches are not being used. The current patch uses the Solr caches when doing a search without scoring, but the most common case is of course field collapsing while sorting on score, and that case is still uncached. This is because the non-adjacent field collapse algorithm requires the score of all results, which is collected with a Lucene collector. The search method on the SolrIndexSearcher that takes a collector does not have caching capabilities. In the next patch I will fix this problem, so that a normal field collapse search uses the Solr caches as it should. The adjacent collapsing algorithm *does* use the Solr caches, but that algorithm is much slower than non-adjacent collapsing.

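To make item 1 a bit more concrete, here is a minimal client-side sketch in Python. It is not part of the patch. It only shows how the parameters named above could be put on a request and how a collapsedDocs fragment of the shape shown above could be read back. The endpoint URL, the field name city, the shortened sample titles and the helper names build_collapse_query and parse_collapsed_docs are assumptions for illustration; only the collapse.* parameter names and the response layout come from the description above.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Hypothetical Solr endpoint; adjust host, port and path to your own setup.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def build_collapse_query(query, collapse_field, collapsed_fl=None, debug=False):
    """Build a select URL that asks for collapsed documents per group."""
    params = [
        ("q", query),
        ("collapse.field", collapse_field),
        ("collapse.includeCollapsedDocs", "true"),
    ]
    if collapsed_fl:
        # Restrict the fields of the collapsed documents to keep the response small.
        params.append(("collapse.includeCollapsedDocs.fl", ",".join(collapsed_fl)))
    if debug:
        # Debug information is only returned when explicitly requested.
        params.append(("collapse.debug", "true"))
    return SOLR_SELECT_URL + "?" + urlencode(params)


def parse_collapsed_docs(xml_text):
    """Map each distinct field value to the list of its collapsed documents."""
    root = ET.fromstring(xml_text)
    # Accept either the bare <lst> fragment or a larger response containing it.
    if root.tag == "lst" and root.get("name") == "collapsedDocs":
        collapsed = root
    else:
        collapsed = root.find(".//lst[@name='collapsedDocs']")
    groups = {}
    if collapsed is None:
        return groups
    for result in collapsed.findall("result"):
        docs = []
        for doc in result.findall("doc"):
            # Each child of <doc> is a typed field element with a name attribute.
            docs.append({field.get("name"): field.text for field in doc})
        groups[result.get("name")] = docs
    return groups


# Sample fragment in the shape shown above (titles shortened).
SAMPLE = """<lst name="collapsedDocs">
  <result name="Amsterdam" numFound="2" start="0">
    <doc><str name="id">262701</str><str name="title">Bitterzoet</str></doc>
    <doc><str name="id">327511</str><str name="title">Salsa Danscafé</str></doc>
  </result>
</lst>"""

if __name__ == "__main__":
    print(build_collapse_query("*:*", "city", collapsed_fl=["id", "title"]))
    for value, docs in parse_collapsed_docs(SAMPLE).items():
        print(value, [d["id"] for d in docs])
```

Passing collapsed_fl follows the advice above: groups can be large, so asking only for the fields you need limits what has to be read from the index and sent over the wire.
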
> Field collapsing
> ----------------
>
>                 Key: SOLR-236
>                 URL: https://issues.apache.org/jira/browse/SOLR-236
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>    Affects Versions: 1.3
>            Reporter: Emmanuel Keller
>             Fix For: 1.5
>
>         Attachments: collapsing-patch-to-1.3.0-dieter.patch,
> collapsing-patch-to-1.3.0-ivan.patch, collapsing-patch-to-1.3.0-ivan_2.patch,
> collapsing-patch-to-1.3.0-ivan_3.patch, field-collapse-3.patch,
> field-collapse-4-with-solrj.patch, field-collapse-5.patch,
> field-collapse-5.patch, field-collapse-solr-236-2.patch,
> field-collapse-solr-236.patch, field-collapsing-extended-592129.patch,
> field_collapsing_1.1.0.patch, field_collapsing_1.3.patch,
> field_collapsing_dsteigerwald.diff, field_collapsing_dsteigerwald.diff,
> field_collapsing_dsteigerwald.diff, SOLR-236-FieldCollapsing.patch,
> SOLR-236-FieldCollapsing.patch, SOLR-236-FieldCollapsing.patch,
> solr-236.patch, SOLR-236_collapsing.patch, SOLR-236_collapsing.patch
>
>
> This patch includes a new feature called "Field collapsing".
> "Used in order to collapse a group of results with similar value for a given
> field to a single entry in the result set. Site collapsing is a special case
> of this, where all results for a given web site is collapsed into one or two
> entries in the result set, typically with an associated "more documents from
> this site" link. See also Duplicate detection."
> http://www.fastsearch.com/glossary.aspx?m=48&amid=299
> The implementation adds 3 new query parameters (SolrParams):
> "collapse.field" to choose the field used to group results
> "collapse.type" normal (default value) or adjacent
> "collapse.max" to select how many continuous results are allowed before
> collapsing
> TODO (in progress):
> - More documentation (on source code)
> - Test cases
> Two patches:
> - "field_collapsing.patch" for current development version
> - "field_collapsing_1.1.0.patch" for Solr-1.1.0
> P.S.: Feedback and misspelling correction are welcome ;-)

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.