[ https://issues.apache.org/jira/browse/SOLR-236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Martijn van Groningen updated SOLR-236:
---------------------------------------

    Attachment: field-collapse-5.patch

I have attached a new patch that includes a major refactoring, which makes the code cleaner and more flexible. The patch also adds aggregate functions and fixes a bug.

h3. Aggregate function and bug fix

The new patch allows you to execute aggregate functions on the collapsed documents (for example, summing the stock amount or calculating the minimum price of a collapsed group). Currently four aggregate functions are available: sum(), min(), max() and avg(). To execute one or more functions, add the _collapse.aggregate_ parameter to the request URL. The parameter expects the following syntax: _function_name(field_name)[, function_name(field_name)]_. For example, collapse.aggregate=sum(stock), min(price) might produce a result like this:

{code:xml}
<lst name="aggregatedResults">
  <lst name="sum(stock)">
    <str name="Amsterdam">10</str>
    ...
  </lst>
  <lst name="min(price)">
    <str name="Amsterdam">5.99</str>
    ...
  </lst>
</lst>
{code}

The patch also fixes a bug in the {{NonAdjacentDocumentCollapser}} that was reported on the solr-user mailing list a few days ago. An index out of bounds exception was thrown when documents were removed from an index and a field collapse search was done afterwards.

h3. Code refactoring

The code refactoring includes the following things:
* The notion of a {{CollapseGroup}}. A collapse group defines what a unique group is in the search result. This differs between the adjacent and non-adjacent document collapsers. For adjacent field collapsing a group is defined by its field value and the document id of the most relevant document in that group, so more than one collapse group may have the same field value. For normal (non-adjacent) field collapsing the group is defined by the field value alone.
* The notion of a {{CollapseCollector}} that receives the collapsed documents from a {{DocumentCollapser}} and does something with them, for example keeping a count of how many documents were collapsed per collapse group, or computing the average of a certain field such as price. As you can see in the code, a collapse group is identified by a {{CollapseGroup}} instead of by a field value or document id.
{code}
/**
 * A <code>CollapseCollector</code> is responsible for receiving collapse callbacks from the <code>DocumentCollapser</code>.
 * An implementation can choose what to do with the received callbacks and data. Whatever an implementation collects it
 * is responsible for adding its results to the response.
 *
 * Implementations of this interface don't need to be thread safe!
 */
public interface CollapseCollector {

  /**
   * Informs the <code>CollapseCollector</code> that a document has been collapsed under the specified collapseGroup.
   *
   * @param docId The id of the document that has been collapsed
   * @param collapseGroup The collapse group the docId has been collapsed under
   * @param collapseContext The collapse context
   */
  void documentCollapsed(int docId, CollapseGroup collapseGroup, CollapseContext collapseContext);

  /**
   * Informs the <code>CollapseCollector</code> about the document head.
   * The document head is the most relevant id for the specified collapseGroup.
   *
   * @param docHeadId The identifier of the document head
   * @param collapseGroup The collapse group of the document head
   * @param collapseContext The collapse context
   */
  void documentHead(int docHeadId, CollapseGroup collapseGroup, CollapseContext collapseContext);

  /**
   * Adds the <code>CollapseCollector</code> implementation specific result data to the result.
   *
   * @param result The response result
   * @param docs The documents to be added to the response
   * @param collapseContext The collapse context
   */
  void getResult(NamedList result, DocList docs, CollapseContext collapseContext);

}
{code}
There is also a {{CollapseContext}} that allows you to store data that can be shared between {{CollapseCollectors}}.
* A {{CollapseCollectorFactory}} is responsible for creating a {{CollapseCollector}}. It does this based on the {{SolrQueryRequest}}. All the logic that decides when to enable a certain {{CollapseCollector}} must be placed in the factory.
{code}
/**
 * A concrete <code>CollapseCollectorFactory</code> implementation is responsible for creating {@link CollapseCollector}
 * instances based on the {@link SolrQueryRequest}.
 */
public interface CollapseCollectorFactory {

  /**
   * Creates an instance of a CollapseCollector specified by the concrete subclass.
   * The concrete subclass decides based on the specified request if a new instance has to be created and
   * can return <code>null</code> for that matter.
   *
   * @param request The specified request
   * @return an instance of a CollapseCollector or <code>null</code>
   */
  CollapseCollector createCollapseCollector(SolrQueryRequest request);

}
{code}

Currently there are four {{CollapseCollectorFactory}} implementations:
# {{DocumentGroupCountCollapseCollectorFactory}} creates {{CollapseCollectors}} that collect the collapse counts per document group and return the counts in the response, keyed by the most relevant document id of each collapsed group.
# {{FieldValueCountCollapseCollectorFactory}} creates {{CollapseCollectors}} that collect the collapse count per collapsed group and return the counts in the response, keyed by the field value of each collapsed group.
# {{DocumentFieldsCollapseCollectorFactory}} creates {{CollapseCollectors}} that collect predefined field values from collapsed documents.
# {{AggregateCollapseCollectorFactory}} creates {{CollapseCollectors}} that create aggregate statistics based on the collapsed documents.

{{CollapseCollectorFactories}} are configured in solrconfig.xml, and by default all implementations in the patch are configured. The following configuration is sufficient:
{code:xml}
<searchComponent name="collapse" class="org.apache.solr.handler.component.CollapseComponent" />
{code}

The following configuration sets up the same {{CollapseCollectorFactories}} as the previous one:
{code:xml}
<searchComponent name="collapse" class="org.apache.solr.handler.component.CollapseComponent">
  <arr name="collapseCollectorFactories">
    <str>groupDocumentsCounts</str>
    <str>groupFieldValue</str>
    <str>groupDocumentsFields</str>
    <str>groupAggregatedData</str>
  </arr>
</searchComponent>

<fieldCollapsing>
  <collapseCollectorFactory name="groupDocumentsCounts" class="solr.fieldcollapse.collector.DocumentGroupCountCollapseCollectorFactory" />
  <collapseCollectorFactory name="groupFieldValue" class="solr.fieldcollapse.collector.FieldValueCountCollapseCollectorFactory" />
  <collapseCollectorFactory name="groupDocumentsFields" class="solr.fieldcollapse.collector.DocumentFieldsCollapseCollectorFactory" />
  <collapseCollectorFactory name="groupAggregatedData" class="org.apache.solr.search.fieldcollapse.collector.AggregateCollapseCollectorFactory">
    <lst name="aggregateFunctions">
      <str name="sum">org.apache.solr.search.fieldcollapse.collector.aggregate.SumFunction</str>
      <str name="avg">org.apache.solr.search.fieldcollapse.collector.aggregate.AverageFunction</str>
      <str name="min">org.apache.solr.search.fieldcollapse.collector.aggregate.MinFunction</str>
      <str name="max">org.apache.solr.search.fieldcollapse.collector.aggregate.MaxFunction</str>
    </lst>
  </collapseCollectorFactory>
</fieldCollapsing>
{code}

The configured {{CollapseCollectorFactories}} can be shared among different {{CollapseComponents}}.
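
To illustrate the {{CollapseCollector}} callbacks described above, here is a minimal, self-contained sketch of a collector that counts collapsed documents per field value. It is not code from the patch: {{CollapseGroup}}, {{CollapseContext}} and the collector interface are reduced stand-ins defined inside the example, the result is written to a plain {{Map}} instead of a {{NamedList}} and {{DocList}}, and the names {{CollapseCollectorSketch}}, {{CountingCollector}} and {{collapseCounts}} are made up for the illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: every nested type here is a simplified stand-in, NOT a class from the patch.
class CollapseCollectorSketch {

    /** Stand-in for CollapseGroup; assumes non-adjacent collapsing, where the field value identifies the group. */
    static final class CollapseGroup {
        final String fieldValue;

        CollapseGroup(String fieldValue) {
            this.fieldValue = fieldValue;
        }
    }

    /** Stand-in for CollapseContext: a place where collectors could share data. */
    static final class CollapseContext {
        final Map<String, Object> shared = new LinkedHashMap<>();
    }

    /** Reduced form of the interface: the result is a plain Map (assumption), not NamedList/DocList. */
    interface CollapseCollector {
        void documentCollapsed(int docId, CollapseGroup group, CollapseContext context);

        void documentHead(int docHeadId, CollapseGroup group, CollapseContext context);

        void getResult(Map<String, Object> result, CollapseContext context);
    }

    /** Counts how many documents were collapsed under each group. */
    static final class CountingCollector implements CollapseCollector {
        private final Map<String, Integer> counts = new LinkedHashMap<>();

        @Override
        public void documentCollapsed(int docId, CollapseGroup group, CollapseContext context) {
            // One more document was collapsed under this group.
            counts.merge(group.fieldValue, 1, Integer::sum);
        }

        @Override
        public void documentHead(int docHeadId, CollapseGroup group, CollapseContext context) {
            // A group that only has a head still appears in the result, with a count of 0.
            counts.putIfAbsent(group.fieldValue, 0);
        }

        @Override
        public void getResult(Map<String, Object> result, CollapseContext context) {
            // Publish a copy so later callbacks cannot change what was already reported.
            Map<String, Integer> copy = new LinkedHashMap<>(counts);
            result.put("collapseCounts", copy);
        }
    }

    public static void main(String[] args) {
        CountingCollector collector = new CountingCollector();
        CollapseContext context = new CollapseContext();
        CollapseGroup amsterdam = new CollapseGroup("Amsterdam");
        CollapseGroup rotterdam = new CollapseGroup("Rotterdam");

        collector.documentHead(1, amsterdam, context);
        collector.documentCollapsed(2, amsterdam, context);
        collector.documentCollapsed(3, amsterdam, context);
        collector.documentHead(4, rotterdam, context);

        Map<String, Object> result = new LinkedHashMap<>();
        collector.getResult(result, context);
        System.out.println(result);
    }
}
```

In the patch itself such a collector would be created by its own {{CollapseCollectorFactory}}, and that factory would have to be registered through the explicit {{fieldCollapsing}} configuration shown above.
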
Most users do not have to do this, but when you use your own {{CollapseCollectorFactory}} implementations, or someone else's, you have to configure them explicitly in this way. The order of the entries in collapseCollectorFactories matters: {{CollapseCollectors}} may share data via the {{CollapseContext}}, so the outcome can depend on the order in which they run. The {{CollapseCollectorFactories}} in the patch do not share data, but other implementations may.

The new patch contains a lot of changes, but I personally think it is a real improvement, especially the introduction of the {{CollapseCollectors}}, which allow a lot of flexibility.

By the way, any feedback or questions are welcome.

> Field collapsing
> ----------------
>
>                 Key: SOLR-236
>                 URL: https://issues.apache.org/jira/browse/SOLR-236
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>    Affects Versions: 1.3
>            Reporter: Emmanuel Keller
>             Fix For: 1.5
>
>         Attachments: collapsing-patch-to-1.3.0-dieter.patch,
> collapsing-patch-to-1.3.0-ivan.patch, collapsing-patch-to-1.3.0-ivan_2.patch,
> collapsing-patch-to-1.3.0-ivan_3.patch, field-collapse-3.patch,
> field-collapse-4-with-solrj.patch, field-collapse-5.patch,
> field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch,
> field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch,
> field-collapse-5.patch, field-collapse-solr-236-2.patch,
> field-collapse-solr-236.patch, field-collapsing-extended-592129.patch,
> field_collapsing_1.1.0.patch, field_collapsing_1.3.patch,
> field_collapsing_dsteigerwald.diff, field_collapsing_dsteigerwald.diff,
> field_collapsing_dsteigerwald.diff, SOLR-236-FieldCollapsing.patch,
> SOLR-236-FieldCollapsing.patch, SOLR-236-FieldCollapsing.patch,
> solr-236.patch, SOLR-236_collapsing.patch, SOLR-236_collapsing.patch
>
>
> This patch includes a new feature called "Field collapsing".
> "Used in order to collapse a group of results with similar value for a given
> field to a single entry in the result set. Site collapsing is a special case
> of this, where all results for a given web site is collapsed into one or two
> entries in the result set, typically with an associated "more documents from
> this site" link. See also Duplicate detection."
> http://www.fastsearch.com/glossary.aspx?m=48&amid=299
> The implementation adds 3 new query parameters (SolrParams):
> "collapse.field" to choose the field used to group results
> "collapse.type" normal (default value) or adjacent
> "collapse.max" to select how many continuous results are allowed before
> collapsing
> TODO (in progress):
> - More documentation (on source code)
> - Test cases
> Two patches:
> - "field_collapsing.patch" for current development version
> - "field_collapsing_1.1.0.patch" for Solr-1.1.0
> P.S.: Feedback and misspelling correction are welcome ;-)

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.