[ 
https://issues.apache.org/jira/browse/SOLR-236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martijn van Groningen updated SOLR-236:
---------------------------------------

    Attachment: field-collapse-5.patch

I have attached a new patch that includes a major refactoring, which makes the 
code more flexible and cleaner. The patch also adds new aggregate 
functionality and a bug fix.

h3. Aggregate function and bug fix
The new patch allows you to execute aggregate functions on the collapsed 
documents (for example, summing the stock amount or calculating the minimum 
price of a collapsed group). Currently four aggregate functions are available: 
sum(), min(), max() and avg(). To execute one or more functions, add the 
_collapse.aggregate_ parameter to the request URL. The parameter expects the 
following syntax: _function_name(field_name)[, function_name(field_name)]_. 
For example, collapse.aggregate=sum(stock), min(price) might produce a result 
like this:
{code:xml}
<lst name="aggregatedResults">
   <lst name="sum(stock)">
      <str name="Amsterdam">10</str>
      ...
   </lst>
   <lst name="min(price)">
      <str name="Amsterdam">5.99</str>
      ...
   </lst>
</lst>
{code}
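
As a rough illustration of the parameter syntax described above, here is a 
minimal sketch that splits a _collapse.aggregate_ value into function and 
field names. The class {{AggregateParamParser}} and its nested {{Call}} type 
are hypothetical names made up for this example; they are not taken from the 
patch, which may parse the parameter differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of the patch) that parses a collapse.aggregate value. */
public class AggregateParamParser {

  /** One parsed entry such as sum(stock): a function name plus a field name. */
  public static final class Call {
    public final String function;
    public final String field;

    Call(String function, String field) {
      this.function = function;
      this.field = field;
    }
  }

  // Matches function_name(field_name); whitespace inside the parentheses is tolerated.
  private static final Pattern CALL = Pattern.compile("(\\w+)\\(\\s*(\\w+)\\s*\\)");

  /** Splits the value on commas and validates every entry against the syntax. */
  public static List<Call> parse(String paramValue) {
    List<Call> calls = new ArrayList<>();
    for (String part : paramValue.split(",")) {
      Matcher m = CALL.matcher(part.trim());
      if (!m.matches()) {
        throw new IllegalArgumentException("Malformed aggregate expression: " + part.trim());
      }
      calls.add(new Call(m.group(1), m.group(2)));
    }
    return calls;
  }

  public static void main(String[] args) {
    for (Call call : parse("sum(stock), min(price)")) {
      System.out.println(call.function + " over field " + call.field);
    }
  }
}
```

The sketch rejects anything that does not match the documented 
function_name(field_name) shape instead of silently skipping it; whether the 
patch itself is that strict is an assumption.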

The patch also fixes a bug inside the {{NonAdjacentDocumentCollapser}} that was 
reported on the solr-user mailing list a few days ago. An index out of bounds 
exception was thrown when documents were removed from an index and a field 
collapse search was done afterwards.  

h3. Code refactoring
The code refactoring includes the following things:
* The notion of a {{CollapseGroup}}. A collapse group defines what a unique 
group is in the search result. The definition differs between the adjacent and 
non-adjacent document collapsers. For adjacent field collapsing a group is 
defined by its field value and the document id of the most relevant document in 
that group, so more than one collapse group may have the same field value. For 
normal (non-adjacent) field collapsing the group is defined by the field value 
alone. 
* The notion of a {{CollapseCollector}} that receives the collapsed documents 
from a {{DocumentCollapser}} and does something with them. For example, it can 
keep a count of how many documents were collapsed per collapse group or compute 
the average of a field such as price. As the code below shows, the callbacks 
identify a group by its {{CollapseGroup}} rather than by a field value or a 
document id.
{code}
/**
 * A <code>CollapseCollector</code> is responsible for receiving collapse
 * callbacks from the <code>DocumentCollapser</code>.
 * An implementation can choose what to do with the received callbacks and
 * data. Whatever an implementation collects, it
 * is responsible for adding its results to the response.
 *
 * Implementations of this interface don't need to be thread safe!
 */
public interface CollapseCollector {

  /**
   * Informs the <code>CollapseCollector</code> that a document has been
   * collapsed under the specified collapseGroup.
   *
   * @param docId The id of the document that has been collapsed
   * @param collapseGroup The collapse group the docId has been collapsed under
   * @param collapseContext The collapse context
   */
  void documentCollapsed(int docId, CollapseGroup collapseGroup,
                         CollapseContext collapseContext);

  /**
   * Informs the <code>CollapseCollector</code> about the document head.
   * The document head is the most relevant id for the specified collapseGroup.
   *
   * @param docHeadId The identifier of the document head
   * @param collapseGroup The collapse group of the document head
   * @param collapseContext The collapse context
   */
  void documentHead(int docHeadId, CollapseGroup collapseGroup,
                    CollapseContext collapseContext);

  /**
   * Adds the <code>CollapseCollector</code> implementation specific result
   * data to the result.
   *
   * @param result The response result
   * @param docs The documents to be added to the response
   * @param collapseContext The collapse context
   */
  void getResult(NamedList result, DocList docs,
                 CollapseContext collapseContext);

}
{code}
There is also a {{CollapseContext}} that allows you to store data that can be 
shared between {{CollapseCollectors}}. 
* A {{CollapseCollectorFactory}} is responsible for creating a 
{{CollapseCollector}}. It does this based on the {{SolrQueryRequest}}. All the 
logic for when to enable a certain {{CollapseCollector}} must be placed in the 
factory. 
{code}
/**
 * A concrete <code>CollapseCollectorFactory</code> implementation is
 * responsible for creating {@link CollapseCollector}
 * instances based on the {@link SolrQueryRequest}.
 */
public interface CollapseCollectorFactory {

  /**
   * Creates an instance of a CollapseCollector specified by the concrete
   * subclass.
   * The concrete subclass decides based on the specified request if a new
   * instance has to be created and
   * can return <code>null</code> for that matter.
   * 
   * @param request The specified request
   * @return an instance of a CollapseCollector or <code>null</code>
   */
  CollapseCollector createCollapseCollector(SolrQueryRequest request);

}
{code}
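
To make the "factory decides, and may return null" contract above more 
concrete, here is a small self-contained sketch. It does not use the Solr 
classes: {{SimpleCollector}} and {{SimpleCollectorFactory}} are simplified 
stand-ins invented for this example, and the request is modelled as a plain 
parameter map instead of a {{SolrQueryRequest}}. Enabling the collector when 
_collapse.aggregate_ is present is an assumption about how such a factory 
could decide, not a description of the patch.

```java
import java.util.HashMap;
import java.util.Map;

public class FactorySketch {

  /** Stand-in for CollapseCollector, reduced to a single descriptive method. */
  public interface SimpleCollector {
    String describe();
  }

  /** Stand-in for CollapseCollectorFactory; the request is a plain parameter map. */
  public interface SimpleCollectorFactory {
    SimpleCollector create(Map<String, String> requestParams);
  }

  /** Hypothetical factory that only enables its collector when collapse.aggregate is set. */
  public static class AggregateFactory implements SimpleCollectorFactory {
    @Override
    public SimpleCollector create(Map<String, String> requestParams) {
      final String aggregate = requestParams.get("collapse.aggregate");
      if (aggregate == null || aggregate.trim().isEmpty()) {
        return null; // nothing requested, so no collector is created for this request
      }
      return () -> "aggregate collector for " + aggregate;
    }
  }

  public static void main(String[] args) {
    Map<String, String> params = new HashMap<>();
    params.put("collapse.aggregate", "sum(stock)");
    SimpleCollector collector = new AggregateFactory().create(params);
    System.out.println(collector.describe());
  }
}
```

Keeping the enable/disable decision in the factory means the caller only has 
to skip null results and never needs to know which request parameters belong 
to which collector.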
Currently there are four {{CollapseCollectorFactory}} implementations:
# {{DocumentGroupCountCollapseCollectorFactory}} creates {{CollapseCollectors}} 
that collect the collapse counts per document group and return the counts in 
the response, keyed by the document id of the most relevant document in each 
collapsed group.
# {{FieldValueCountCollapseCollectorFactory}} creates {{CollapseCollectors}} 
that collect the collapse count per collapsed group and return the counts in 
the response, keyed by the field value of each collapsed group.
# {{DocumentFieldsCollapseCollectorFactory}} creates {{CollapseCollectors}} 
that collect predefined field values from collapsed documents.
# {{AggregateCollapseCollectorFactory}} creates {{CollapseCollectors}} that 
create aggregate statistics based on the collapsed documents.
{{CollapseCollectorFactories}} are configured in solrconfig.xml, and by 
default all implementations in the patch are configured. The following 
configuration is sufficient:
{code:xml}
<searchComponent name="collapse" 
class="org.apache.solr.handler.component.CollapseComponent" />
{code}
The following configuration configures the same {{CollapseCollectorFactories}} 
as the previous configuration:
{code:xml}
<searchComponent name="collapse" 
class="org.apache.solr.handler.component.CollapseComponent">
    <arr name="collapseCollectorFactories">
        <str>groupDocumentsCounts</str>
        <str>groupFieldValue</str>
        <str>groupDocumentsFields</str>
        <str>groupAggregatedData</str>
    </arr>
  </searchComponent>

  <fieldCollapsing>
    <collapseCollectorFactory name="groupDocumentsCounts" 
class="solr.fieldcollapse.collector.DocumentGroupCountCollapseCollectorFactory" 
/>

    <collapseCollectorFactory name="groupFieldValue" 
class="solr.fieldcollapse.collector.FieldValueCountCollapseCollectorFactory" />

    <collapseCollectorFactory name="groupDocumentsFields" 
class="solr.fieldcollapse.collector.DocumentFieldsCollapseCollectorFactory" />

    <collapseCollectorFactory name="groupAggregatedData" 
class="org.apache.solr.search.fieldcollapse.collector.AggregateCollapseCollectorFactory">
        <lst name="aggregateFunctions">
            <str 
name="sum">org.apache.solr.search.fieldcollapse.collector.aggregate.SumFunction</str>
            <str 
name="avg">org.apache.solr.search.fieldcollapse.collector.aggregate.AverageFunction</str>
            <str 
name="min">org.apache.solr.search.fieldcollapse.collector.aggregate.MinFunction</str>
            <str 
name="max">org.apache.solr.search.fieldcollapse.collector.aggregate.MaxFunction</str>
        </lst>
    </collapseCollectorFactory>
  </fieldCollapsing>
{code}
The configured {{CollapseCollectorFactories}} can be shared among different 
{{CollapseComponents}}. Most users do not need the explicit configuration, but 
when you create your own implementations or use someone else's, you need it in 
order to configure the {{CollapseCollectorFactory}} implementation. The order 
in collapseCollectorFactories matters: {{CollapseCollectors}} may share data 
via the {{CollapseContext}}, so a collector can depend on one that runs before 
it. The {{CollapseCollectorFactories}} in the patch do not share data, but 
other implementations may.
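
To show why the order can matter once collectors share data, here is a 
self-contained sketch. It deliberately avoids the Solr types: 
{{SimpleCollapseCollector}} is a simplified stand-in for {{CollapseCollector}}, 
the collapse group is a plain string and the {{CollapseContext}} is modelled as 
a map. Both collectors and the "collapseCounts" key are hypothetical and do not 
exist in the patch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CollectorOrderSketch {

  /** Simplified stand-in for CollapseCollector; group and context are plain Java types. */
  public interface SimpleCollapseCollector {
    void documentCollapsed(int docId, String group, Map<String, Object> context);
    void getResult(Map<String, Object> result, Map<String, Object> context);
  }

  /** Counts collapsed documents per group and publishes the counts in the shared context. */
  public static class CountCollector implements SimpleCollapseCollector {
    private final Map<String, Integer> counts = new LinkedHashMap<>();

    @Override
    public void documentCollapsed(int docId, String group, Map<String, Object> context) {
      counts.merge(group, 1, Integer::sum);
    }

    @Override
    public void getResult(Map<String, Object> result, Map<String, Object> context) {
      result.put("collapseCounts", counts);
      context.put("collapseCounts", counts); // visible to collectors that run later
    }
  }

  /** Depends on the counts that an earlier collector placed in the context. */
  public static class LargestGroupCollector implements SimpleCollapseCollector {
    @Override
    public void documentCollapsed(int docId, String group, Map<String, Object> context) {
      // nothing to collect per document in this sketch
    }

    @Override
    @SuppressWarnings("unchecked")
    public void getResult(Map<String, Object> result, Map<String, Object> context) {
      Map<String, Integer> counts = (Map<String, Integer>) context.get("collapseCounts");
      if (counts == null) {
        return; // the counting collector has not run yet, so there is nothing to report
      }
      String largest = null;
      for (Map.Entry<String, Integer> e : counts.entrySet()) {
        if (largest == null || e.getValue() > counts.get(largest)) {
          largest = e.getKey();
        }
      }
      result.put("largestGroup", largest);
    }
  }

  /** Feeds the same three collapsed documents to the collectors in the given order. */
  public static Map<String, Object> run(List<SimpleCollapseCollector> collectors) {
    Map<String, Object> context = new LinkedHashMap<>();
    String[] groups = {"Amsterdam", "Amsterdam", "Utrecht"};
    for (int docId = 0; docId < groups.length; docId++) {
      for (SimpleCollapseCollector c : collectors) {
        c.documentCollapsed(docId, groups[docId], context);
      }
    }
    Map<String, Object> result = new LinkedHashMap<>();
    for (SimpleCollapseCollector c : collectors) {
      c.getResult(result, context);
    }
    return result;
  }

  public static void main(String[] args) {
    List<SimpleCollapseCollector> ordered = new ArrayList<>();
    ordered.add(new CountCollector());
    ordered.add(new LargestGroupCollector());
    System.out.println(run(ordered).keySet());
  }
}
```

With the counting collector first, the second collector finds the counts in 
the context and adds its own entry; with the order reversed it finds nothing 
and contributes no result.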

The new patch contains a lot of changes, but I personally think that it is 
really an improvement, especially the introduction of the 
{{CollapseCollectors}}, which allow a lot of flexibility. Btw, any feedback or 
questions are welcome.

> Field collapsing
> ----------------
>
>                 Key: SOLR-236
>                 URL: https://issues.apache.org/jira/browse/SOLR-236
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>    Affects Versions: 1.3
>            Reporter: Emmanuel Keller
>             Fix For: 1.5
>
>         Attachments: collapsing-patch-to-1.3.0-dieter.patch, 
> collapsing-patch-to-1.3.0-ivan.patch, collapsing-patch-to-1.3.0-ivan_2.patch, 
> collapsing-patch-to-1.3.0-ivan_3.patch, field-collapse-3.patch, 
> field-collapse-4-with-solrj.patch, field-collapse-5.patch, 
> field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, 
> field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, 
> field-collapse-5.patch, field-collapse-solr-236-2.patch, 
> field-collapse-solr-236.patch, field-collapsing-extended-592129.patch, 
> field_collapsing_1.1.0.patch, field_collapsing_1.3.patch, 
> field_collapsing_dsteigerwald.diff, field_collapsing_dsteigerwald.diff, 
> field_collapsing_dsteigerwald.diff, SOLR-236-FieldCollapsing.patch, 
> SOLR-236-FieldCollapsing.patch, SOLR-236-FieldCollapsing.patch, 
> solr-236.patch, SOLR-236_collapsing.patch, SOLR-236_collapsing.patch
>
>
> This patch includes a new feature called "Field collapsing".
> "Used in order to collapse a group of results with similar value for a given 
> field to a single entry in the result set. Site collapsing is a special case 
> of this, where all results for a given web site is collapsed into one or two 
> entries in the result set, typically with an associated "more documents from 
> this site" link. See also Duplicate detection."
> http://www.fastsearch.com/glossary.aspx?m=48&amid=299
> The implementation adds 3 new query parameters (SolrParams):
> "collapse.field" to choose the field used to group results
> "collapse.type" normal (default value) or adjacent
> "collapse.max" to select how many continuous results are allowed before 
> collapsing
> TODO (in progress):
> - More documentation (on source code)
> - Test cases
> Two patches:
> - "field_collapsing.patch" for current development version
> - "field_collapsing_1.1.0.patch" for Solr-1.1.0
> P.S.: Feedback and misspelling correction are welcome ;-)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
