[
https://issues.apache.org/jira/browse/SOLR-236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Martijn van Groningen updated SOLR-236:
---------------------------------------
Attachment: field-collapse-5.patch
I have attached a new patch that includes a major refactoring, which makes the
code more flexible and cleaner. The patch also adds aggregate functions and
includes a bug fix.
h3. Aggregate function and bug fix
The new patch allows you to execute aggregate functions on the collapsed
documents (for example, summing the stock amount or calculating the minimum
price of a collapsed group). Currently four aggregate functions are available:
sum(), min(), max() and avg(). To execute one or more functions, add the
_collapse.aggregate_ parameter to the request URL. The
parameter expects the following syntax: _function_name(field_name)[,
function_name(field_name)]_. For example, collapse.aggregate=sum(stock),
min(price) might produce a result like this:
{code:xml}
<lst name="aggregatedResults">
  <lst name="sum(stock)">
    <str name="Amsterdam">10</str>
    ...
  </lst>
  <lst name="min(price)">
    <str name="Amsterdam">5.99</str>
    ...
  </lst>
</lst>
{code}
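To make the parameter syntax concrete, here is a minimal sketch of how a value such as sum(stock), min(price) could be split into function/field pairs. The class and method names ({{AggregateParamSketch}}, {{parse}}) are made up for illustration and are not taken from the patch, and the sketch assumes that function and field names consist only of letters, digits and underscores. The patch's own parsing may well differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the patch.
class AggregateParamSketch {

    // One "function(field)" entry; surrounding whitespace is tolerated.
    private static final Pattern ENTRY =
            Pattern.compile("\\s*(\\w+)\\(\\s*(\\w+)\\s*\\)\\s*");

    /** Returns {function, field} pairs in the order they appear in the value. */
    static List<String[]> parse(String paramValue) {
        List<String[]> result = new ArrayList<String[]>();
        for (String entry : paramValue.split(",")) {
            Matcher m = ENTRY.matcher(entry);
            if (!m.matches()) {
                throw new IllegalArgumentException("Malformed aggregate entry: " + entry);
            }
            result.add(new String[] { m.group(1), m.group(2) });
        }
        return result;
    }
}
```

Calling {{AggregateParamSketch.parse("sum(stock), min(price)")}} yields two pairs, one for each requested function, which a collector could then map to the configured function implementations.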
The patch also fixes a bug inside the {{NonAdjacentDocumentCollapser}} that was
reported on the solr-user mailing list a few days ago. An index out of bounds
exception was thrown when documents were removed from an index and a field
collapse search was done afterwards.
h3. Code refactoring
The code refactoring includes the following things:
* The notion of a {{CollapseGroup}}. A collapse group defines what a unique
group is in the search result. This differs between the adjacent and
non-adjacent document collapsers. For adjacent field collapsing, a group is
defined by its field value and the document id of the most relevant document in
that group. More than one collapse group may have the same field value. For
normal (non-adjacent) field collapsing, the group is defined by the field value
alone.
* The notion of a {{CollapseCollector}} that receives the collapsed documents
from a {{DocumentCollapser}} and processes them, for example by keeping a
count of how many documents were collapsed per collapse group or by computing
the average of a field such as price. As the code shows, a {{CollapseGroup}}
is used to identify a group instead of field values or document ids.
{code}
/**
 * A <code>CollapseCollector</code> is responsible for receiving collapse
 * callbacks from the <code>DocumentCollapser</code>.
 * An implementation can choose what to do with the received callbacks and
 * data. Whatever an implementation collects, it is responsible for adding
 * its results to the response.
 *
 * Implementations of this interface don't need to be thread safe!
 */
public interface CollapseCollector {

  /**
   * Informs the <code>CollapseCollector</code> that a document has been
   * collapsed under the specified collapseGroup.
   *
   * @param docId The id of the document that has been collapsed
   * @param collapseGroup The collapse group the docId has been collapsed under
   * @param collapseContext The collapse context
   */
  void documentCollapsed(int docId, CollapseGroup collapseGroup,
                         CollapseContext collapseContext);

  /**
   * Informs the <code>CollapseCollector</code> about the document head.
   * The document head is the most relevant id for the specified collapseGroup.
   *
   * @param docHeadId The identifier of the document head
   * @param collapseGroup The collapse group of the document head
   * @param collapseContext The collapse context
   */
  void documentHead(int docHeadId, CollapseGroup collapseGroup,
                    CollapseContext collapseContext);

  /**
   * Adds the <code>CollapseCollector</code> implementation specific result
   * data to the result.
   *
   * @param result The response result
   * @param docs The documents to be added to the response
   * @param collapseContext The collapse context
   */
  void getResult(NamedList result, DocList docs,
                 CollapseContext collapseContext);
}
{code}
There is also a {{CollapseContext}} that allows you to store data that can be
shared between {{CollapseCollectors}}.
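As an illustration of the callback style, the following is a rough, self-contained sketch of a collector that counts collapsed documents per group for the non-adjacent case. It does not implement the real {{CollapseCollector}} interface: {{FieldValueGroup}} and {{CountingCollectorSketch}} are hypothetical stand-ins, a plain {{Map}} takes the place of the response, and the "collapseCounts" key is invented for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified stand-in for a CollapseGroup in the non-adjacent case:
// the group is identified by its field value alone.
final class FieldValueGroup {
    final String fieldValue;

    FieldValueGroup(String fieldValue) {
        this.fieldValue = fieldValue;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof FieldValueGroup
                && ((FieldValueGroup) o).fieldValue.equals(fieldValue);
    }

    @Override
    public int hashCode() {
        return fieldValue.hashCode();
    }
}

// Hypothetical collector that mirrors the callback shape of CollapseCollector
// but uses only JDK types so that it runs without Solr.
class CountingCollectorSketch {
    private final Map<FieldValueGroup, Integer> counts =
            new LinkedHashMap<FieldValueGroup, Integer>();

    // Called once for every document that was collapsed under the given group.
    void documentCollapsed(int docId, FieldValueGroup group) {
        Integer current = counts.get(group);
        counts.put(group, current == null ? 1 : current + 1);
    }

    // Adds the collected counts to the (stand-in) response structure.
    void getResult(Map<String, Object> result) {
        Map<String, Integer> byFieldValue = new LinkedHashMap<String, Integer>();
        for (Map.Entry<FieldValueGroup, Integer> e : counts.entrySet()) {
            byFieldValue.put(e.getKey().fieldValue, e.getValue());
        }
        result.put("collapseCounts", byFieldValue);
    }
}
```

Because the group object carries the identity, the collector itself does not need to know whether groups are keyed by field value alone or by field value plus document head id.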
* A {{CollapseCollectorFactory}} is responsible for creating a
{{CollapseCollector}}. It does this based on the {{SolrQueryRequest}}. All the
logic that decides when to enable a certain {{CollapseCollector}} must be
placed in the factory.
{code}
/**
 * A concrete <code>CollapseCollectorFactory</code> implementation is
 * responsible for creating {@link CollapseCollector}
 * instances based on the {@link SolrQueryRequest}.
 */
public interface CollapseCollectorFactory {

  /**
   * Creates an instance of a CollapseCollector specified by the concrete
   * subclass.
   * The concrete subclass decides based on the specified request if a new
   * instance has to be created and can return <code>null</code> if not.
   *
   * @param request The specified request
   * @return an instance of a CollapseCollector or <code>null</code>
   */
  CollapseCollector createCollapseCollector(SolrQueryRequest request);
}
{code}
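The "return null when not needed" contract can be sketched as follows. This is only an illustration under simplifying assumptions: the request is modelled as a plain {{Map}} of parameters rather than a {{SolrQueryRequest}}, and {{CollectorSketch}} and {{AggregateFactorySketch}} are hypothetical names that do not appear in the patch. Only the _collapse.aggregate_ parameter name comes from the description above.

```java
import java.util.Map;

// Stand-in for a collector; the real interface is CollapseCollector.
interface CollectorSketch {
    String describe();
}

// Hypothetical factory: request parameters are modelled as a plain Map
// instead of a SolrQueryRequest so that the example runs without Solr.
class AggregateFactorySketch {

    CollectorSketch createCollapseCollector(Map<String, String> requestParams) {
        final String aggregate = requestParams.get("collapse.aggregate");
        if (aggregate == null || aggregate.trim().isEmpty()) {
            // The request did not ask for aggregates, so no collector is created.
            return null;
        }
        return new CollectorSketch() {
            public String describe() {
                return "aggregate collector for " + aggregate;
            }
        };
    }
}
```

Keeping the enabling logic in the factory means the component only has to skip {{null}} results and never needs to know which request parameters belong to which collector.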
Currently there are four {{CollapseCollectorFactory}} implementations:
# {{DocumentGroupCountCollapseCollectorFactory}} creates {{CollapseCollectors}}
that collect the collapse counts per document group and return the counts in
the response, keyed by the document id of each group's most relevant document.
# {{FieldValueCountCollapseCollectorFactory}} creates {{CollapseCollectors}}
that collect the collapse count per collapsed group and return the counts in
the response, keyed by each collapsed group's field value.
# {{DocumentFieldsCollapseCollectorFactory}} creates {{CollapseCollectors}}
that collect predefined field values from collapsed documents.
# {{AggregateCollapseCollectorFactory}} creates {{CollapseCollectors}} that
create aggregate statistics based on the collapsed documents.
{{CollapseCollectorFactories}} are configured in solrconfig.xml, and by
default all implementations in the patch are configured. The following
configuration is sufficient:
{code:xml}
<searchComponent name="collapse"
class="org.apache.solr.handler.component.CollapseComponent" />
{code}
The following configuration configures the same {{CollapseCollectorFactories}}
as the previous configuration:
{code:xml}
<searchComponent name="collapse"
                 class="org.apache.solr.handler.component.CollapseComponent">
  <arr name="collapseCollectorFactories">
    <str>groupDocumentsCounts</str>
    <str>groupFieldValue</str>
    <str>groupDocumentsFields</str>
    <str>groupAggregatedData</str>
  </arr>
</searchComponent>

<fieldCollapsing>
  <collapseCollectorFactory name="groupDocumentsCounts"
      class="solr.fieldcollapse.collector.DocumentGroupCountCollapseCollectorFactory" />
  <collapseCollectorFactory name="groupFieldValue"
      class="solr.fieldcollapse.collector.FieldValueCountCollapseCollectorFactory" />
  <collapseCollectorFactory name="groupDocumentsFields"
      class="solr.fieldcollapse.collector.DocumentFieldsCollapseCollectorFactory" />
  <collapseCollectorFactory name="groupAggregatedData"
      class="org.apache.solr.search.fieldcollapse.collector.AggregateCollapseCollectorFactory">
    <lst name="aggregateFunctions">
      <str name="sum">org.apache.solr.search.fieldcollapse.collector.aggregate.SumFunction</str>
      <str name="avg">org.apache.solr.search.fieldcollapse.collector.aggregate.AverageFunction</str>
      <str name="min">org.apache.solr.search.fieldcollapse.collector.aggregate.MinFunction</str>
      <str name="max">org.apache.solr.search.fieldcollapse.collector.aggregate.MaxFunction</str>
    </lst>
  </collapseCollectorFactory>
</fieldCollapsing>
{code}
The configured {{CollapseCollectorFactories}} can be shared among different
{{CollapseComponents}}. Most users do not need this explicit configuration, but
when you write your own implementations or use someone else's, you need it in
order to configure the {{CollapseCollectorFactory}} implementation. The order
in collapseCollectorFactories matters: {{CollapseCollectors}} may share
data via the {{CollapseContext}}, so one collector may depend on data stored by
a collector that comes before it. The {{CollapseCollectorFactories}} in the
patch do not share data, but other implementations may.
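The following sketch shows why the order can matter once collectors share data. It is a toy model, not the patch's code: {{ContextSketch}} stands in for {{CollapseContext}}, the "count" key and all class names are hypothetical, and it assumes that collectors are invoked in their configured order for every collapsed document.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Stand-in for CollapseContext: a plain key/value store shared by all collectors.
class ContextSketch {
    final Map<String, Object> data = new HashMap<String, Object>();
}

interface OrderedCollectorSketch {
    void documentCollapsed(int docId, ContextSketch context);
}

// Writes a running count of collapsed documents into the shared context.
class CountWriter implements OrderedCollectorSketch {
    public void documentCollapsed(int docId, ContextSketch context) {
        Integer n = (Integer) context.data.get("count");
        context.data.put("count", n == null ? 1 : n + 1);
    }
}

// Reads the count written by CountWriter. It only sees a count that includes
// the current document if CountWriter runs before it.
class CountReader implements OrderedCollectorSketch {
    final List<Integer> seen = new ArrayList<Integer>();

    public void documentCollapsed(int docId, ContextSketch context) {
        Integer n = (Integer) context.data.get("count");
        seen.add(n == null ? 0 : n);
    }
}

class OrderDemo {
    // Invokes the collectors in list order for each document.
    static void run(List<OrderedCollectorSketch> collectors, int docCount) {
        ContextSketch context = new ContextSketch();
        for (int docId = 0; docId < docCount; docId++) {
            for (OrderedCollectorSketch c : collectors) {
                c.documentCollapsed(docId, context);
            }
        }
    }
}
```

With the writer listed first, the reader observes counts that already include the current document; with the order reversed, it always lags one document behind.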
The new patch contains a lot of changes, but I personally think that it
is really an improvement, especially the introduction of the
{{CollapseCollectors}}, which allow a lot of flexibility. Any feedback or
questions are welcome.
> Field collapsing
> ----------------
>
> Key: SOLR-236
> URL: https://issues.apache.org/jira/browse/SOLR-236
> Project: Solr
> Issue Type: New Feature
> Components: search
> Affects Versions: 1.3
> Reporter: Emmanuel Keller
> Fix For: 1.5
>
> Attachments: collapsing-patch-to-1.3.0-dieter.patch,
> collapsing-patch-to-1.3.0-ivan.patch, collapsing-patch-to-1.3.0-ivan_2.patch,
> collapsing-patch-to-1.3.0-ivan_3.patch, field-collapse-3.patch,
> field-collapse-4-with-solrj.patch, field-collapse-5.patch,
> field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch,
> field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch,
> field-collapse-5.patch, field-collapse-solr-236-2.patch,
> field-collapse-solr-236.patch, field-collapsing-extended-592129.patch,
> field_collapsing_1.1.0.patch, field_collapsing_1.3.patch,
> field_collapsing_dsteigerwald.diff, field_collapsing_dsteigerwald.diff,
> field_collapsing_dsteigerwald.diff, SOLR-236-FieldCollapsing.patch,
> SOLR-236-FieldCollapsing.patch, SOLR-236-FieldCollapsing.patch,
> solr-236.patch, SOLR-236_collapsing.patch, SOLR-236_collapsing.patch
>
>
> This patch includes a new feature called "Field collapsing".
> "Used in order to collapse a group of results with similar value for a given
> field to a single entry in the result set. Site collapsing is a special case
> of this, where all results for a given web site is collapsed into one or two
> entries in the result set, typically with an associated "more documents from
> this site" link. See also Duplicate detection."
> http://www.fastsearch.com/glossary.aspx?m=48&amid=299
> The implementation adds 3 new query parameters (SolrParams):
> "collapse.field" to choose the field used to group results
> "collapse.type" normal (default value) or adjacent
> "collapse.max" to select how many continuous results are allowed before
> collapsing
> TODO (in progress):
> - More documentation (on source code)
> - Test cases
> Two patches:
> - "field_collapsing.patch" for current development version
> - "field_collapsing_1.1.0.patch" for Solr-1.1.0
> P.S.: Feedback and misspelling correction are welcome ;-)