[ https://issues.apache.org/jira/browse/SOLR-236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12775192#action_12775192 ]
Michael Gundlach commented on SOLR-236:
---------------------------------------
I've found an NPE that occurs when performing quasi-distributed field
collapsing.
My company only has one use case for field collapsing: collapsing on email
address. Our index is spread across multiple cores. We found that if we shard
by email address, so that all documents with a given email address are
guaranteed to appear on the same core, then we can do distributed field
collapsing.
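To illustrate the sharding rule, here is a minimal sketch of routing by email address. It is not our actual indexing code: the class name, the core names, and the choice of CRC32 as the hash are all assumptions made for the example. The only property that matters is that the same email always lands on the same core.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.zip.CRC32;

// Hypothetical helper (not part of Solr): picks a core from a document's email address.
final class EmailShardRouter {
    private final List<String> cores;

    EmailShardRouter(List<String> cores) {
        if (cores.isEmpty()) {
            throw new IllegalArgumentException("at least one core is required");
        }
        this.cores = new ArrayList<>(cores);
    }

    // Normalizes the email, hashes it, and maps the hash onto the list of cores,
    // so every document with the same email is sent to the same core.
    String coreFor(String email) {
        String key = email.trim().toLowerCase(Locale.ROOT);
        CRC32 crc = new CRC32();
        crc.update(key.getBytes(StandardCharsets.UTF_8));
        int index = (int) (crc.getValue() % cores.size()); // getValue() is never negative
        return cores.get(index);
    }

    public static void main(String[] args) {
        EmailShardRouter router =
            new EmailShardRouter(Arrays.asList("core1", "core2", "core3"));
        // Both calls print the same core name because the key is normalized first.
        System.out.println(router.coreFor("alice@example.com"));
        System.out.println(router.coreFor("Alice@Example.com "));
    }
}
```

Note that changing the number of cores changes the mapping, so a scheme like this would need a full reindex whenever cores are added or removed.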
We add &collapse.field=email and &shards=core1,core2,... to a regular query.
Each core collapses on email and sends the results back to the requestor.
Since no email appears on more than one core, this gives us distributed field
collapsing. We do lose the <collapse_count> section, but we don't need it: we
just need an accurate total document count and no more than one document per
email address in the results.
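For concreteness, the request could be assembled roughly as follows. This is a sketch only: the host, port, core paths, and query string are hypothetical, and collapse.field is only understood by a Solr build that has the SOLR-236 patch applied.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: builds a distributed query URL with field collapsing enabled.
final class CollapseQueryUrl {

    static String build(String baseUrl, String query, String collapseField, List<String> shards) {
        return baseUrl + "/select"
            + "?q=" + encode(query)
            + "&collapse.field=" + encode(collapseField)
            + "&shards=" + encode(String.join(",", shards));
    }

    // URL-encodes a single parameter value as UTF-8.
    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        // Assumed locations of the cores; adjust to the real deployment.
        List<String> shards = Arrays.asList(
            "localhost:8983/solr/core1",
            "localhost:8983/solr/core2");
        System.out.println(
            build("http://localhost:8983/solr/core1", "name:smith", "email", shards));
    }
}
```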
Unfortunately, this throws an NPE when searching on a tokenized field.
Searching string fields is fine. I don't understand exactly why the NPE
appears, but I did bandaid over it by checking explicitly for nulls at the
appropriate line in the code. No more NPE.
There's a downside, which is that if we attempt to collapse on a field other
than email -- one which has documents appearing in multiple cores -- the
results are buggy: the first search returns few documents, and the number of
documents actually displayed doesn't always match the "numFound" value. Then
upon refresh we get what we think is the correct numFound, and the correct list
of documents. This doesn't bother me too much, as you're guaranteed to get
incorrect answers from the collapse code anyway when collapsing on a field that
you didn't use as your key for sharding.
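The reason collapsing on a non-sharding field cannot be correct is easy to show with a toy model. The sketch below is not Solr code: the Doc class, the sample data, and the "keep the first document per value" rule are simplified assumptions. Each core collapses independently and the results are simply concatenated, which is effectively what happens in our setup.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Toy model of per-core collapsing followed by a naive merge.
final class ShardCollapseDemo {

    static final class Doc {
        final String id;
        final String email;
        final String city;

        Doc(String id, String email, String city) {
            this.id = id;
            this.email = email;
            this.city = city;
        }
    }

    // Keeps only the first document seen for each distinct field value.
    static List<Doc> collapse(List<Doc> docs, Function<Doc, String> field) {
        Map<String, Doc> firstByValue = new LinkedHashMap<>();
        for (Doc doc : docs) {
            firstByValue.putIfAbsent(field.apply(doc), doc);
        }
        return new ArrayList<>(firstByValue.values());
    }

    // Collapses each core on its own, then concatenates the per-core results.
    static List<Doc> collapsePerShardThenMerge(List<List<Doc>> shards, Function<Doc, String> field) {
        List<Doc> merged = new ArrayList<>();
        for (List<Doc> shard : shards) {
            merged.addAll(collapse(shard, field));
        }
        return merged;
    }

    // Documents are sharded by email: each email lives on exactly one core.
    static List<List<Doc>> sampleShards() {
        List<Doc> core1 = Arrays.asList(
            new Doc("1", "a@example.com", "Paris"),
            new Doc("2", "a@example.com", "Paris"));
        List<Doc> core2 = Arrays.asList(
            new Doc("3", "b@example.com", "Paris"));
        return Arrays.asList(core1, core2);
    }

    public static void main(String[] args) {
        List<List<Doc>> shards = sampleShards();
        // Collapsing on the sharding key: one document per email, as intended.
        System.out.println(collapsePerShardThenMerge(shards, d -> d.email).size()); // 2
        // Collapsing on another field: "Paris" survives once per core, so it appears twice.
        System.out.println(collapsePerShardThenMerge(shards, d -> d.city).size());  // 2
    }
}
```

With the city field there is only one distinct value, yet two documents come back, because neither core can know that the other holds a document with the same value.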
In the spirit of Yonik's law of patches, I have made two imperfect patches
attempting to contribute the fix, or at least point out the error:
1. I pulled trunk, applied the latest SOLR-236 patch, made my two-line change,
and created a patch file. The resultant patch file looks very different from
the latest SOLR-236 patchfile, so I assume I did something wrong.
2. I pulled trunk, made my two-line change, and created another patch file. This
file is tiny but of course is missing all of the field collapsing changes.
Would you like me to post either of these patchfiles to this issue? Or is it
sufficient to just tell you that the NPE occurred in QueryComponent.java on line
556? ("rb._responseDocs.set(sdoc.positionInResponse, doc);" where sdoc was
null.) Perhaps my use case is extraordinary enough that you're happy leaving
the NPE in place and telling other users to not do what I'm doing?
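To show the general shape of the guard without posting either patch, here is a self-contained sketch. It does not use the real Solr classes: ShardDocStub and the place method are hypothetical stand-ins for the bookkeeping object ("sdoc") and the loop around the failing statement, and the exact form of the real two-line change may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Stand-in for the failing code path; none of these types are Solr's own.
final class NullGuardSketch {

    // Hypothetical equivalent of the per-document bookkeeping entry ("sdoc").
    static final class ShardDocStub {
        final int positionInResponse;

        ShardDocStub(int positionInResponse) {
            this.positionInResponse = positionInResponse;
        }
    }

    // Puts each fetched document id into its response slot. Ids with no
    // bookkeeping entry are skipped instead of triggering an NPE.
    static List<String> place(Map<String, ShardDocStub> byId, List<String> fetchedIds, int size) {
        List<String> responseDocs = new ArrayList<>(Collections.<String>nCopies(size, null));
        for (String id : fetchedIds) {
            ShardDocStub sdoc = byId.get(id);
            if (sdoc == null) {
                continue; // the guard: without it, the next line dereferences null
            }
            responseDocs.set(sdoc.positionInResponse, id);
        }
        return responseDocs;
    }

    public static void main(String[] args) {
        Map<String, ShardDocStub> byId = new HashMap<>();
        byId.put("a", new ShardDocStub(0));
        byId.put("c", new ShardDocStub(1));
        // "b" has no entry, so it is skipped rather than causing an exception.
        System.out.println(place(byId, Arrays.asList("a", "b", "c"), 2)); // [a, c]
    }
}
```

A guard like this hides the symptom; it does not explain why the entry is missing for tokenized fields in the first place.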
Thanks!
Michael
> Field collapsing
> ----------------
>
> Key: SOLR-236
> URL: https://issues.apache.org/jira/browse/SOLR-236
> Project: Solr
> Issue Type: New Feature
> Components: search
> Affects Versions: 1.3
> Reporter: Emmanuel Keller
> Fix For: 1.5
>
> Attachments: collapsing-patch-to-1.3.0-dieter.patch,
> collapsing-patch-to-1.3.0-ivan.patch, collapsing-patch-to-1.3.0-ivan_2.patch,
> collapsing-patch-to-1.3.0-ivan_3.patch, field-collapse-3.patch,
> field-collapse-4-with-solrj.patch, field-collapse-5.patch,
> field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch,
> field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch,
> field-collapse-5.patch, field-collapse-5.patch,
> field-collapse-solr-236-2.patch, field-collapse-solr-236.patch,
> field-collapsing-extended-592129.patch, field_collapsing_1.1.0.patch,
> field_collapsing_1.3.patch, field_collapsing_dsteigerwald.diff,
> field_collapsing_dsteigerwald.diff, field_collapsing_dsteigerwald.diff,
> SOLR-236-FieldCollapsing.patch, SOLR-236-FieldCollapsing.patch,
> SOLR-236-FieldCollapsing.patch, solr-236.patch, SOLR-236_collapsing.patch,
> SOLR-236_collapsing.patch
>
>
> This patch includes a new feature called "Field collapsing".
> "Used in order to collapse a group of results with similar value for a given
> field to a single entry in the result set. Site collapsing is a special case
> of this, where all results for a given web site is collapsed into one or two
> entries in the result set, typically with an associated "more documents from
> this site" link. See also Duplicate detection."
> http://www.fastsearch.com/glossary.aspx?m=48&amid=299
> The implementation adds 3 new query parameters (SolrParams):
> "collapse.field" to choose the field used to group results
> "collapse.type" normal (default value) or adjacent
> "collapse.max" to select how many continuous results are allowed before
> collapsing
> TODO (in progress):
> - More documentation (on source code)
> - Test cases
> Two patches:
> - "field_collapsing.patch" for current development version
> - "field_collapsing_1.1.0.patch" for Solr-1.1.0
> P.S.: Feedback and misspelling correction are welcome ;-)