[ https://issues.apache.org/jira/browse/SOLR-12343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16487839#comment-16487839 ]

Hoss Man commented on SOLR-12343:
---------------------------------

{quote}... I think it should just be considered a bug.
{quote}
That's pretty much my feeling, but I wasn't sure.
{quote}Truncating the list of buckets to N before the refinement phase would 
fix the bug, but it would also throw away complete buckets that could make it 
into the top N after refinement.
{quote}
oh right ... yeah, i was forgetting about buckets that got data from all shards 
in phase #1.
{quote}Exactly which buckets we chose to refine (and exactly how many) can 
remain an implementation detail. ...
{quote}
right ... it can be heuristically determined, and very conservative in cases 
where we know it doesn't matter – but i still think there should be an explicit 
option...
----
I worked up a patch similar to the straw man i outlined above – except that i 
didn't add the {{refine:required}} variant since we're in agreement that this 
is a bug.

In the new patch:
 * buckets now keep track of how many shards contributed to them
 ** I did this with a quick and dirty BitSet instead of an {{int 
numShardsContributing}} counter since we have to handle the possibility that 
{{mergeBuckets()}} will get called more than once for a single shard when we 
have partial refinement of sub-facets (see the sketch after this list)
 ** there's a nocommit in here about the possibility of re-using the 
{{Context.sawShard}} BitSet instead – but i couldn't wrap my head around an 
efficient way to do it so i punted
 * during the final "pruning" in {{FacetFieldMerger.getMergedResult()}} a 
bucket is now excluded if it doesn't have contributions from as many shards as 
the FacetField as a whole
 ** again, i needed a new BitSet at the FacetField level to count the shards – 
because {{Context.numShards}} may include shards that never return *any* 
results for the facet (ie: an empty shard), so they never merge any data at all
 * there is a new {{overrefine:N}} option which works similarly to overrequest 
– but instead of determining how many "extra" terms to request in phase #1, it 
determines how many "extra" buckets should be in {{numBucketsToCheck}} for 
refinement in phase #2 (but if some buckets are already fully populated going 
into phase #2, then the actual number "refined" in phase #2 can be lower than 
limit+overrefine) – see the heuristic sketch after this list
 ** the default heuristic currently pays attention to the sort – since (IIUC) 
{{count desc}} and {{index asc|desc}} should never need any "over refinement" 
unless {{mincount > 1}}
 ** if we have a non-trivial sort, and the user specified an explicit 
{{overrequest:N}}, then the default heuristic for {{overrefine}} uses the same 
value {{N}}
 *** because i'm assuming that if people have explicitly requested 
{{sort:SPECIAL, refine:true, overrequest:N}} then they care about the accuracy 
of the terms to some degree N, and the bigger N is, the more we should care 
about over-refinement as well.
 ** if neither {{overrequest}} nor {{overrefine}} is explicitly set, then we 
use the same {{limit * 1.1 + 4}} type heuristic as {{overrequest}}
 ** there's another nocommit here though: if we're using a heuristic, should we 
be scaling the derived {{numBucketsToCheck}} based on {{mincount}}? ... if 
{{mincount=M > 1}} should we be doing something like {{numBucketsToCheck *= M}} 
??
 *** although, thinking about it now – this kind of mincount-based factor would 
probably make more sense in the {{overrequest}} heuristic? maybe for 
{{overrefine}} we should look at how many buckets were already fully populated 
in phase #1 _AND_ meet the mincount, and use the difference between that number 
and the limit to decide a scaling factor?
 *** either way: can probably TODO this for a future enhancement.
 * Testing-wise...
 ** These changes fix the problems in the previous test patch
 ** I've also added some more tests, but there are nocommits to add a lot more, 
including verification of nested facets
 ** I didn't want to go too deep down the testing rabbit hole until i was sure 
we wanted to go this route.
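
To make the shard-tracking + pruning idea concrete, here's a rough, 
self-contained sketch of the general approach – to be clear, this is *not* the 
code in the attached patch, and the names below ({{ShardTrackingSketch}}, 
{{mergeShardResponse()}}, {{fullyPopulatedBuckets()}}) are made up purely for 
illustration:

{code:java}
import java.util.ArrayList;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only -- hypothetical names, not the actual classes in the patch.
class ShardTrackingSketch {
  static class Bucket {
    long count;
    final BitSet contributingShards = new BitSet(); // shards that have merged data into this bucket
  }

  final Map<String,Bucket> buckets = new HashMap<>();
  // shards that returned *any* buckets for this facet -- unlike Context.numShards,
  // this ignores shards that were completely empty for the facet
  final BitSet shardsWithData = new BitSet();

  // may be called more than once for the same shard (partial refinement of
  // sub-facets), which is why a per-bucket BitSet is safer than a simple int
  // counter (count merging is deliberately over-simplified here)
  void mergeShardResponse(int shardNum, Map<String,Long> shardCounts) {
    shardsWithData.set(shardNum);
    shardCounts.forEach((term, c) -> {
      Bucket b = buckets.computeIfAbsent(term, t -> new Bucket());
      b.count += c;
      b.contributingShards.set(shardNum);
    });
  }

  // final "pruning": only buckets with contributions from every shard that had
  // data for this facet are eligible for the returned topN
  List<Map.Entry<String,Bucket>> fullyPopulatedBuckets() {
    final int expected = shardsWithData.cardinality();
    List<Map.Entry<String,Bucket>> out = new ArrayList<>();
    for (Map.Entry<String,Bucket> e : buckets.entrySet()) {
      if (e.getValue().contributingShards.cardinality() >= expected) {
        out.add(e);
      }
    }
    return out;
  }
}
{code}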
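
And a similarly rough sketch of the default heuristic for {{overrefine}} 
described above – again, the method name and exact structure are illustrative, 
not lifted from the patch:

{code:java}
// Illustrative only: roughly how many buckets the coordinator should consider
// for refinement in phase #2; a null option value means "not set by the user".
class OverrefineHeuristicSketch {
  static int numBucketsToCheck(int limit, Integer overrefine, Integer overrequest,
                               String sort, int mincount) {
    if (overrefine != null && overrefine >= 0) {
      return limit + overrefine;            // explicit overrefine:N always wins
    }
    boolean simpleSort = "count desc".equals(sort)
        || "index asc".equals(sort) || "index desc".equals(sort);
    if (simpleSort && mincount <= 1) {
      return limit;                         // these sorts shouldn't need any over-refinement
    }
    if (overrequest != null && overrequest >= 0) {
      return limit + overrequest;           // non-trivial sort: mirror the explicit overrequest:N
    }
    return (int) (limit * 1.1 + 4);         // same fudge-factor style default as overrequest
  }
}
{code}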

what do you think?

> JSON Field Facet refinement can return incorrect counts/stats for sorted 
> buckets
> --------------------------------------------------------------------------------
>
>                 Key: SOLR-12343
>                 URL: https://issues.apache.org/jira/browse/SOLR-12343
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Hoss Man
>            Priority: Major
>         Attachments: SOLR-12343.patch
>
>
> The way JSON Facet's simple refinement "re-sorts" buckets after refinement 
> can cause _refined_ buckets to be "bumped out" of the topN based on the 
> refined counts/stats depending on the sort - causing _unrefined_ buckets 
> originally discounted in phase#2 to bubble up into the topN and be returned 
> to clients *with inaccurate counts/stats*
> The simplest way to demonstrate this bug (in some data sets) is with a 
> {{sort: 'count asc'}} facet:
>  * assume shard1 returns termX & termY in phase #1 because they have very low 
> shard1 counts
>  ** but *neither* is returned at all by shard2, because these terms both have 
> very high shard2 counts.
>  * Assume termX has a slightly lower shard1 count than termY, such that:
>  ** termX "makes the cut" for the limit=N topN buckets
>  ** termY does not make the cut, and is the "N+1" known bucket at the end of 
> phase #1
>  * termX then gets included in the phase #2 refinement request against shard2
>  ** termX now has a much higher _known_ total count than termY
>  ** the coordinator now sorts termX "worse" in the sorted list of buckets 
> than termY
>  ** which causes termY to bubble up into the topN
>  * termY is ultimately included in the final result _with incomplete 
> count/stat/sub-facet data_ instead of termX
>  ** this is all independent of the possibility that termY may actually have a 
> significantly higher total count than termX across the entire collection
>  ** the key problem is that all/most of the other terms returned to the 
> client have counts/stats accumulated from all shards, but termY only has the 
> contributions from shard1
> Important Notes:
>  * This scenario can happen regardless of the amount of overrequest used. 
> Additional overrequest just increases the number of "extra" terms needed in 
> the index with "better" sort values than termX & termY in shard2
>  * {{sort: 'count asc'}} is not just an exceptional/pathological case:
>  ** any function sort where additional data provided by shards during 
> refinement can cause a bucket to "sort worse" can also cause this problem.
>  ** Examples: {{sum(price_i) asc}} , {{min(price_i) desc}} , {{avg(price_i) 
> asc|desc}} , etc...


