[ https://issues.apache.org/jira/browse/SOLR-11733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16282264#comment-16282264 ]
Hoss Man commented on SOLR-11733:
---------------------------------

bq. I mentioned in SOLR-11729 the refinement algorithm being different (and for a single-level facet field, simpler).

FWIW, here's yonik's comment from SOLR-11729, which seems to be specifically on point for this issue (emphasis mine)...

bq. It seems like there are many logical ways to refine results - I originally thought about using refine:simple because I imagined we would have other implementations in the future. Anyway, this one is the simplest one to think about and implement: *the top buckets to return for all facets are determined in the first phase.* The second phase only gets contributions from other shards for those buckets.
bq. i.e. simple refinement doesn't change the buckets you get back.

Ah ... ok. I didn't realize the refinement approach in {{json.facet}} wasn't as sophisticated as {{facet.field}}.

To summarize again (in my own words, to ensure I'm understanding you correctly):
# do a first pass, requesting "#limit + #overrequest" buckets from each shard
#* use the accumulated results of the first pass to determine the "top #limit buckets"
# do a second pass, in which we back-fill the "top #limit buckets" with data from any shards that have not yet contributed

In which case, in my example above, the reason {{yyy}} isn't refined, even though it has the same "first pass" total as {{x1}}, is because during the first pass {{x1}} sorts higher (due to a secondary tie-breaker sort on the terms), pushing {{yyy}} out of the "top 6". (Likewise, {{x2}} and {{tail}} are never considered because they were never part of the "top 6" even w/o a tie-breaker sort.)

Do I have that correct?
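If I've got that right, the two-pass flow could be sketched roughly like this. (A toy Python model of the behavior as described above, *not* Solr's actual code; shard facet data is modeled as plain term->count dicts, and all names here are made up for illustration.)

```python
def simple_refine(shards, limit, overrequest):
    """Toy model of refine:simple as described: the top buckets are fixed
    after the first pass; the second pass only back-fills their counts.

    shards: list of dicts mapping term -> count (per-shard facet data).
    """
    request = limit + overrequest

    # First pass: each shard returns its top (limit + overrequest) buckets,
    # sorted by count desc, then term asc (the tie-breaker sort).
    merged = {}
    contributed = []  # which terms each shard already reported
    for shard in shards:
        top = sorted(shard.items(), key=lambda kv: (-kv[1], kv[0]))[:request]
        contributed.append({t for t, _ in top})
        for term, count in top:
            merged[term] = merged.get(term, 0) + count

    # The "top #limit buckets" are locked in from first-pass totals alone;
    # nothing outside this set can ever bubble up later.
    chosen = sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]
    final = dict(chosen)

    # Second pass: back-fill the chosen buckets with data from any shards
    # that have not yet contributed to them.
    for shard, seen in zip(shards, contributed):
        for term in final:
            if term not in seen and term in shard:
                final[term] += shard[term]
    return sorted(final.items(), key=lambda kv: (-kv[1], kv[0]))
```

For example, with two shards and {{limit=2, overrequest=0}}, a bucket that sorts into the top 2 of the merged first pass gets its count completed in the second pass, while everything else is discarded.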
----

The bottom line: even if I don't fully grasp the current refinement mechanism you've described, you're saying the behavior I described with the above sample documents is *not* a bug: it's the intended/expected behavior of {{refine:true}} (aka {{refine:simple}}). If so, I'll edit this jira into an "Improvement", update the summary/description to clarify how {{facet.pivot}} refinement differs from {{json.facet}} + {{refine:simple}}, and leave it open for future improvement.

----
----

As far as discussion on potential improvements...

bq. From a correctness POV, smarter faceting is equivalent to increasing the overrequest amount... we still can't make guarantees.

Hmmm... I'm not sure that I agree with that assessment. I guess "mathematically" speaking it's true that, compared to a "smarter" refinement method, this "simple" refine method can produce equally "correct" top terms solely by increasing the overrequest amount -- but that's like saying we don't even need any refinement method at all, as long as we specify an infinite amount of overrequest.

With the refinement approach used by {{facet.field}} (and {{facet.pivot}}) we *can* make guarantees about the correctness of the top terms -- regardless of if/how much overrequesting is used -- _for any term that is in the "top buckets" of at least one shard_. IIUC the current {{json.facet}} refinement method can't make _any_ similar guarantees at all, regardless of what (finite) overrequest value is specified ... but {{facet.field}} certainly can. In {{facet.field}} today, if:
* a term is in the "top buckets" (limit + overrequest) returned by at least one shard
* and the sort value (ie: count) returned by that shard (along with the lowest sort-value/count returned by all other shards) indicates that the term _might_ be competitive relative to the other terms returned by other shards

...then that term is refined. That's a guarantee we can make.
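That "might be competitive" test could be sketched roughly like this. (A toy Python model of the logic described above, not the actual FacetComponent code; all names are hypothetical.)

```python
def needs_refinement(term, shard_results, top_n_floor):
    """Toy model of the facet.field-style refinement decision: refine a
    term if its best-case aggregate count could still be competitive.

    shard_results: list of dicts, each mapping term -> count for the
                   "top buckets" (limit + overrequest deep) one shard
                   returned in the first pass.
    top_n_floor:   the lowest aggregated count currently in the "top N".
    """
    best_case = 0
    for shard in shard_results:
        if term in shard:
            # this shard reported the term: use its real count
            best_case += shard[term]
        elif shard:
            # this shard didn't report the term, but the term's count
            # there can be at most the lowest count the shard DID return
            best_case += min(shard.values())
    return best_case >= top_n_floor
```

The point being: the decision only depends on what each shard returned (its buckets plus its lowest returned count), so the guarantee holds for any term in at least one shard's top buckets, regardless of overrequest.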
Meaning that even if you have shards with widely different term stats (ie: time-partitioned shards, or docs co-located due to multi-level compositeId, or block join, etc.), we can/will refine the top terms from each shard. In {{facet.field}}, the overrequest helps to:
* increase the scope of how deep we look to find the "top (candidate) terms" from each shard
* decrease the amount of data we have to request when refining

...but the *distribution* of terms across shards has very little (none? ... not certain) impact on the "correctness" of the "top N" in the aggregate. Even if the first-pass "top terms" from each shard are 100% unique, the *relative* "bottom" counts from each shard are considered before assuming that the "higher" counts should win -- meaning that if the shards have very different sizes, "top terms" from the smaller shards still have a chance of being considered as an "aggregated top term", as long as the "bottom count" from the (larger) shards is high enough to indicate that those (missing) terms might still be competitive.

But in the {{json.facet}} approach to refinement, IIUC: a term returned by only one shard won't be considered unless the count from _just that one shard_ is high enough to help it dominate over the *cumulative* counts from each of the top terms of the other shards. Which seems to not only make the amount of overrequesting _much_ more important to consider when requesting refinement, but also requires you to consider the comparative *sizes* of the shards, and the potential term distribution variances between them.

Or to put it another way...

*TL;DR: IIUC, the amount of overrequest is _much_ more important to consider when requesting refinement on {{json.facet}} than it has ever been with {{facet.field}}, but when picking an overrequest amount for {{json.facet}}, people also need to consider the relative differences in _sizes_ of their shards, and the potential term distribution variances that may exist between them.* (correct?)

----

bq. 
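To make that contrast concrete with some made-up numbers (purely illustrative; these are not from this issue's reproduction steps): two large shards plus one small shard whose top term is unique to it.

```python
# Hypothetical first-pass results (limit=2, no overrequest), term -> count:
large1 = {"a": 1000, "b": 900}   # lowest count this shard returned: 900
large2 = {"a": 950, "c": 920}    # lowest count this shard returned: 920
small  = {"z": 150, "a": 100}    # "z" only exists on the small shard

# Simple (json.facet-style) refinement: "z"'s first-pass total is just 150,
# which can't compete with the cumulative totals below, so "z" falls out of
# the top buckets after the first pass and is never refined:
first_pass = {"a": 1950, "b": 900, "c": 920, "z": 150}

# facet.field-style reasoning: "z" *might* still be competitive, because its
# best case is its real count plus the lowest count each other shard
# returned -- 150 + 900 + 920 = 1970 -- so it would be refined:
best_case_z = small["z"] + min(large1.values()) + min(large2.values())
assert best_case_z == 1970          # competitive with a's 1950
assert best_case_z > first_pass["a"]
```

Whether "z" actually wins depends on its true counts on the large shards, of course -- the point is only that one approach asks, while the other has already discarded it.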
We could easily implement a mode for some field facets that does the "could this possibly be in the top N" logic to consider more buckets in the first phase... but only if it's not a sub-facet of another partial facet (a facet with something like a limit). If we're sorting by something other than count (like stddev for instance) then I guess we'd have to discard smart pruning and just try to get all buckets we saw in the first phase.

You lost me there... If the sort is on some criteria other than count (ex: stddev), why can't we compute a hypothetical "best case" sort value for the candidates, based on the pre-aggregation values returned by the "bottom" of the other shards (ex: the sum, sumsq, and num_values already needed from each shard for the aggregated stddev), in combination with the values from the one shard that *does* have that term?

bq. If a partial facet is a sub-facet of another partial-facet, the logic of what one can exclude seems to get harder, ...

You _completely_ lost me there ... I *think* maybe you're alluding to the need for multi-stage refinement depending on how deep the nested facets go? Which, FWIW, is exactly what {{facet.pivot}} does today.


> json.facet refinement fails to bubble up some long tail (overrequested) terms?
> ------------------------------------------------------------------------------
>
>                 Key: SOLR-11733
>                 URL: https://issues.apache.org/jira/browse/SOLR-11733
>             Project: Solr
>          Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public)
>       Components: Facet Module
>        Reporter: Hoss Man
>
> Something wonky is happening with {{json.facet}} refinement.
> "Long Tail" terms that may not be in the "top n" on every shard, but are in
> the "top n + overrequest" for at least 1 shard, aren't getting refined and
> included in the aggregated response in some cases.
> I don't understand the code enough to explain this, but I have some steps to
> reproduce that I'll post in a comment shortly

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)