Michael Gibney created SOLR-15836:
-------------------------------------

             Summary: Address counterintuitive behavior of JSON "terms" 
subfacet refinement
                 Key: SOLR-15836
                 URL: https://issues.apache.org/jira/browse/SOLR-15836
             Project: Solr
          Issue Type: Improvement
      Security Level: Public (Default Security Level. Issues are Public)
          Components: Facet Module
    Affects Versions: 8.11, main (9.0)
            Reporter: Michael Gibney


In distributed faceting, uneven distribution of terms across different shards 
can cause terms to be artificially included in, or excluded from, the merged 
results (this discussion will focus on JSON Facet "terms" faceting).

This is inevitable, and can be mitigated via the {{overrequest}} and 
{{overrefine}} parameters -- respectively casting a "wider net" for "phase#1" 
(determining the set of "terms of interest") and "phase#2" (cross-checking 
"terms of interest" against shards that did not initially report them).
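
For concreteness, a request that sets both parameters might look like the sketch below. The facet label {{categories}}, the field name {{cat}}, and the numeric values are made up for illustration; only the parameter names ({{type}}, {{field}}, {{limit}}, {{refine}}, {{overrequest}}, {{overrefine}}) are taken from the JSON Facet API.

```python
import json

# Hypothetical facet label ("categories"), field name ("cat") and values.
# Only the parameter names are JSON Facet API names.
facet_request = {
    "categories": {
        "type": "terms",
        "field": "cat",
        "limit": 5,
        "refine": True,      # enable phase#2 (refinement)
        "overrequest": 20,   # extra buckets requested per shard in phase#1
        "overrefine": 10,    # extra candidate buckets considered for phase#2
    }
}


def as_json_facet_param(facets):
    """Serialize the facet definition as it would be sent in `json.facet`."""
    return json.dumps(facets, sort_keys=True)


if __name__ == "__main__":
    print(as_json_facet_param(facet_request))
```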

It is possible to devise artificial situations that push the limit of what 
{{overrefine}} is capable of mitigating, resulting in counterintuitive 
behavior. But despite such edge cases, in general it is relatively 
straightforward to reason about how the {{simple}} JSON Facet refinement method 
works for "flat" (i.e., non-hierarchical) terms facets.
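
To make the "flat" case easier to reason about, here is a toy model of the two phases. It is a sketch only, not Solr's implementation: the function names are hypothetical, each shard is just a dict of term to count, and {{overrequest}}/{{overrefine}} are treated as plain additive values passed in explicitly.

```python
from collections import Counter


def top_terms(counts, n):
    """Top n terms by count (descending), ties broken by term for determinism."""
    return dict(sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n])


def distributed_terms(shards, limit, overrequest=0, overrefine=0, refine=True):
    # Phase 1: every shard reports its own top (limit + overrequest) terms.
    responses = [top_terms(shard, limit + overrequest) for shard in shards]
    merged = Counter()
    for response in responses:
        merged.update(response)
    if not refine:
        return top_terms(merged, limit)

    # Pick the "terms of interest" to cross-check in phase 2.
    candidates = top_terms(merged, limit + overrefine)

    # Phase 2: at most one refinement request per shard, asking only for
    # the candidates that this shard did not report in phase 1.
    refined = Counter(candidates)
    for shard, response in zip(shards, responses):
        missing = [term for term in candidates if term not in response]
        for term in missing:
            refined[term] += shard.get(term, 0)
    return top_terms(refined, limit)


shards = [
    {"a": 10, "b": 9, "x": 8},
    {"c": 10, "d": 9, "x": 8, "a": 1},
]

if __name__ == "__main__":
    print(distributed_terms(shards, 2, refine=False))
    print(distributed_terms(shards, 2))
    print(distributed_terms(shards, 2, overrequest=1))
```

With this data and {{limit=2}}, the unrefined result is {{a=10, c=10}}. Refinement corrects {{a}} to 11, but {{x}} (16 in total) is still excluded because no shard reported it in phase 1. Adding {{overrequest=1}} lets {{x}} surface, giving {{x=16, a=11}}.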

This issue discusses some ways in which subfacets (hierarchical or nested 
facets) can more readily behave counterintuitively in practical usage, and 
possible ways to address/mitigate such behavior.

---------------------

AFAICT, the {{simple}} (default, currently the only) refinement method has two 
defining requirements:
# there is at most _one_ refinement request issued to each shard, and
# any buckets returned are guaranteed to have accurate counts (or perhaps more generally, stats?) reflecting contributions from all shards. (This makes [no guarantees|https://issues.apache.org/jira/browse/SOLR-11159?focusedCommentId=16103386#comment-16103386] about buckets _not_ returned that would in principle be eligible to be returned.)
 
The simplest counterintuitive case is when refinement of higher-level facets 
uncovers more subfacets on shards that have no opportunity to influence 
results/refinement of the child facet. I'm pretty sure it's this situation 
that's described in 
[this comment|https://github.com/apache/solr/blob/0287458f836e3b7ea4b2401538b29f3d2e9b6cf4/solr/core/src/test/org/apache/solr/search/facet/TestJsonFacetRefinement.java#L992-L994] 
(by [~hossman]?):

{code:java}
    //   - or at the very least, if the purpose of "_l" is to give other buckets a chance to "bubble up"
    //     in phase#2, then shouldn't a "_l" refinement requests still include the buckets choosen in
    //     phase#1, and request that the shard fill them in in addition to returning its own top buckets?
{code}

The proposal in the above linked comment would work iff the "own top buckets" 
returned in phase#2 did not introduce any new/unseen values (and note, the only 
case in which returning "own top buckets" would be significant _would_ be the 
case in which it introduces new/unseen values). If new values _were_ returned 
in phase#2, their counts would reflect only the shard that returned them, so 
the only way to ensure that requirement 2 is respected would be to violate 
requirement 1 (i.e., by issuing _another_ refinement request to determine 
whether any other shards have anything to contribute to the previously unseen 
value).
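
The conflict between the two requirements can be illustrated with another toy model. To be clear about the assumptions: this is not how Solr is implemented, the names are hypothetical, and it only looks at the child buckets under a single parent bucket. "Phase-1 shards" are the shards that reported the parent in phase#1; "phase-2 shards" only see the parent during refinement, and (following the proposal in the quoted comment) both fill in the chosen buckets and return their own top buckets.

```python
def top(counts, n):
    """Top n terms by count (descending), ties broken by term."""
    return dict(sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n])


def refine_children(phase1_shards, phase2_shards, child_limit, extra_round=False):
    totals, contributed = {}, {}
    all_shards = list(phase1_shards) + list(phase2_shards)
    n1 = len(phase1_shards)

    def add(term, shard_id, shard):
        # Count each shard's contribution to a term at most once.
        if shard_id not in contributed.setdefault(term, set()):
            contributed[term].add(shard_id)
            totals[term] = totals.get(term, 0) + shard.get(term, 0)

    # Phase 1: shards that reported the parent return their top child buckets.
    for i, shard in enumerate(phase1_shards):
        for term in top(shard, child_limit):
            add(term, i, shard)
    chosen = set(totals)

    # Phase 2: a single refinement request per shard (requirement 1).
    for i, shard in enumerate(all_shards):
        wanted = set(chosen)
        if i >= n1:
            wanted |= set(top(shard, child_limit))  # "own top buckets"
        for term in wanted:
            add(term, i, shard)
    unseen = set(totals) - chosen

    if extra_round:
        # A second refinement request: the only way to make the counts of
        # previously unseen terms accurate, at the cost of requirement 1.
        for i, shard in enumerate(all_shards):
            for term in unseen:
                add(term, i, shard)
    return totals, unseen


shard_a = {"x": 5, "z": 4}  # reported the parent in phase 1
shard_b = {"z": 6, "x": 1}  # only sees the parent during refinement

if __name__ == "__main__":
    print(refine_children([shard_a], [shard_b], child_limit=1))
    print(refine_children([shard_a], [shard_b], child_limit=1, extra_round=True))
```

With a single refinement request, {{z}} is newly introduced by {{shard_b}} and ends up with a count of 6, although its true total is 10; only the extra round recovers the contribution of 4 from {{shard_a}}.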

This counterintuitive behavior can't exactly be called a "bug", because IIUC 
the intuitive behavior is fundamentally incompatible with the current 
default/only {{simple}} refinement method.


