[ 
https://issues.apache.org/jira/browse/SOLR-16144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17518434#comment-17518434
 ] 

Michael Gibney commented on SOLR-16144:
---------------------------------------

Even if the current implementation is left as-is, we should at least throw an 
error if a client tries to explicitly set a {{min_popularity}} value less than 
{{0.00001}} (which would currently effectively exclude _all_ buckets).
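
As a rough illustration of that minimal safeguard, the check could look something like the sketch below. This is not the actual {{RelatednessAgg}} parsing code: the class name, method name, and error message are hypothetical, and the sketch assumes the check is only invoked when the client explicitly supplies {{min_popularity}}. Whether an explicit value of 0 should be exempt is left open here.

```java
// Hypothetical sketch; names are illustrative and not taken from Solr.
final class MinPopularityCheck {
    // Smallest non-zero value that survives rounding to 5 decimal digits.
    static final double MIN_RESOLVABLE = 0.00001;

    /**
     * Returns the value unchanged, or throws if an explicitly supplied
     * min_popularity is below the 5-digit rounding resolution.
     */
    static double validate(double minPopularity) {
        if (minPopularity < MIN_RESOLVABLE) {
            throw new IllegalArgumentException(
                "min_popularity must be >= " + MIN_RESOLVABLE
                    + " but was " + minPopularity);
        }
        return minPopularity;
    }

    public static void main(String[] args) {
        // Accepted: at or above the rounding resolution.
        System.out.println(validate(0.001));
        try {
            // Rejected: finer than 5 digits can express.
            validate(0.000001);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```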

However, I think it would be preferable to _not_ round these values internally. 
For a relatively high-cardinality field, perfect correlation for a 
{{background_popularity}} of 9/2,000,000 feels meaningful, and in any case well 
above any threshold that I might intuitively consider to indicate "noise". My 
sense is that different use cases would have different "noise" thresholds, and 
that the purpose of the {{min_popularity}} param is to allow clients to specify 
their own "noise" threshold. AFAICT it's cleaner, with no real downside, to 
defer rounding of popularity values until the response is externalized and 
sent back to the client.
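
To make the arithmetic concrete, here is a small standalone sketch contrasting a threshold check on the rounded value with a check on the raw value. It is only an illustration under stated assumptions: {{roundTo5Digits}}, {{passesRounded}}, and {{passesRaw}} are hypothetical helpers written for this example, and the actual comparison logic in {{RelatednessAgg}} may differ in its details.

```java
// Hypothetical sketch; helper names are illustrative and not taken from Solr.
final class PopularityRounding {
    // Assumed rounding helper: round to 5 decimal digits.
    static double roundTo5Digits(double v) {
        return Math.round(v * 100000d) / 100000d;
    }

    // Threshold check applied to the already-rounded popularity.
    static boolean passesRounded(long count, long bgSize, double minPop) {
        return roundTo5Digits((double) count / bgSize) >= minPop;
    }

    // Threshold check applied to the raw popularity; rounding would be
    // deferred until the value is written to the response.
    static boolean passesRaw(long count, long bgSize, double minPop) {
        return (double) count / bgSize >= minPop;
    }

    public static void main(String[] args) {
        long count = 9;
        long bgSize = 2_000_000;
        double minPop = 0.000001; // a client-chosen "noise" threshold

        // 9 / 2,000,000 = 0.0000045, which rounds to 0 at 5 digits.
        System.out.println(roundTo5Digits((double) count / bgSize)); // prints 0.0
        System.out.println(passesRounded(count, bgSize, minPop));    // prints false
        System.out.println(passesRaw(count, bgSize, minPop));        // prints true
    }
}
```

With rounding deferred, thresholds finer than {{0.00001}} become meaningful instead of having to be rejected.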

[PR #790|https://github.com/apache/solr/pull/790] implements the "preferred" 
proposal above.

> Don't internally round [foreground|background]_popularity values in 
> RelatednessAgg
> ----------------------------------------------------------------------------------
>
>                 Key: SOLR-16144
>                 URL: https://issues.apache.org/jira/browse/SOLR-16144
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: Facet Module
>    Affects Versions: main (10.0)
>            Reporter: Michael Gibney
>            Priority: Trivial
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> The "relatedness" facet function supports the concept of 
> {{foreground_popularity}} and {{background_popularity}} -- i.e., the 
> cardinality of the intersection of bucket domain with the foreground and 
> background sets (respectively), each normalized with respect to background 
> set cardinality.
> The purpose appears to be twofold:
> # To provide clients with context of computed relatedness values
> # To preemptively (optionally) screen out "noise" from low-frequency terms 
> via the {{min_popularity}} function parameter.
> For both purposes, popularity values are currently rounded to 5 digits.
> This issue proposes that although rounding to 5 digits makes sense for the 
> _first_ case (providing context to clients), this arbitrary truncation does 
> not make sense as currently implemented for internally evaluating threshold 
> pop values for bucket inclusion.
> Consider the case of a high-cardinality field with a relatively large 
> background set and a selective foreground set. For {{|background_set| = 
> 2,000,000}} and a foreground set of cardinality 9, the popularity of a 
> perfectly matching bucket is 9/2,000,000 = 0.0000045, which rounds to 0 at 
> 5 digits. So even a bucket with a domain that exactly matches the foreground 
> set would be screened out, for _any_ explicit setting of {{min_popularity}}.
> This behavior is due to where the rounding takes place (internally, upon 
> initial {{computeDerivedValues()}}). It is further problematic that 
> {{RelatednessAgg}} will currently accept {{min_popularity < 0.00001}}, which 
> would be guaranteed to exclude _all_ buckets.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
