[jira] [Commented] (SOLR-15008) Avoid building OrdinalMap for each facet

Michael Gibney (Jira) Thu, 19 Nov 2020 07:25:09 -0800


    [ 
https://issues.apache.org/jira/browse/SOLR-15008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17235549#comment-17235549
 ]


Michael Gibney commented on SOLR-15008:
---------------------------------------

Interesting; I'm surprised that profiling indicated {{OrdinalMap}} building, 
since I'm pretty sure the {{OrdinalMap}} instances (as accessed via 
{{FacetFieldProcessorByArrayDV}}) are already cached in the way you're 
suggesting:
# in 
[FacetFieldProcessorByArrayDV.findStartAndEndOrds(...)|https://github.com/apache/lucene-solr/blob/40e2122b5a5b89f446e51692ef0d72e48c7b71e5/solr/core/src/java/org/apache/solr/search/facet/FacetFieldProcessorByArrayDV.java#L60]
# in 
[FieldUtil.getSortedSetDocValues(...)|https://github.com/apache/lucene-solr/blob/1d85cd783863f75cea133fb9c452302214165a4d/solr/core/src/java/org/apache/solr/search/facet/FieldUtil.java#L55]
# in 
[SlowCompositeReaderWrapper.getSortedSetDocValues(...)|https://github.com/apache/lucene-solr/blob/c02f07f2d5db5c983c2eedf71febf9516189595d/solr/core/src/java/org/apache/solr/index/SlowCompositeReaderWrapper.java#L197-L211]
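
To illustrate the kind of caching meant here, below is a minimal sketch of a per-reader, per-field cache. It is not the actual Solr code: {{PerReaderCacheSketch}}, {{builder}} and {{builds}} are hypothetical names, and the cached value stands in for an {{OrdinalMap}}. The assumption is that one such cache lives as long as the reader/searcher it belongs to, so the expensive build happens at most once per field until a new searcher is opened.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical sketch of a per-reader cache keyed by field name; not Solr's implementation. */
final class PerReaderCacheSketch<V> {
    // One entry per field; the whole cache is discarded together with its reader.
    private final Map<String, V> cache = new ConcurrentHashMap<>();
    // Stand-in for the expensive step (e.g. building an ordinal map for a field).
    private final Function<String, V> builder;
    // Counts how many times the expensive build actually ran (for illustration only).
    final AtomicInteger builds = new AtomicInteger();

    PerReaderCacheSketch(Function<String, V> builder) {
        this.builder = builder;
    }

    /** Returns the cached value for the field, building it only on first access. */
    V get(String field) {
        return cache.computeIfAbsent(field, f -> {
            builds.incrementAndGet();
            return builder.apply(f);
        });
    }
}
```

With a cache like this, two facet requests on the same field against the same reader would share one build; only a new reader (e.g. after a commit) would pay the cost again.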

Do you have more information about the total numbers involved (high-cardinality 
field -- specifically how high per core? how many documents overall per core? 
how many cores? does the latency manifest even across a single indexSearcher -- 
i.e., no intervening updates?). A couple of things that might be worth doing in 
the meantime, just as a sanity check:
# disable refinement for the facet field ({{"refinement":"none"}}) -- among 
other things, this would take the {{filterCache}} out of the equation
# if possible, try optimizing each replica to a single segment, which should 
take {{OrdinalMap}} out of the equation (this is of course strictly diagnostic, 
not a "workaround" suggestion).

{quote}Allow faceting on actual values (a Map) rather than ordinals
{quote}
Interesting -- even if {{OrdinalMap}} is already getting cached (as I think it 
is?), this would be one way to avoid the overhead of allocating a 
{{CountSlotArrAcc}} backed by an int array of a size matching the field 
cardinality (this is why I asked more specifically about the cardinality of the 
field involved). I'm not sure how big a problem this is in practice, but I 
imagine a value-Map-based faceting implementation would probably perform better 
for this type of use case ... not 100% sure though, and not sure how _much_ 
better ... (I think {{FacetFieldProcessorByHashDV}} was designed to meet a 
similar use case, but it only works for single-valued fields).
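
As a rough illustration of the trade-off, here is a sketch contrasting the two counting strategies. It is not Solr's facet code: {{FacetCountSketch}} and both method names are hypothetical, and the input is simplified to a flat array of ordinals for the matching documents. The point is only that the array variant allocates one slot per unique value in the field, while the map variant grows with the values actually seen.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration of array-backed vs map-backed counting; not Solr's implementation. */
final class FacetCountSketch {

    /** Ordinal-based counting: allocates one int slot per unique value in the field. */
    static int[] countByOrdinal(int[] matchingDocOrds, int fieldCardinality) {
        // Allocation is proportional to field cardinality, even for a handful of matching docs.
        int[] counts = new int[fieldCardinality];
        for (int ord : matchingDocOrds) {
            counts[ord]++;
        }
        return counts;
    }

    /** Map-based counting: memory grows only with the distinct values actually encountered. */
    static Map<Integer, Integer> countByMap(int[] matchingDocOrds) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int ord : matchingDocOrds) {
            counts.merge(ord, 1, Integer::sum);
        }
        return counts;
    }
}
```

For a few hundred matching documents on a field with millions of unique values, the first variant still allocates millions of slots per request, whereas the second holds at most a few hundred entries. Whether that difference matters in practice is exactly the open question above.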

> Avoid building OrdinalMap for each facet
> ----------------------------------------
>
>                 Key: SOLR-15008
>                 URL: https://issues.apache.org/jira/browse/SOLR-15008
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: Facet Module
>    Affects Versions: 8.7
>            Reporter: Radu Gheorghe
>            Priority: Major
>              Labels: performance
>         Attachments: Screenshot 2020-11-19 at 12.01.55.png
>
>
> I'm running against the following scenario:
>  * [JSON] faceting on a high cardinality field
>  * few matching documents => few unique values
> Yet the query almost always takes a long time. Here's an example taking 
> almost 4s for ~300 documents and unique values (edited a bit):
>  
> {code:java}
>     "QTime":3869,
>     "params":{
>       "json":"{\"query\": \"*:*\",
>       \"filter\": [\"type:test_type\", \"date:[1603670360 TO 1604361599]\", 
> \"unique_id:49866\"]
>       \"facet\": 
> {\"keywords\":{\"type\":\"terms\",\"field\":\"keywords\",\"limit\":20,\"mincount\":20}}}",
>       "rows":"0"}},
>   
> "response":{"numFound":333,"start":0,"maxScore":1.0,"numFoundExact":true,"docs":[]
>   },
>   "facets":{
>     "count":333,
>     "keywords":{
>       "buckets":[{
>           "val":"value1",
>           "count":124},
>   ...
> {code}
> I did some [profiling with our Sematext 
> Monitoring|https://sematext.com/docs/monitoring/on-demand-profiling/] and it 
> points me to OrdinalMap building (see attached screenshot). If I read the 
> code right, an OrdinalMap is built with every facet. And it's expensive since 
> there are many unique values in the shard (previously, there were more, smaller 
> shards, making latency better, but this approach doesn't scale for this 
> particular use-case).
> If I'm right up to this point, I see a couple of potential improvements, 
> [inspired by 
> Elasticsearch|#search-aggregations-bucket-terms-aggregation-execution-hint]:
>  # *Keep the OrdinalMap cached until the next softCommit*, so that only the 
> first query takes the penalty
>  # *Allow faceting on actual values (a Map) rather than ordinals*, for 
> situations like the one above where we have few matching documents. We could 
> potentially auto-detect this scenario (e.g. by configuring a threshold) and 
> use a Map when there are few documents
> I'm curious about what you're thinking:
>  * would a PR/patch be welcome for either of the two ideas above?
>  * do you see better options? am I missing something?
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
