[ https://issues.apache.org/jira/browse/SOLR-15008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17235549#comment-17235549 ]
Michael Gibney commented on SOLR-15008:
---------------------------------------

Interesting; I'm surprised that profiling indicated {{OrdinalMap}} building, since I'm pretty sure the {{OrdinalMap}} instances (as accessed via {{FacetFieldProcessorByArrayDV}}) are already cached in the way you're suggesting:
# in [FacetFieldProcessorByArrayDV.findStartAndEndOrds(...)|https://github.com/apache/lucene-solr/blob/40e2122b5a5b89f446e51692ef0d72e48c7b71e5/solr/core/src/java/org/apache/solr/search/facet/FacetFieldProcessorByArrayDV.java#L60]
# in [FieldUtil.getSortedSetDocValues(...)|https://github.com/apache/lucene-solr/blob/1d85cd783863f75cea133fb9c452302214165a4d/solr/core/src/java/org/apache/solr/search/facet/FieldUtil.java#L55]
# in [SlowCompositeReaderWrapper.getSortedSetDocValues(...)|https://github.com/apache/lucene-solr/blob/c02f07f2d5db5c983c2eedf71febf9516189595d/solr/core/src/java/org/apache/solr/index/SlowCompositeReaderWrapper.java#L197-L211]

Do you have more information about the total numbers involved? Specifically: how high is the cardinality of the field per core? How many documents are there overall per core? How many cores? Does the latency manifest even across a single indexSearcher (i.e., with no intervening updates)?

A couple of things that might be worth doing in the meantime, just as a sanity check:
# disable refinement for the facet field ({{"refinement":"none"}}) -- among other things, this would take the {{filterCache}} out of the equation
# if possible, try optimizing each replica to a single segment, which should take {{OrdinalMap}} out of the equation (this is of course strictly diagnostic, not a "workaround" suggestion).
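For illustration, the kind of caching referred to at the top of this comment can be sketched roughly as follows. This is only a minimal sketch and not Solr's actual code: {{PerReaderOrdinalMapCache}}, its {{builder}} function, and the {{long[]}} standing in for an {{OrdinalMap}} are all hypothetical names, and it assumes the cache lives exactly as long as one reader/searcher.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical stand-in for a cache scoped to one reader/searcher (not a Solr class).
final class PerReaderOrdinalMapCache {
    // Field name -> global-ordinal mapping; a long[] stands in for an OrdinalMap here.
    private final Map<String, long[]> cache = new ConcurrentHashMap<>();
    // Counts how many times the expensive build actually ran.
    private final AtomicInteger builds = new AtomicInteger();
    // Assumed expensive build step, supplied by the caller.
    private final Function<String, long[]> builder;

    PerReaderOrdinalMapCache(Function<String, long[]> builder) {
        this.builder = builder;
    }

    // Builds the mapping on first access for a field, then reuses it.
    long[] get(String field) {
        return cache.computeIfAbsent(field, f -> {
            builds.incrementAndGet();
            return builder.apply(f);
        });
    }

    int buildCount() {
        return builds.get();
    }
}

final class OrdinalMapCacheDemo {
    public static void main(String[] args) {
        PerReaderOrdinalMapCache cache =
                new PerReaderOrdinalMapCache(field -> new long[1_000]);
        cache.get("keywords"); // first facet request: pays the build cost
        cache.get("keywords"); // later requests on the same reader: cache hit
        System.out.println("builds = " + cache.buildCount());
    }
}
```

If the real cache behaves like this, only the first facet request after a new searcher opens should pay the build cost, which is why latency on every request within a single searcher would be surprising.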
{quote}Allow faceting on actual values (a Map) rather than ordinals
{quote}
Interesting -- even if {{OrdinalMap}} is already getting cached (as I think it is?), this would be one way to avoid the overhead of allocating a {{CountSlotArrAcc}} backed by an int array of a size matching the field cardinality (this is why I asked more specifically about the cardinality of the field involved). I'm not sure how big a problem this is in practice, but I imagine a value-Map-based faceting implementation would probably perform better for this type of use case ... not 100% sure though, and not sure how _much_ better ... (I think {{FacetFieldProcessorByHashDV}} was designed to meet a similar use case, but it only works for single-valued fields).

> Avoid building OrdinalMap for each facet
> ----------------------------------------
>
>                 Key: SOLR-15008
>                 URL: https://issues.apache.org/jira/browse/SOLR-15008
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public)
>          Components: Facet Module
>    Affects Versions: 8.7
>            Reporter: Radu Gheorghe
>            Priority: Major
>              Labels: performance
>         Attachments: Screenshot 2020-11-19 at 12.01.55.png
>
>
> I'm running against the following scenario:
> * [JSON] faceting on a high cardinality field
> * few matching documents => few unique values
> Yet the query almost always takes a long time. Here's an example taking
> almost 4s for ~300 documents and unique values (edited a bit):
>
> {code:java}
> "QTime":3869,
> "params":{
> "json":"{\"query\": \"*:*\",
> \"filter\": [\"type:test_type\", \"date:[1603670360 TO 1604361599]\",
> \"unique_id:49866\"]
> \"facet\":
> {\"keywords\":{\"type\":\"terms\",\"field\":\"keywords\",\"limit\":20,\"mincount\":20}}}",
> "rows":"0"}},
>
> "response":{"numFound":333,"start":0,"maxScore":1.0,"numFoundExact":true,"docs":[]
> },
> "facets":{
> "count":333,
> "keywords":{
> "buckets":[{
> "val":"value1",
> "count":124},
> ...
> {code}
> I did some [profiling with our Sematext Monitoring|https://sematext.com/docs/monitoring/on-demand-profiling/] and it
> points me to OrdinalMap building (see attached screenshot). If I read the
> code right, an OrdinalMap is built with every facet. And it's expensive since
> there are many unique values in the shard (previously, there were more, smaller
> shards, making latency better, but this approach doesn't scale for this
> particular use-case).
> If I'm right up to this point, I see a couple of potential improvements,
> [inspired from Elasticsearch|#search-aggregations-bucket-terms-aggregation-execution-hint]:
> # *Keep the OrdinalMap cached until the next softCommit*, so that only the
> first query takes the penalty
> # *Allow faceting on actual values (a Map) rather than ordinals*, for
> situations like the one above where we have few matching documents. We could
> potentially auto-detect this scenario (e.g. by configuring a threshold) and
> use a Map when there are few documents
> I'm curious about what you're thinking:
> * would a PR/patch be welcome for either of the two ideas above?
> * do you see better options? am I missing something?
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org