> Performance and resource usage are still affected by 30M unique values of T,
> right?
Yes. The main performance issue would be the per-request allocation of a
30M-element `long[]` for "dv" or "uif" methods (which are by far the most
common methods in practice). With low enough request volume and large
enough heap you might not actually perceive a difference in performance;
but if you encounter problems for the use case you describe, this array
allocation would likely be the cause. (Also note that the relevant field
cardinality is the _per-shard_ cardinality, so in a multi-shard collection
the allocated arrays may be somewhat smaller than the overall field
cardinality would suggest.)
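
To put a rough number on that allocation, here is a back-of-the-envelope
sketch. It assumes 8 bytes per `long` slot and ignores the array header and
any other per-request structures; the function names are made up for
illustration, and the 10M per-shard figure is a hypothetical value, not
something measured on a real collection.

```python
BYTES_PER_LONG = 8  # a Java long is 64 bits


def count_array_bytes(per_shard_cardinality: int) -> int:
    """Approximate payload size of a long[] with one slot per unique value."""
    return per_shard_cardinality * BYTES_PER_LONG


def to_mib(num_bytes: int) -> float:
    """Convert a byte count to mebibytes."""
    return num_bytes / (1024 * 1024)


# Single shard: the shard sees all 30M unique values of T.
single_shard = count_array_bytes(30_000_000)
print(f"single shard: {to_mib(single_shard):.0f} MiB per request")

# Hypothetical multi-shard case: assume one shard sees 10M unique values.
one_of_many = count_array_bytes(10_000_000)
print(f"hypothetical 10M-value shard: {to_mib(one_of_many):.0f} MiB per request")
```

Multiply the per-request figure by the number of concurrent facet requests to
get a feel for the heap pressure involved.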

I'm reasonably sure that "dvhash" is _not_ auto-picked by "smart" at the
moment, but rather must be specified explicitly:
https://github.com/apache/lucene-solr/blob/6ff4a9b395a68d9b0d9e259537e3f5daf0278d51/solr/core/src/java/org/apache/solr/search/facet/FacetField.java#L124-L128

The code snippet above indicates some other restrictions that you're
probably already aware of (doesn't work with prefixes or mincount==0, or
for multi-valued or numeric types); otherwise, though (for a non-numeric,
single-valued field), I think the situation you describe (high-cardinality
field, known low cardinality for the particular domain) sounds like a
perfect use case for dvhash.
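
As a minimal sketch of what "specified explicitly" could look like, the
snippet below builds a JSON request body with `"method": "dvhash"` on the
terms facet. The facet name `t_values` and the helper function are
hypothetical; the field `T`, the limit of 15, and the query string are taken
from your example. It only constructs and prints the body, so sending it to
your collection's query endpoint is left to you.

```python
import json


def build_dvhash_request(query: str, field: str, limit: int = 15) -> dict:
    """Build a JSON request body with a terms facet that asks for dvhash."""
    return {
        "query": query,
        "facet": {
            # "t_values" is a hypothetical facet name chosen for this sketch.
            "t_values": {
                "type": "terms",
                "field": field,
                "limit": limit,
                # Requested explicitly, since "smart" is not expected to pick it.
                "method": "dvhash",
            }
        },
    }


body = build_dvhash_request("some query", "T")
print(json.dumps(body, indent=2))
```

Note that the sketch sets neither a prefix nor `mincount` of 0, in line with
the restrictions mentioned above.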

Michael

On Fri, Feb 5, 2021 at 11:56 AM ufuk yılmaz <uyil...@vivaldi.net.invalid>
wrote:

> Hello,
>
> I’m using Solr 8.4. Very excited about performance improvements in 8.8:
> http://joelsolr.blogspot.com/2021/01/optimizations-coming-to-solr.html
>
> As I understand it, the main determinant of performance and RAM usage of a
> terms facet is the cardinality of the field in the whole collection, not the
> cardinality of the field in the query result.
>
> I have a collection with 100M docs. The T field has 30M unique values in
> the entire collection, but my query result returns only docs with 2
> different T values:
>
> {
>         "q": "some query", // whose result has only 2 different T values
>         "facet": {
>                 "type": "terms",
>                 "field": "T",
>                 "limit": 15
>         }
> }
>
> Performance and resource usage are still affected by 30M unique values of T,
> right?
>
> If this is correct, can "method": "dvhash" help in this case, and how?
> If so, does the default method "smart" take this into account and use
> dvhash, so that I don't need to set it explicitly?
>
> Have a nice weekend,
> ~ufuk
>
