On 2/20/2018 4:44 AM, Alfonso Muñoz-Pomer Fuentes wrote:
We have a query that we can resolve using either facet or search with rollup. 
In the Stream Source Reference section of Solr’s Reference Guide 
(https://lucene.apache.org/solr/guide/7_1/stream-source-reference.html#facet) 
it says “To support high cardinality aggregations see the rollup function”. I 
was wondering what is considered “high cardinality”. If it helps, our query 
returns up to 60k results. I haven’t gotten around to doing any benchmarking to 
see if there’s any difference, though, because facet has performed very well so 
far, but I don’t know if I’m near the “tipping point”. Any feedback would be appreciated.

There's no hard and fast rule for this.  The tipping point is going to be different for every use case.  With a little bit of information about your setup, experienced users can make an educated guess about whether or not performance will be good, but cannot say with absolute certainty what you're going to run into.

Let's start with some definitions, which you may or may not already know:

https://en.wikipedia.org/wiki/Cardinality_(data_modeling)
https://en.wikipedia.org/wiki/Cardinality

You haven't said how many unique values are in your field.  The only information I have from you is 60K results from your queries, which may or may not have any bearing on the total number of documents in your index, or the total number of unique values in the field you're using for faceting.  So the next paragraph may or may not apply to your index.

In general, 60,000 unique values in a field would be considered very low cardinality, because computers can typically operate on 60,000 values *very* quickly, unless the size of each value is enormous.  But if the index only has 60,000 total documents and the field has 60,000 unique values, then *relative to the size of the index* the cardinality is very high, even though most people would call the absolute number low.  Either way, sixty thousand documents or unique values is almost always a very small index, not prone to performance issues.
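To make the absolute-vs-relative distinction concrete, here's a small standalone Python sketch (illustration only, not Solr code) that computes both numbers for a made-up field:

```python
# Toy illustration: absolute vs. relative cardinality of a field.
import random

random.seed(42)

# Hypothetical index: 60,000 documents, each with one value in a
# "category" field drawn from 500 possible values.
docs = [f"cat_{random.randrange(500)}" for _ in range(60_000)]

absolute_cardinality = len(set(docs))                     # unique values in the field
relative_cardinality = absolute_cardinality / len(docs)   # unique values per document

print(absolute_cardinality)            # 500 -> low absolute cardinality
print(f"{relative_cardinality:.4f}")   # ~0.0083 -> also low relative to 60k docs
```

If instead every document held a unique ID-like value, absolute cardinality would be 60,000 and relative cardinality would be 1.0 -- low in absolute terms for a computer, but the highest possible relative to the index.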

The warnings about cardinality in the Solr documentation mostly refer to *absolute* cardinality -- how many unique values there are in a field, regardless of the total number of documents.  If there are millions or billions of unique values, then operations like faceting, grouping, and sorting are probably going to be slow.  If there are far fewer, such as thousands or only a handful, those operations are likely to be very fast, because the computer has less information to process.
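Mechanically, a facet is just a count per unique value of the field, so the memory and sorting work scale with the number of buckets -- the field's cardinality -- not only with the number of matching documents.  A minimal Python sketch of that kind of aggregation (illustration only, not how Solr actually implements it):

```python
# Facet-style aggregation: one bucket per unique field value.
from collections import Counter

# Hypothetical field values taken from the matching documents.
values = ["red", "blue", "red", "green", "blue", "red"]

facet_counts = Counter(values)

# The bucket table has one entry per unique value; with billions of
# unique values, building and sorting it is what gets expensive.
print(facet_counts.most_common())
# -> [('red', 3), ('blue', 2), ('green', 1)]
```

With three unique values the bucket table is trivial; with billions of unique IDs the same aggregation would need a bucket per ID, which is the case the documentation steers toward rollup.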

Thanks,
Shawn