I want to bring this discussion to the dev list to gather broader feedback, as there have been some discussions that happened over multiple JIRA tickets (SPARK-16026 <https://issues.apache.org/jira/browse/SPARK-16026>, etc) and GitHub pull requests about what statistics to collect and how to use them.
There are some basic statistics on columns that are obvious to use and we don't need to debate these: estimated size (in bytes), row count, min, max, number of nulls, number of distinct values, average column length, max column length. In addition, we want to be able to estimate selectivity for equality and range predicates better, especially taking into account skewed values and outliers. Before I dive into the different options, let me first explain count-min sketch: Count-min sketch is a common sketch algorithm that tracks frequency counts. It has the following nice properties: - sublinear space - can be generated in one-pass in a streaming fashion - can be incrementally maintained (i.e. for appending new data) - it's already implemented in Spark - more accurate for frequent values, and less accurate for less-frequent values, i.e. it tracks skewed values well. - easy to compute inner product, i.e. trivial to compute the count-min sketch of a join given two count-min sketches of the join tables Proposal 1 is is to use a combination of count-min sketch and equi-height histograms. In this case, count-min sketch will be used for selectivity estimation on equality predicates, and histogram will be used on range predicates. Proposal 2 is to just use count-min sketch on equality predicates, and then simple selected_range / (max - min) will be used for range predicates. This will be less accurate than using histogram, but simpler because we don't need to collect histograms. Proposal 3 is a variant of proposal 2, and takes into account that skewed values can impact selectivity heavily. In 3, we track the list of heavy hitters (HH, most frequent items) along with count-min sketch on the column. Then: - use count-min sketch on equality predicates - for range predicates, estimatedFreq = sum(freq(HHInRange)) + range / (max - min) Proposal 4 is to not use any sketch, and use histogram for high cardinality columns, and exact (value, frequency) pairs for low cardinality columns (e.g. num distinct value <= 255). Proposal 5 is a variant of proposal 4, and adapts it to track exact (value, frequency) pairs for the most frequent values only, so we can still have that for high cardinality columns. This is actually very similar to count-min sketch, but might use less space, although requiring two passes to compute the initial value, and more difficult to compute the inner product for joins.