I want to bring this discussion to the dev list to gather broader feedback,
since the discussions about what statistics to collect and how to use them
have so far been spread across multiple JIRA tickets (SPARK-16026
<https://issues.apache.org/jira/browse/SPARK-16026>, etc) and GitHub pull
requests.

There are some basic statistics on columns that are obvious to use and we
don't need to debate these: estimated size (in bytes), row count, min, max,
number of nulls, number of distinct values, average column length, max
column length.
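
For reference, most of these can be computed in a single aggregation pass.
A rough sketch (the DataFrame df and column name are placeholders, and the
length-based stats assume a string column):

    import org.apache.spark.sql.functions._

    // One pass over a hypothetical DataFrame df and column "col".
    df.agg(
      count(lit(1)).as("rowCount"),
      min("col").as("min"),
      max("col").as("max"),
      sum(when(col("col").isNull, 1).otherwise(0)).as("numNulls"),
      approx_count_distinct("col").as("numDistinct"),
      avg(length(col("col"))).as("avgColLen"),
      max(length(col("col"))).as("maxColLen"))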

In addition, we want to be able to estimate selectivity for equality and
range predicates better, especially taking into account skewed values and
outliers.

Before I dive into the different options, let me first explain count-min
sketch: Count-min sketch is a common sketch algorithm that tracks frequency
counts. It has the following nice properties:
- sublinear space
- can be generated in one-pass in a streaming fashion
- can be incrementally maintained (i.e. for appending new data)
- it's already implemented in Spark
- more accurate for frequent values, and less accurate for less-frequent
values, i.e. it tracks skewed values well.
- easy to compute inner product, i.e. trivial to compute the count-min
sketch of a join given two count-min sketches of the join tables
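
To make this concrete, here is a minimal sketch of the existing
implementation in org.apache.spark.util.sketch (the eps/confidence/seed
values below are arbitrary, just for illustration):

    import org.apache.spark.util.sketch.CountMinSketch

    // Relative error eps = 1% of the total count, with 99% confidence.
    val cms = CountMinSketch.create(0.01, 0.99, 42)

    // One-pass, streaming updates.
    Seq("a", "b", "a", "c", "a").foreach(v => cms.add(v))
    cms.estimateCount("a")   // ~3; may overestimate, never underestimates

    // Incremental maintenance: sketch the appended data separately and merge
    // (requires the same eps/confidence/seed).
    val appended = CountMinSketch.create(0.01, 0.99, 42)
    appended.add("a", 10L)
    cms.mergeInPlace(appended)

There is also DataFrame-level support via df.stat.countMinSketch(colName,
eps, confidence, seed).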


Proposal 1 is to use a combination of count-min sketch and equi-height
histograms. In this case, the count-min sketch will be used for selectivity
estimation on equality predicates, and the histogram will be used on range
predicates.
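
For illustration, equality selectivity under proposal 1 could be derived
roughly like this (the helper and its parameters are hypothetical, not an
existing API):

    import org.apache.spark.util.sketch.CountMinSketch

    // col = value: look the value up in the column's sketch and normalize by
    // the row count; range predicates would instead sum (partial) bucket
    // fractions from the equi-height histogram.
    def equalitySelectivity(cms: CountMinSketch, value: Any, rowCount: Long): Double =
      cms.estimateCount(value).toDouble / rowCount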

Proposal 2 is to just use count-min sketch on equality predicates, and then
a simple selected_range / (max - min) estimate will be used for range
predicates. This will be less accurate than using a histogram, but simpler
because we don't need to collect histograms.
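
Concretely, the proposal-2 range estimate is just a uniform-distribution
assumption over [min, max] (hypothetical helper, numeric columns only):

    // selected_range / (max - min), with the predicate range clamped to the
    // column's [min, max] domain.
    def uniformRangeSelectivity(lo: Double, hi: Double, min: Double, max: Double): Double =
      math.max(0.0, math.min(hi, max) - math.max(lo, min)) / (max - min)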

Proposal 3 is a variant of proposal 2, and takes into account that skewed
values can impact selectivity heavily. In 3, we track the list of heavy
hitters (HH, most frequent items) along with count-min sketch on the
column. Then:
- use count-min sketch on equality predicates
- for range predicates, estimatedFreq = sum(freq(HHInRange)) +
(selected_range / (max - min)) * freq(remaining non-HH values), i.e. heavy
hitters that fall in the range are counted exactly, and the non-heavy-hitter
values are assumed to be uniformly distributed between min and max
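
A rough sketch of that range estimate, assuming a hypothetical heavyHitters
map of exact counts plus the min/max/rowCount statistics from above:

    // Heavy hitters that fall in the range are counted exactly; the remaining
    // (non-heavy-hitter) rows are assumed uniform over [min, max].
    def rangeFrequency(heavyHitters: Map[Double, Long], rowCount: Long,
                       lo: Double, hi: Double, min: Double, max: Double): Double = {
      val hhInRange = heavyHitters.collect { case (v, f) if v >= lo && v <= hi => f }.sum
      val fraction  = math.max(0.0, math.min(hi, max) - math.max(lo, min)) / (max - min)
      hhInRange + fraction * (rowCount - heavyHitters.values.sum)
    }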

Proposal 4 is to not use any sketch, and instead use a histogram for high
cardinality columns, and exact (value, frequency) pairs for low cardinality
columns (e.g. number of distinct values <= 255).
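
A sketch of how that cutoff might look (df, the column name, and the 255
threshold are illustrative):

    import org.apache.spark.sql.functions._

    val ndv = df.agg(approx_count_distinct("col")).head.getLong(0)
    if (ndv <= 255) {
      // Low cardinality: collect exact (value, frequency) pairs.
      val pairs = df.groupBy("col").count().collect()
    } else {
      // High cardinality: build an equi-height histogram instead (not shown).
    }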

Proposal 5 is a variant of proposal 4, and adapts it to track exact (value,
frequency) pairs for only the most frequent values, so we can still have
that for high cardinality columns. This is actually very similar to
count-min sketch, but might use less space; however, it requires two passes
to compute the initial value, and it is more difficult to compute the inner
product for joins.
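
For comparison, the frequent-values part of proposal 5 could be collected
with a plain aggregation (k = 256 and the column name are arbitrary here):

    import org.apache.spark.sql.functions._

    // Keep exact counts only for the k most frequent values. Unlike the
    // count-min sketch, this needs a full aggregation up front and has to be
    // recomputed (or approximated) when new data is appended.
    val topValues = df.groupBy("col").count()
                      .orderBy(desc("count"))
                      .limit(256)
                      .collect()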
