Historically TPC-DS and TPC-H. There is certainly a chance of overfitting to one or two benchmarks. Note that those will probably be impacted more by the way we set the parameters for CBO than by using x or y for summary statistics.
On Monday, November 14, 2016, Shivaram Venkataraman <shiva...@eecs.berkeley.edu> wrote:
> Do we have any query workloads for which we can benchmark these
> proposals in terms of performance?
>
> Thanks
> Shivaram
>
> On Sun, Nov 13, 2016 at 5:53 PM, Reynold Xin <r...@databricks.com> wrote:
> > One additional note: in terms of size, the size of a count-min sketch with
> > eps = 0.1% and confidence 0.87, uncompressed, is 48k bytes.
> >
> > To look up what that means, see
> > http://spark.apache.org/docs/latest/api/java/org/apache/spark/util/sketch/CountMinSketch.html
> >
> > On Sun, Nov 13, 2016 at 5:30 PM, Reynold Xin <r...@databricks.com> wrote:
> >> I want to bring this discussion to the dev list to gather broader
> >> feedback, as there have been some discussions over multiple JIRA
> >> tickets (SPARK-16026, etc.) and GitHub pull requests about what
> >> statistics to collect and how to use them.
> >>
> >> There are some basic statistics on columns that are obvious to use and we
> >> don't need to debate these: estimated size (in bytes), row count, min, max,
> >> number of nulls, number of distinct values, average column length, max
> >> column length.
> >>
> >> In addition, we want to be able to estimate selectivity for equality and
> >> range predicates better, especially taking into account skewed values and
> >> outliers.
> >>
> >> Before I dive into the different options, let me first explain count-min
> >> sketch: count-min sketch is a common sketch algorithm that tracks frequency
> >> counts. It has the following nice properties:
> >> - sublinear space
> >> - can be generated in one pass in a streaming fashion
> >> - can be incrementally maintained (i.e. for appending new data)
> >> - it's already implemented in Spark
> >> - more accurate for frequent values, and less accurate for less-frequent
> >> values, i.e. it tracks skewed values well
> >> - easy to compute the inner product, i.e. trivial to compute the count-min
> >> sketch of a join given two count-min sketches of the join tables
> >>
> >> Proposal 1 is to use a combination of count-min sketch and equi-height
> >> histograms. In this case, the count-min sketch will be used for selectivity
> >> estimation on equality predicates, and the histogram will be used on range
> >> predicates.
> >>
> >> Proposal 2 is to just use a count-min sketch on equality predicates, and
> >> then a simple selected_range / (max - min) will be used for range predicates.
> >> This will be less accurate than using a histogram, but simpler because we
> >> don't need to collect histograms.
> >>
> >> Proposal 3 is a variant of proposal 2, and takes into account that skewed
> >> values can impact selectivity heavily. In 3, we track the list of heavy
> >> hitters (HH, most frequent items) along with a count-min sketch on the column.
> >> Then:
> >> - use the count-min sketch on equality predicates
> >> - for range predicates, estimatedFreq = sum(freq(HHInRange)) + range /
> >> (max - min)
> >>
> >> Proposal 4 is to not use any sketch, and use a histogram for high
> >> cardinality columns, and exact (value, frequency) pairs for low cardinality
> >> columns (e.g. num distinct values <= 255).
> >>
> >> Proposal 5 is a variant of proposal 4, and adapts it to track exact
> >> (value, frequency) pairs for the most frequent values only, so we can still
> >> have that for high cardinality columns. This is actually very similar to
> >> count-min sketch, but might use less space, although it requires two passes to
> >> compute the initial value, and makes it harder to compute the inner product
> >> for joins.
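For concreteness, here is a minimal sketch of the CountMinSketch API linked above, built through Spark's DataFrameStatFunctions with the same eps = 0.1% and confidence = 0.87 figures quoted. The table, column name, and seed are made up for illustration; this is a sketch of how one could probe the API, not part of any of the proposals themselves.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.util.sketch.CountMinSketch

    val spark = SparkSession.builder().appName("cms-selectivity-sketch").getOrCreate()

    // Hypothetical column to sketch; any integral or string column works.
    val df = spark.range(0L, 1000000L).toDF("id")

    // Build the sketch in one pass over the column (eps = 0.001 i.e. 0.1%, confidence = 0.87).
    val cms: CountMinSketch = df.stat.countMinSketch("id", eps = 0.001, confidence = 0.87, seed = 42)

    // Selectivity estimate for an equality predicate such as `id = 12345`.
    val estimatedRows: Long = cms.estimateCount(12345L)
    val selectivity: Double = estimatedRows.toDouble / cms.totalCount()

    // Incremental maintenance for appended data: sketch the new partition and merge it in.
    val appended = spark.range(1000000L, 1100000L).toDF("id")
    cms.mergeInPlace(appended.stat.countMinSketch("id", eps = 0.001, confidence = 0.87, seed = 42))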
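And a small, self-contained sketch of proposal 3's range estimate. Everything here is hypothetical scaffolding: the ColumnStats shape, the field names, and the decision to scale the uniform term by the non-heavy-hitter row count are assumptions (the formula in the email leaves that scaling implicit).

    // Hypothetical per-column statistics for illustrating proposal 3.
    case class ColumnStats(
        min: Double,
        max: Double,
        rowCount: Long,
        heavyHitters: Map[Double, Long]) // exact (value, frequency) pairs for the most frequent values

    // estimatedFreq = sum(freq(HH in range)) + (range width / (max - min)) * non-HH rows,
    // i.e. heavy hitters are counted exactly and the remaining rows are assumed uniform over [min, max].
    def estimateRangeFreq(stats: ColumnStats, lo: Double, hi: Double): Double = {
      val hhInRange = stats.heavyHitters.collect {
        case (value, freq) if value >= lo && value <= hi => freq
      }.sum
      val nonHHRows = stats.rowCount - stats.heavyHitters.values.sum
      val clippedWidth = math.max(0.0, math.min(hi, stats.max) - math.max(lo, stats.min))
      hhInRange + clippedWidth / (stats.max - stats.min) * nonHHRows
    }

Under that reading: with min = 0, max = 100, 1,000,000 rows, and a single heavy hitter 42.0 with 200,000 occurrences, the estimate for the range [40, 50] would be 200,000 + 0.1 * 800,000 = 280,000 rows.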