Historically TPC-DS and TPC-H. There is certainly a chance of overfitting to one or two benchmarks. Note that those will probably be impacted more by the way we set the parameters for CBO than by using x or y for summary statistics.
On Monday, November 14, 2016, Shivaram Venkataraman <shiva...@eecs.berkeley.edu> wrote:
> Do we have any query workloads for which we can benchmark these
> proposals in terms of performance?
>
> Thanks
> Shivaram
>
> On Sun, Nov 13, 2016 at 5:53 PM, Reynold Xin <r...@databricks.com> wrote:
> > One additional note: in terms of size, the size of a count-min sketch with
> > eps = 0.1% and confidence 0.87, uncompressed, is 48k bytes.
> >
> > To look up what that means, see
> > http://spark.apache.org/docs/latest/api/java/org/apache/spark/util/sketch/CountMinSketch.html
> >
> > On Sun, Nov 13, 2016 at 5:30 PM, Reynold Xin <r...@databricks.com> wrote:
> >> I want to bring this discussion to the dev list to gather broader
> >> feedback, as there have been some discussions over multiple JIRA
> >> tickets (SPARK-16026, etc.) and GitHub pull requests about what
> >> statistics to collect and how to use them.
> >>
> >> There are some basic statistics on columns that are obvious to use and we
> >> don't need to debate these: estimated size (in bytes), row count, min, max,
> >> number of nulls, number of distinct values, average column length, max
> >> column length.
> >>
> >> In addition, we want to be able to estimate selectivity for equality and
> >> range predicates better, especially taking into account skewed values and
> >> outliers.
> >>
> >> Before I dive into the different options, let me first explain count-min
> >> sketch: count-min sketch is a common sketch algorithm that tracks frequency
> >> counts. It has the following nice properties:
> >> - sublinear space
> >> - can be generated in one pass in a streaming fashion
> >> - can be incrementally maintained (i.e. for appending new data)
> >> - it's already implemented in Spark
> >> - more accurate for frequent values, and less accurate for less-frequent
> >> values, i.e. it tracks skewed values well
> >> - easy to compute the inner product, i.e. trivial to compute the count-min
> >> sketch of a join given two count-min sketches of the join tables
> >>
> >> Proposal 1 is to use a combination of count-min sketch and equi-height
> >> histograms. In this case, the count-min sketch will be used for selectivity
> >> estimation on equality predicates, and the histogram will be used on range
> >> predicates.
> >>
> >> Proposal 2 is to just use a count-min sketch on equality predicates, and
> >> then a simple selected_range / (max - min) will be used for range predicates.
> >> This will be less accurate than using a histogram, but simpler because we
> >> don't need to collect histograms.
> >>
> >> Proposal 3 is a variant of proposal 2, and takes into account that skewed
> >> values can impact selectivity heavily. In 3, we track the list of heavy
> >> hitters (HH, most frequent items) along with a count-min sketch on the column.
> >> Then:
> >> - use the count-min sketch on equality predicates
> >> - for range predicates, estimatedFreq = sum(freq(HHInRange)) + range /
> >> (max - min)
> >>
> >> Proposal 4 is to not use any sketch, and use a histogram for high
> >> cardinality columns, and exact (value, frequency) pairs for low cardinality
> >> columns (e.g. num distinct values <= 255).
> >>
> >> Proposal 5 is a variant of proposal 4, and adapts it to track exact
> >> (value, frequency) pairs for the most frequent values only, so we can still
> >> have that for high cardinality columns. This is actually very similar to
> >> count-min sketch, but might use less space, although it requires two passes to
> >> compute the initial value, and makes it harder to compute the inner product
> >> for joins.
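For concreteness, here is a minimal sketch of the CountMinSketch API linked above, built through Spark's DataFrameStatFunctions with the same eps = 0.1% and confidence = 0.87 figures quoted. The table, column name, and seed are made up for illustration; this is a sketch of how one could probe the API, not part of any of the proposals themselves.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.util.sketch.CountMinSketch

    val spark = SparkSession.builder().appName("cms-selectivity-sketch").getOrCreate()

    // Hypothetical column to sketch; any integral or string column works.
    val df = spark.range(0L, 1000000L).toDF("id")

    // Build the sketch in one pass over the column (eps = 0.001 i.e. 0.1%, confidence = 0.87).
    val cms: CountMinSketch = df.stat.countMinSketch("id", eps = 0.001, confidence = 0.87, seed = 42)

    // Selectivity estimate for an equality predicate such as `id = 12345`.
    val estimatedRows: Long = cms.estimateCount(12345L)
    val selectivity: Double = estimatedRows.toDouble / cms.totalCount()

    // Incremental maintenance for appended data: sketch the new partition and merge it in.
    val appended = spark.range(1000000L, 1100000L).toDF("id")
    cms.mergeInPlace(appended.stat.countMinSketch("id", eps = 0.001, confidence = 0.87, seed = 42))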
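And a small, self-contained sketch of proposal 3's range estimate. Everything here is hypothetical scaffolding: the ColumnStats shape, the field names, and the decision to scale the uniform term by the non-heavy-hitter row count are assumptions (the formula in the email leaves that scaling implicit).

    // Hypothetical per-column statistics for illustrating proposal 3.
    case class ColumnStats(
        min: Double,
        max: Double,
        rowCount: Long,
        heavyHitters: Map[Double, Long]) // exact (value, frequency) pairs for the most frequent values

    // estimatedFreq = sum(freq(HH in range)) + (range width / (max - min)) * non-HH rows,
    // i.e. heavy hitters are counted exactly and the remaining rows are assumed uniform over [min, max].
    def estimateRangeFreq(stats: ColumnStats, lo: Double, hi: Double): Double = {
      val hhInRange = stats.heavyHitters.collect {
        case (value, freq) if value >= lo && value <= hi => freq
      }.sum
      val nonHHRows = stats.rowCount - stats.heavyHitters.values.sum
      val clippedWidth = math.max(0.0, math.min(hi, stats.max) - math.max(lo, stats.min))
      hhInRange + clippedWidth / (stats.max - stats.min) * nonHHRows
    }

Under that reading: with min = 0, max = 100, 1,000,000 rows, and a single heavy hitter 42.0 with 200,000 occurrences, the estimate for the range [40, 50] would be 200,000 + 0.1 * 800,000 = 280,000 rows.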