RE: statistics collection and propagation for cost-based optimizer

2016-11-14 Thread assaf.mendelson
I am not sure I understand when the statistics would be calculated. Would they
always be calculated, or only when ANALYZE is called?
Would it be possible to save the analysis results as part of saving the
dataframe (e.g. when writing it to Parquet), or do we have to have a working
Hive installation?
Would it be possible to provide the hints manually? For example, in streaming,
if I know the data at the beginning is not representative of the entire
stream?
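
For context, a minimal sketch of what explicitly triggered statistics
collection looks like today in a spark-shell. The table and column names are
made up, exact behavior varies by Spark version, and the column-level syntax
in the last comment is only what the design doc proposes, not a final API:

  // Illustrative only: a tiny table registered in the catalog.
  val df = spark.range(0, 1000).selectExpr("id", "id % 7 AS bucket")
  df.write.mode("overwrite").saveAsTable("events")

  // Table-level statistics; NOSCAN only records the size in bytes and skips
  // the full scan.
  spark.sql("ANALYZE TABLE events COMPUTE STATISTICS")
  spark.sql("ANALYZE TABLE events COMPUTE STATISTICS NOSCAN")

  // Column-level statistics are what this thread is about; the proposal is
  // something along these lines, but the syntax is not final:
  //   ANALYZE TABLE events COMPUTE STATISTICS FOR COLUMNS id, bucket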

From: rxin [via Apache Spark Developers List]
Sent: Tuesday, November 15, 2016 8:48 AM
To: Mendelson, Assaf
Subject: Re: statistics collection and propagation for cost-based optimizer

They are not yet complete. The benchmark was done with an implementation of a
cost-based optimizer that Huawei had internally for Spark 1.5 (or some even
older version).

On Mon, Nov 14, 2016 at 10:46 PM, Yogesh Mahajan wrote:
It looks like the Huawei team has run the TPC-H benchmark and some real-world
test cases, and their results show a good performance gain of 2x-5x speedup
depending on data volume.
Can we share the numbers and the query-wise rationale behind the gain? Is
there anything done on Spark master yet, or is the implementation not yet
complete?

Thanks,
Yogesh Mahajan
http://www.snappydata.io/blog

Re: statistics collection and propagation for cost-based optimizer

2016-11-14 Thread Reynold Xin
They are not yet complete. The benchmark was done with an implementation of a
cost-based optimizer that Huawei had internally for Spark 1.5 (or some even
older version).

On Mon, Nov 14, 2016 at 10:46 PM, Yogesh Mahajan wrote:

> It looks like the Huawei team has run the TPC-H benchmark and some
> real-world test cases, and their results show a good performance gain of
> 2x-5x speedup depending on data volume.
> Can we share the numbers and the query-wise rationale behind the gain? Is
> there anything done on Spark master yet, or is the implementation not yet
> complete?
>
> Thanks,
> Yogesh Mahajan
> http://www.snappydata.io/blog 

Re: statistics collection and propagation for cost-based optimizer

2016-11-14 Thread Yogesh Mahajan
Thanks Reynold for the detailed proposals. A few questions/clarifications -

1. How will the existing rule-based optimizations co-exist with the CBO? The
existing rules are heuristic/empirical; I am assuming rules like predicate
pushdown or project pruning will co-exist with the CBO, and we just want to
estimate the filter factor and cardinality more accurately? With predicate
pushdown, a filter is usually executed at an early stage of the query plan,
and a good cardinality estimate for that predicate improves the precision of
the cardinality estimates further up the plan.

2. Will the query transformations now be based on the cost calculation? If
yes, what happens when the estimated cost of the transformed statement is
higher than the cost of the untransformed query?

3. Is there an upper limit on the space used to store the frequency
histogram, e.g. 255 entries? For columns with more distinct values, we could
even consider a height-balanced histogram, as Oracle does.

4. The first three proposals are new and not mentioned in the CBO design
spec. A count-min sketch (CMS) is good, but it is less accurate than
traditional histograms. This is a major trade-off we need to consider (see
the sketch after this list).

5. Are we going to consider system statistics, such as CPU speed or disk
access cost, in the cost function? How about shuffle cost, output
partitioning, etc.?

6. Like the current rule-based optimizer, will this CBO also be an
'extensible optimizer'? If yes, what functionality can users extend?

7. Why will the CBO be disabled by default? Is "spark.sql.cbo" false by
default because it is still experimental?

8. ANALYZE TABLE, analyzeColumns, etc. all look good.

9. From a release point of view, how is this planned? Will it all be
implemented in one go or in phases?
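
Regarding the trade-off in (4), here is a small, self-contained sketch
against the CountMinSketch class that already ships with Spark; the data and
parameters are made up for illustration:

  import org.apache.spark.util.sketch.CountMinSketch

  // Skewed column: one very frequent value (42) plus many values seen once.
  val values: Seq[Long] = Seq.fill(100000)(42L) ++ (1L to 10000L)

  val cms = CountMinSketch.create(0.001, 0.87, 12345)  // eps, confidence, seed
  values.foreach(v => cms.add(v))

  // Frequent value: the estimate is tight (a CMS never under-counts).
  println(s"freq(42): exact = 100001, cms = ${cms.estimateCount(42L)}")

  // Rare value: the estimate can be noticeably too high, which is the
  // accuracy gap versus an exact frequency table or a histogram.
  println(s"freq(7):  exact = 1, cms = ${cms.estimateCount(7L)}")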

Thanks,
Yogesh Mahajan
http://www.snappydata.io/blog 


Re: statistics collection and propagation for cost-based optimizer

2016-11-14 Thread Reynold Xin
Historically, TPC-DS and TPC-H. There is certainly a chance of overfitting
to one or two benchmarks. Note that those results will probably be impacted
more by how we set the parameters for the CBO than by whether we use x or y
for the summary statistics.

On Monday, November 14, 2016, Shivaram Venkataraman <
shiva...@eecs.berkeley.edu> wrote:

> Do we have any query workloads for which we can benchmark these
> proposals in terms of performance ?
>
> Thanks
> Shivaram

Re: statistics collection and propagation for cost-based optimizer

2016-11-14 Thread Shivaram Venkataraman
Do we have any query workloads against which we can benchmark these
proposals in terms of performance?

Thanks
Shivaram


Re: statistics collection and propagation for cost-based optimizer

2016-11-13 Thread Reynold Xin
One additional note: in terms of size, a count-min sketch with eps = 0.1%
and confidence 0.87 is 48,000 bytes, uncompressed.

To look up what that means, see
http://spark.apache.org/docs/latest/api/java/org/apache/spark/util/sketch/CountMinSketch.html
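
A quick way to sanity-check that figure against the implementation already
in Spark (a rough sketch; needs the spark-sketch module on the classpath,
and the seed is arbitrary):

  import java.io.ByteArrayOutputStream
  import org.apache.spark.util.sketch.CountMinSketch

  // eps = 0.1% and confidence 0.87 give width = ceil(2 / 0.001) = 2000 and
  // depth = ceil(-ln(1 - 0.87) / ln 2) = 3, i.e. 3 * 2000 longs = 48,000
  // bytes for the counter table alone.
  val cms = CountMinSketch.create(0.001, 0.87, 42)
  val out = new ByteArrayOutputStream()
  cms.writeTo(out)
  println(out.size())  // roughly 48 KB uncompressed, plus a small header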






statistics collection and propagation for cost-based optimizer

2016-11-13 Thread Reynold Xin
I want to bring this discussion to the dev list to gather broader feedback,
as there have been some discussions that happened over multiple JIRA tickets
(SPARK-16026, etc.) and GitHub pull requests about what statistics to collect
and how to use them.

There are some basic statistics on columns that are obvious to use and we
don't need to debate these: estimated size (in bytes), row count, min, max,
number of nulls, number of distinct values, average column length, max
column length.
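
For concreteness, most of these basics can be computed in a single
aggregation pass; a rough sketch, assuming a DataFrame df with a column "c"
(an exact distinct count is used here, though an approximate one would be
cheaper):

  import org.apache.spark.sql.functions._

  // One pass over the assumed column "c".
  val basics = df.agg(
    count(lit(1)).as("row_count"),
    min(col("c")).as("min"),
    max(col("c")).as("max"),
    sum(when(col("c").isNull, 1).otherwise(0)).as("num_nulls"),
    countDistinct(col("c")).as("num_distinct"),
    avg(length(col("c").cast("string"))).as("avg_col_len"),
    max(length(col("c").cast("string"))).as("max_col_len"))
  basics.show()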

In addition, we want to be able to estimate selectivity for equality and
range predicates better, especially taking into account skewed values and
outliers.

Before I dive into the different options, let me first explain count-min
sketch: Count-min sketch is a common sketch algorithm that tracks frequency
counts. It has the following nice properties:
- sublinear space
- can be generated in one pass, in a streaming fashion
- can be incrementally maintained (i.e. for appending new data)
- it's already implemented in Spark
- more accurate for frequent values, and less accurate for less-frequent
values, i.e. it tracks skewed values well.
- easy to compute inner product, i.e. trivial to compute the count-min
sketch of a join given two count-min sketches of the join tables
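
A sketch of what the one-pass construction and incremental maintenance could
look like on top of the existing DataFrameStatFunctions.countMinSketch; the
DataFrames day1Df/day2Df and the column name are made up, and user_id is
assumed to be a long column:

  // Build a sketch over a column in one pass (eps, confidence, seed).
  val cmsDay1 = day1Df.stat.countMinSketch("user_id", 0.001, 0.87, 42)

  // Equality-predicate selectivity for user_id = 12345.
  val estMatches  = cmsDay1.estimateCount(12345L)
  val selectivity = estMatches.toDouble / cmsDay1.totalCount()

  // Incremental maintenance on append: sketch only the new data and merge.
  // Both sketches must be built with the same parameters and seed.
  val cmsDay2 = day2Df.stat.countMinSketch("user_id", 0.001, 0.87, 42)
  cmsDay1.mergeInPlace(cmsDay2)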


Proposal 1 is to use a combination of count-min sketch and equi-height
histograms. In this case, the count-min sketch will be used for selectivity
estimation on equality predicates, and the histogram will be used on range
predicates.
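
For the equi-height histogram half of this proposal, the bucket boundaries
could be derived from the approximate quantiles Spark already computes; a
sketch, assuming a DataFrame df with a numeric column "c":

  // 10 equi-height buckets: each boundary is a quantile, so every bucket
  // holds roughly the same number of rows.
  val numBuckets    = 10
  val probabilities = (0 to numBuckets).map(_.toDouble / numBuckets).toArray
  val boundaries    = df.stat.approxQuantile("c", probabilities, 0.001)

  // A range predicate c BETWEEN lo AND hi is then estimated from how many
  // buckets (and fractions of the boundary buckets) the range covers.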

Proposal 2 is to just use a count-min sketch on equality predicates, and then
a simple selected_range / (max - min) ratio will be used for range predicates.
This will be less accurate than using a histogram, but simpler because we
don't need to collect histograms.

Proposal 3 is a variant of proposal 2, and takes into account that skewed
values can impact selectivity heavily. In 3, we track the list of heavy
hitters (HH, the most frequent items) along with a count-min sketch on the
column. Then:
- use the count-min sketch on equality predicates
- for range predicates, estimatedFreq = sum(freq(HHInRange)) + range /
(max - min)
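
A worked version of the proposal 2 / proposal 3 arithmetic; the numbers are
made up, and the range / (max - min) term is interpreted here as a fraction
applied to the remaining, non-heavy-hitter rows:

  // Assumed column statistics.
  val colMin       = 0.0
  val colMax       = 1000.0
  val rowCount     = 1000000L
  val heavyHitters = Map(500.0 -> 300000L, 777.0 -> 150000L)  // value -> freq

  // Range predicate: c BETWEEN 400 AND 600.
  val (lo, hi) = (400.0, 600.0)

  // Proposal 2: assume a uniform distribution between min and max.
  val uniformFraction = (hi - lo) / (colMax - colMin)  // 0.2

  // Proposal 3: count heavy hitters in the range exactly, and apply the
  // uniform fraction only to the remaining rows.
  val hhFreqInRange =
    heavyHitters.collect { case (v, f) if v >= lo && v <= hi => f }.sum
  val restFreq      = (rowCount - heavyHitters.values.sum) * uniformFraction
  val estimatedFreq = hhFreqInRange + restFreq  // 300000 + 110000 = 410000
  val selectivity   = estimatedFreq / rowCount  // 0.41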

Proposal 4 is to not use any sketch, and instead use a histogram for high
cardinality columns, and exact (value, frequency) pairs for low cardinality
columns (e.g. num distinct values <= 255).
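
A sketch of the low-cardinality branch; the DataFrame and column name are
assumed, and the 255 cutoff is the one from the proposal:

  // Exact (value, frequency) pairs only when the column is small enough.
  val ndv = df.select("c").distinct().count()

  val exactFrequencies: Option[Map[Any, Long]] =
    if (ndv <= 255) {
      Some(df.groupBy("c").count().collect()
             .map(row => row.get(0) -> row.getLong(1)).toMap)
    } else {
      None  // high cardinality: fall back to a histogram (or a sketch)
    }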

Proposal 5 is a variant of proposal 4, and adapts it to track exact (value,
frequency) pairs for the most frequent values only, so we can still have
those for high cardinality columns. This is actually very similar to a
count-min sketch, but might use less space, although it requires two passes
to compute the initial values and makes it harder to compute the inner
product for joins.