They are not yet complete. The benchmark was done with an implementation of a cost-based optimizer that Huawei had internally for Spark 1.5 (or an even older version).
On Mon, Nov 14, 2016 at 10:46 PM, Yogesh Mahajan <ymaha...@snappydata.io> wrote:

> It looks like the Huawei team has run the TPC-H benchmark and some
> real-world test cases, and their results show a good performance gain of
> 2x-5x speedup depending on data volume.
> Can we share the numbers and the query-wise rationale behind the gain? Is
> anything done on Spark master yet? Or is the implementation not yet
> complete?
>
> Thanks,
> Yogesh Mahajan
> http://www.snappydata.io/blog
>
> On Tue, Nov 15, 2016 at 12:03 PM, Yogesh Mahajan <ymaha...@snappydata.io> wrote:
>
>> Thanks Reynold for the detailed proposals. A few questions/clarifications:
>>
>> 1. How will the existing rule-based optimizer co-exist with CBO? The
>> existing rules are heuristic/empirical; I am assuming rules like
>> predicate pushdown or project pruning will co-exist with CBO, and we
>> just want to estimate the filter factor and cardinality more accurately.
>> With predicate pushdown, a filter is mostly executed at an early stage
>> of a query plan, and the cardinality estimate of a predicate can improve
>> the precision of cardinality estimates further up the plan.
>>
>> 2. Will query transformations now be based on the cost calculation? If
>> yes, what happens when the cost of executing the transformed statement
>> is higher than the cost of the untransformed query?
>>
>> 3. Is there an upper limit on the space used for storing the frequency
>> histogram? 255 entries? For columns with more distinct values, we could
>> even consider a height-balanced histogram, as in Oracle.
>>
>> 4. The first three proposals are new and not mentioned in the CBO design
>> spec. CMS is good, but it is less accurate than traditional histograms.
>> This is a major trade-off we need to consider.
>>
>> 5. Are we going to consider system statistics, such as CPU speed or disk
>> access speed, in the cost function? How about shuffle cost, output
>> partitioning, etc.?
>>
>> 6. Like the current rule-based optimizer, will this CBO also be an
>> 'extensible optimizer'? If yes, what functionality can users extend?
>>
>> 7. Why will this CBO be disabled by default? Is "spark.sql.cbo" false by
>> default because it is still experimental?
>>
>> 8. ANALYZE TABLE, analyzeColumns, etc. all look good.
>>
>> 9. From a release point of view, how is this planned? Will all of this
>> be implemented in one go or in phases?
>>
>> Thanks,
>> Yogesh Mahajan
>> http://www.snappydata.io/blog
>>
>> On Mon, Nov 14, 2016 at 11:25 PM, Reynold Xin <r...@databricks.com> wrote:
>>
>>> Historically TPC-DS and TPC-H. There is certainly a chance of
>>> overfitting to one or two benchmarks. Note that those will probably be
>>> impacted more by the way we set the parameters for CBO than by using x
>>> or y for the summary statistics.
>>>
>>> On Monday, November 14, 2016, Shivaram Venkataraman
>>> <shiva...@eecs.berkeley.edu> wrote:
>>>
>>>> Do we have any query workloads for which we can benchmark these
>>>> proposals in terms of performance?
>>>>
>>>> Thanks
>>>> Shivaram
>>>>
>>>> On Sun, Nov 13, 2016 at 5:53 PM, Reynold Xin <r...@databricks.com> wrote:
>>>> > One additional note: in terms of size, the size of a count-min
>>>> > sketch with eps = 0.1% and confidence 0.87, uncompressed, is 48k
>>>> > bytes.
>>>> >
>>>> > To look up what that means, see
>>>> > http://spark.apache.org/docs/latest/api/java/org/apache/spark/util/sketch/CountMinSketch.html
>>>> >
>>>> > On Sun, Nov 13, 2016 at 5:30 PM, Reynold Xin <r...@databricks.com> wrote:
>>>> >>
>>>> >> I want to bring this discussion to the dev list to gather broader
>>>> >> feedback, as there have been some discussions over multiple JIRA
>>>> >> tickets (SPARK-16026, etc.) and GitHub pull requests about what
>>>> >> statistics to collect and how to use them.
>>>> >>
>>>> >> There are some basic statistics on columns that are obvious to use,
>>>> >> and we don't need to debate these: estimated size (in bytes), row
>>>> >> count, min, max, number of nulls, number of distinct values,
>>>> >> average column length, max column length.
>>>> >>
>>>> >> In addition, we want to estimate selectivity for equality and range
>>>> >> predicates better, especially taking into account skewed values and
>>>> >> outliers.
>>>> >>
>>>> >> Before I dive into the different options, let me first explain
>>>> >> count-min sketch: count-min sketch is a common sketch algorithm
>>>> >> that tracks frequency counts. It has the following nice properties:
>>>> >> - sublinear space
>>>> >> - can be generated in one pass in a streaming fashion
>>>> >> - can be incrementally maintained (i.e. for appending new data)
>>>> >> - it's already implemented in Spark
>>>> >> - more accurate for frequent values and less accurate for
>>>> >> less-frequent values, i.e. it tracks skewed values well
>>>> >> - easy to compute the inner product, i.e. it is trivial to compute
>>>> >> the count-min sketch of a join given the count-min sketches of the
>>>> >> two join tables
>>>> >>
>>>> >> Proposal 1 is to use a combination of count-min sketch and
>>>> >> equi-height histograms. In this case, the count-min sketch will be
>>>> >> used for selectivity estimation on equality predicates, and the
>>>> >> histogram will be used on range predicates.
>>>> >>
>>>> >> Proposal 2 is to use count-min sketch on equality predicates only,
>>>> >> and then a simple selected_range / (max - min) for range
>>>> >> predicates. This will be less accurate than using a histogram, but
>>>> >> simpler because we don't need to collect histograms.
>>>> >>
>>>> >> Proposal 3 is a variant of proposal 2 that takes into account that
>>>> >> skewed values can impact selectivity heavily. In 3, we track the
>>>> >> list of heavy hitters (HH, the most frequent items) along with a
>>>> >> count-min sketch on the column. Then:
>>>> >> - use count-min sketch on equality predicates
>>>> >> - for range predicates, estimatedFreq = sum(freq(HHInRange)) +
>>>> >> range / (max - min)
>>>> >>
>>>> >> Proposal 4 is to not use any sketch, and instead use a histogram
>>>> >> for high-cardinality columns and exact (value, frequency) pairs for
>>>> >> low-cardinality columns (e.g. number of distinct values <= 255).
>>>> >>
>>>> >> Proposal 5 is a variant of proposal 4 that tracks exact (value,
>>>> >> frequency) pairs for the most frequent values only, so we can still
>>>> >> have those for high-cardinality columns. This is actually very
>>>> >> similar to count-min sketch but might use less space, although it
>>>> >> requires two passes to compute the initial value and makes the
>>>> >> inner product for joins more difficult to compute.
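To make the size note and the sketch properties above concrete: the 48k figure is consistent with the implementation sizing the sketch at width = ceil(2 / eps) = 2000 counters and depth = ceil(-ln(1 - confidence) / ln 2) = 3 rows of 8-byte counters, i.e. 3 * 2000 * 8 = 48,000 bytes. Below is a minimal Scala snippet against the org.apache.spark.util.sketch.CountMinSketch class linked above (runnable in spark-shell; the column values and the seed are made up for illustration):

    import org.apache.spark.util.sketch.CountMinSketch

    // eps = 0.1% relative error, confidence = 0.87 -- the configuration
    // in the size estimate above.
    val cms = CountMinSketch.create(0.001, 0.87, /* seed = */ 42)
    println(s"depth = ${cms.depth()}, width = ${cms.width()}")  // 3, 2000

    // One-pass, streaming construction: add each column value as seen.
    Seq("a", "a", "a", "b", "c").foreach(v => cms.add(v))

    // Equality-predicate selectivity: estimated frequency / row count.
    // estimateCount never under-counts; here it returns 3 for "a".
    val sel = cms.estimateCount("a").toDouble / cms.totalCount()

    // Incremental maintenance for appends: sketch the new data and merge
    // it in (the two sketches must share eps/confidence/seed).
    val delta = CountMinSketch.create(0.001, 0.87, 42)
    Seq("a", "d").foreach(v => delta.add(v))
    cms.mergeInPlace(delta)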
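Proposal 3's range formula is terse; one plausible reading is that the uniform term is meant to cover only the non-heavy-hitter rows, with the heavy hitters counted exactly. A hypothetical Scala helper under that assumption (ColumnStats and every name here is illustrative, not a Spark API):

    // Hypothetical per-column statistics: exact (value, frequency) pairs
    // are kept only for the heavy hitters.
    final case class ColumnStats(
        min: Double,
        max: Double,
        rowCount: Long,
        heavyHitters: Map[Double, Long])

    object RangeSelectivity {
      /** Estimated number of rows with lo <= col <= hi. */
      def estimatedFreq(stats: ColumnStats, lo: Double, hi: Double): Double = {
        // Exact counts for the skewed values that fall inside the range.
        val hhInRange = stats.heavyHitters.collect {
          case (v, f) if v >= lo && v <= hi => f
        }.sum
        // Assume the remaining rows are uniform on [min, max] and take
        // the fraction of that interval the predicate covers.
        val rest = stats.rowCount - stats.heavyHitters.values.sum
        val width =
          math.max(0.0, math.min(hi, stats.max) - math.max(lo, stats.min))
        val uniformPart =
          if (stats.max > stats.min) rest * width / (stats.max - stats.min)
          else 0.0
        hhInRange + uniformPart
      }
    }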