They are not yet complete. The benchmark was done with an implementation of a cost-based optimizer that Huawei had internally for Spark 1.5 (or an even older version).
On Mon, Nov 14, 2016 at 10:46 PM, Yogesh Mahajan <ymaha...@snappydata.io> wrote:

> It looks like the Huawei team has run the TPC-H benchmark and some
> real-world test cases, and their results show a good performance gain of
> 2x-5x speedup depending on data volume.
> Can we share the numbers and the query-wise rationale behind the gain? Is
> anything done on Spark master yet? Or is the implementation not yet
> complete?
>
> Thanks,
> Yogesh Mahajan
> http://www.snappydata.io/blog
>
> On Tue, Nov 15, 2016 at 12:03 PM, Yogesh Mahajan <ymaha...@snappydata.io> wrote:
>
>> Thanks Reynold for the detailed proposals. A few questions/clarifications:
>>
>> 1. How will the existing rule-based optimizer co-exist with CBO? The
>> existing rules are heuristic/empirical; I am assuming rules like
>> predicate pushdown or project pruning will co-exist with CBO, and we
>> just want to estimate the filter factor and cardinality more accurately.
>> With predicate pushdown, a filter is mostly executed at an early stage
>> of a query plan, and the cardinality estimate of a predicate can improve
>> the precision of cardinality estimates further up the plan.
>>
>> 2. Will query transformations now be based on the cost calculation? If
>> yes, what happens when the cost of executing the transformed statement
>> is higher than the cost of the untransformed query?
>>
>> 3. Is there an upper limit on the space used for storing the frequency
>> histogram? 255 entries? For columns with more distinct values, we could
>> even consider a height-balanced histogram, as in Oracle.
>>
>> 4. The first three proposals are new and not mentioned in the CBO design
>> spec. CMS is good, but it is less accurate than traditional histograms.
>> This is a major trade-off we need to consider.
>>
>> 5. Are we going to consider system statistics, such as CPU speed or disk
>> access speed, in the cost function? How about shuffle cost, output
>> partitioning, etc.?
>>
>> 6. Like the current rule-based optimizer, will this CBO also be an
>> 'extensible optimizer'? If yes, what functionality can users extend?
>>
>> 7. Why will this CBO be disabled by default? Is "spark.sql.cbo" false by
>> default because it is still experimental?
>>
>> 8. ANALYZE TABLE, analyzeColumns, etc. all look good.
>>
>> 9. From a release point of view, how is this planned? Will all of this
>> be implemented in one go or in phases?
>>
>> Thanks,
>> Yogesh Mahajan
>> http://www.snappydata.io/blog
>>
>> On Mon, Nov 14, 2016 at 11:25 PM, Reynold Xin <r...@databricks.com> wrote:
>>
>>> Historically TPC-DS and TPC-H. There is certainly a chance of
>>> overfitting to one or two benchmarks. Note that those will probably be
>>> impacted more by the way we set the parameters for CBO than by using x
>>> or y for the summary statistics.
>>>
>>> On Monday, November 14, 2016, Shivaram Venkataraman
>>> <shiva...@eecs.berkeley.edu> wrote:
>>>
>>>> Do we have any query workloads for which we can benchmark these
>>>> proposals in terms of performance?
>>>>
>>>> Thanks
>>>> Shivaram
>>>>
>>>> On Sun, Nov 13, 2016 at 5:53 PM, Reynold Xin <r...@databricks.com> wrote:
>>>> > One additional note: in terms of size, the size of a count-min
>>>> > sketch with eps = 0.1% and confidence 0.87, uncompressed, is 48k
>>>> > bytes.
>>>> >
>>>> > To look up what that means, see
>>>> > http://spark.apache.org/docs/latest/api/java/org/apache/spark/util/sketch/CountMinSketch.html
>>>> >
>>>> > On Sun, Nov 13, 2016 at 5:30 PM, Reynold Xin <r...@databricks.com> wrote:
>>>> >>
>>>> >> I want to bring this discussion to the dev list to gather broader
>>>> >> feedback, as there have been some discussions over multiple JIRA
>>>> >> tickets (SPARK-16026, etc.) and GitHub pull requests about what
>>>> >> statistics to collect and how to use them.
>>>> >>
>>>> >> There are some basic statistics on columns that are obvious to use,
>>>> >> and we don't need to debate these: estimated size (in bytes), row
>>>> >> count, min, max, number of nulls, number of distinct values,
>>>> >> average column length, max column length.
>>>> >>
>>>> >> In addition, we want to estimate selectivity for equality and range
>>>> >> predicates better, especially taking into account skewed values and
>>>> >> outliers.
>>>> >>
>>>> >> Before I dive into the different options, let me first explain
>>>> >> count-min sketch: count-min sketch is a common sketch algorithm
>>>> >> that tracks frequency counts. It has the following nice properties:
>>>> >> - sublinear space
>>>> >> - can be generated in one pass in a streaming fashion
>>>> >> - can be incrementally maintained (i.e. for appending new data)
>>>> >> - it's already implemented in Spark
>>>> >> - more accurate for frequent values and less accurate for
>>>> >> less-frequent values, i.e. it tracks skewed values well
>>>> >> - easy to compute the inner product, i.e. it is trivial to compute
>>>> >> the count-min sketch of a join given the count-min sketches of the
>>>> >> two join tables
>>>> >>
>>>> >> Proposal 1 is to use a combination of count-min sketch and
>>>> >> equi-height histograms. In this case, the count-min sketch will be
>>>> >> used for selectivity estimation on equality predicates, and the
>>>> >> histogram will be used on range predicates.
>>>> >>
>>>> >> Proposal 2 is to use count-min sketch on equality predicates only,
>>>> >> and then a simple selected_range / (max - min) for range
>>>> >> predicates. This will be less accurate than using a histogram, but
>>>> >> simpler because we don't need to collect histograms.
>>>> >>
>>>> >> Proposal 3 is a variant of proposal 2 that takes into account that
>>>> >> skewed values can impact selectivity heavily. In 3, we track the
>>>> >> list of heavy hitters (HH, the most frequent items) along with a
>>>> >> count-min sketch on the column. Then:
>>>> >> - use count-min sketch on equality predicates
>>>> >> - for range predicates, estimatedFreq = sum(freq(HHInRange)) +
>>>> >> range / (max - min)
>>>> >>
>>>> >> Proposal 4 is to not use any sketch, and instead use a histogram
>>>> >> for high-cardinality columns and exact (value, frequency) pairs for
>>>> >> low-cardinality columns (e.g. number of distinct values <= 255).
>>>> >>
>>>> >> Proposal 5 is a variant of proposal 4 that tracks exact (value,
>>>> >> frequency) pairs for the most frequent values only, so we can still
>>>> >> have those for high-cardinality columns. This is actually very
>>>> >> similar to count-min sketch but might use less space, although it
>>>> >> requires two passes to compute the initial value and makes the
>>>> >> inner product for joins more difficult to compute.
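To make the size note and the sketch properties above concrete: the 48k figure is consistent with the implementation sizing the sketch at width = ceil(2 / eps) = 2000 counters and depth = ceil(-ln(1 - confidence) / ln 2) = 3 rows of 8-byte counters, i.e. 3 * 2000 * 8 = 48,000 bytes. Below is a minimal Scala snippet against the org.apache.spark.util.sketch.CountMinSketch class linked above (runnable in spark-shell; the column values and the seed are made up for illustration):

    import org.apache.spark.util.sketch.CountMinSketch

    // eps = 0.1% relative error, confidence = 0.87 -- the configuration
    // in the size estimate above.
    val cms = CountMinSketch.create(0.001, 0.87, /* seed = */ 42)
    println(s"depth = ${cms.depth()}, width = ${cms.width()}")  // 3, 2000

    // One-pass, streaming construction: add each column value as seen.
    Seq("a", "a", "a", "b", "c").foreach(v => cms.add(v))

    // Equality-predicate selectivity: estimated frequency / row count.
    // estimateCount never under-counts; here it returns 3 for "a".
    val sel = cms.estimateCount("a").toDouble / cms.totalCount()

    // Incremental maintenance for appends: sketch the new data and merge
    // it in (the two sketches must share eps/confidence/seed).
    val delta = CountMinSketch.create(0.001, 0.87, 42)
    Seq("a", "d").foreach(v => delta.add(v))
    cms.mergeInPlace(delta)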
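Proposal 3's range formula is terse; one plausible reading is that the uniform term is meant to cover only the non-heavy-hitter rows, with the heavy hitters counted exactly. A hypothetical Scala helper under that assumption (ColumnStats and every name here is illustrative, not a Spark API):

    // Hypothetical per-column statistics: exact (value, frequency) pairs
    // are kept only for the heavy hitters.
    final case class ColumnStats(
        min: Double,
        max: Double,
        rowCount: Long,
        heavyHitters: Map[Double, Long])

    object RangeSelectivity {
      /** Estimated number of rows with lo <= col <= hi. */
      def estimatedFreq(stats: ColumnStats, lo: Double, hi: Double): Double = {
        // Exact counts for the skewed values that fall inside the range.
        val hhInRange = stats.heavyHitters.collect {
          case (v, f) if v >= lo && v <= hi => f
        }.sum
        // Assume the remaining rows are uniform on [min, max] and take
        // the fraction of that interval the predicate covers.
        val rest = stats.rowCount - stats.heavyHitters.values.sum
        val width =
          math.max(0.0, math.min(hi, stats.max) - math.max(lo, stats.min))
        val uniformPart =
          if (stats.max > stats.min) rest * width / (stats.max - stats.min)
          else 0.0
        hhInRange + uniformPart
      }
    }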