[ https://issues.apache.org/jira/browse/SPARK-17626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15556305#comment-15556305 ]
Ioana Delaney commented on SPARK-17626:
---------------------------------------

[~rxin] Hi Reynold,

Thank you for your reply. My argument for why the two complement each other is the following. A star join has a clear join pattern, established by the referential-integrity constraints between the fact table and the dimension tables. It is desirable to identify and preserve such patterns in a query, as they offer optimal join execution among that particular set of tables. A cost-based optimizer has the more general goal of finding the best execution plan from table statistics and a set of cost formulas. Without star-schema knowledge, the CBO may destroy the fact-dimension join pattern, for example by mixing in tables outside the RI relationship. Moreover, a query may have multiple fact tables and more than one star-join plan. Ideally, the star joins should be planned as a bushy join tree that preserves each individual star plan. Without additional knowledge, it may be hard for the CBO to discover such join patterns. Star-schema detection can therefore provide hints to the CBO to consider certain planning patterns. Then, within an identified star-join plan, the cost-based optimizer can choose the best join order among the dimension tables based on their cardinality, predicate selectivity, etc.

Our current implementation uses a very simple algorithm that applies generally accepted heuristics and shows good performance results. Further improvements can be made with support for plan-cardinality and predicate-selectivity estimates and, in the near future, perhaps support for informational RI constraints.

Regarding the need for a specialized operator, it may not be necessary given the current Spark runtime implementation. I haven't had a chance to explore the Spark code-gen and runtime areas, but I think that, in general, knowing that a certain sequence of joins is a star plan, with or without a specialized operator, could open up the runtime to various optimizations.
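The detection-plus-ordering idea described above can be sketched roughly as follows. This is a minimal illustration in Python, not Spark's actual implementation; the table names, row counts, and selectivities are hypothetical, and the heuristic (largest table is the fact, join neighbors are dimensions, dimensions ordered by estimated filtered size) is a simplified stand-in for the real algorithm:

```python
def detect_star_join(row_counts, join_edges, selectivity):
    """Toy star-join detection heuristic (illustrative only).

    row_counts:  table name -> row count
    join_edges:  set of frozenset({t1, t2}) equality-join pairs
    selectivity: table name -> estimated predicate selectivity in (0, 1]
    Returns (fact table, dimensions ordered most-selective first).
    """
    # Heuristic: treat the largest table in the join graph as the fact table.
    fact = max(row_counts, key=row_counts.get)
    # Its equality-join neighbors are candidate dimension tables.
    dims = {t for e in join_edges if fact in e for t in e if t != fact}
    # Order dimensions by estimated post-filter size, smallest first,
    # so the most selective joins shrink the fact table early.
    return fact, sorted(dims, key=lambda t: selectivity[t] * row_counts[t])


# Hypothetical TPC-DS-like join graph.
rows = {"store_sales": 1_000_000_000, "date_dim": 73_049,
        "item": 300_000, "store": 1_000}
edges = {frozenset({"store_sales", "date_dim"}),
         frozenset({"store_sales", "item"}),
         frozenset({"store_sales", "store"})}
sel = {"date_dim": 0.01, "item": 0.5, "store": 1.0}

fact, dims = detect_star_join(rows, edges, sel)
```

With these made-up statistics, `store_sales` is picked as the fact table and `date_dim` (heavily filtered) is joined first. A real planner would of course derive the statistics from the catalog and validate the RI relationship rather than rely on table size alone.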
> TPC-DS performance improvements using star-schema heuristics
> ------------------------------------------------------------
>
>                 Key: SPARK-17626
>                 URL: https://issues.apache.org/jira/browse/SPARK-17626
>             Project: Spark
>          Issue Type: Umbrella
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Ioana Delaney
>            Priority: Critical
>         Attachments: StarSchemaJoinReordering.pptx
>
>
> *TPC-DS performance improvements using star-schema heuristics*
> \\
> \\
> TPC-DS consists of multiple snowflake schemas, which are star schemas with
> dimensions linking to further dimensions. A star schema consists of a fact
> table referencing a number of dimension tables. The fact table holds the
> main data about a business. A dimension table, usually smaller, describes
> data reflecting a dimension/attribute of the business.
> \\
> \\
> As part of the benchmark performance investigation, we observed a pattern of
> sub-optimal execution plans for large fact-table joins. Manually rewriting
> some of the queries into selective fact-dimension joins resulted in
> significant performance improvements. This prompted us to develop a simple
> join-reordering algorithm based on star-schema detection. Performance
> testing using a *1TB TPC-DS workload* shows an overall improvement of *19%*.
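The intuition behind rewriting queries into selective fact-dimension joins can be shown with a back-of-the-envelope cardinality estimate. This is a rough sketch with hypothetical numbers, not taken from the benchmark queries: each FK join of the fact table to a filtered dimension keeps roughly `selectivity * rows`, so joining the most selective dimension first keeps every intermediate result small.

```python
def estimate_plan(fact_rows, dim_selectivities):
    """Cost proxy for a left-deep plan: fact table joined to dimensions
    in the given order. Each FK join to a filtered dimension retains
    roughly sel * rows; the cost is the sum of intermediate sizes.
    (Illustrative model only, not Spark's cost function.)
    """
    rows, cost = fact_rows, 0.0
    for sel in dim_selectivities:
        rows *= sel          # rows surviving this fact-dimension join
        cost += rows         # size of the intermediate result
    return cost

FACT_ROWS = 1_000_000_000    # hypothetical fact-table size

# Most selective dimension first vs. last (selectivities are made up).
good = estimate_plan(FACT_ROWS, [0.01, 0.5, 1.0])
bad = estimate_plan(FACT_ROWS, [1.0, 0.5, 0.01])
```

Under this toy model the selective-first order accumulates about 2x10^7 intermediate rows while the reverse order accumulates about 1.5x10^9, which is the kind of gap the manual rewrites exploited.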
> \\
> \\
> *Summary of the results:*
> {code}
> Passed                    99
> Failed                     0
> Total q time (s)      14,962
> Max time               1,467
> Min time                   3
> Mean time                145
> Geomean                   44
> {code}
> *Compared to baseline* (Negative = improvement; Positive = degradation):
> {code}
> End to end improved (%)              -19%
> Mean time improved (%)               -19%
> Geomean improved (%)                 -24%
> End to end improved (seconds)      -3,603
> Number of queries improved (>10%)      45
> Number of queries degraded (>10%)       6
> Number of queries unchanged            48
> Top 10 queries improved (%)          -20%
> {code}
> Cluster: 20-node cluster with each node having:
> * 10 2TB hard disks in a JBOD configuration, 2 Intel(R) Xeon(R) CPU E5-2680
> v2 @ 2.80GHz processors, 128 GB RAM, 10Gigabit Ethernet.
> * Total memory for the cluster: 2.5TB
> * Total storage: 400TB
> * Total CPU cores: 480
> Hadoop stack: IBM Open Platform with Apache Hadoop v4.2. Apache Spark 2.0 GA
> Database info:
> * Schema: TPCDS
> * Scale factor: 1TB total space
> * Storage format: Parquet with Snappy compression
> Our investigation and results are included in the attached document.
> There are two parts to this improvement:
> # Join reordering using star-schema detection
> # A new selectivity hint to specify the selectivity of the predicates over
> base tables. The selectivity hint is optional and was not used in the above
> TPC-DS tests.
> \\

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org