[ https://issues.apache.org/jira/browse/SPARK-17626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15550808#comment-15550808 ]
Reynold Xin commented on SPARK-17626: ------------------------------------- Can you guys comment on the usefulness of this in the context of cost-based optimization? > TPC-DS performance improvements using star-schema heuristics > ------------------------------------------------------------ > > Key: SPARK-17626 > URL: https://issues.apache.org/jira/browse/SPARK-17626 > Project: Spark > Issue Type: Umbrella > Components: SQL > Affects Versions: 2.1.0 > Reporter: Ioana Delaney > Priority: Critical > Attachments: StarSchemaJoinReordering.pptx > > > *TPC-DS performance improvements using star-schema heuristics* > \\ > \\ > TPC-DS consists of multiple snowflake schema, which are multiple star schema > with dimensions linking to dimensions. A star schema consists of a fact table > referencing a number of dimension tables. Fact table holds the main data > about a business. Dimension table, a usually smaller table, describes data > reflecting the dimension/attribute of a business. > \\ > \\ > As part of the benchmark performance investigation, we observed a pattern of > sub-optimal execution plans of large fact tables joins. Manual rewrite of > some of the queries into selective fact-dimensions joins resulted in > significant performance improvement. This prompted us to develop a simple > join reordering algorithm based on star schema detection. The performance > testing using *1TB TPC-DS workload* shows an overall improvement of *19%*. > \\ > \\ > *Summary of the results:* > {code} > Passed 99 > Failed 0 > Total q time (s) 14,962 > Max time 1,467 > Min time 3 > Mean time 145 > Geomean 44 > {code} > *Compared to baseline* (Negative = improvement; Positive = Degradation): > {code} > End to end improved (%) -19% > Mean time improved (%) -19% > Geomean improved (%) -24% > End to end improved (seconds) -3,603 > Number of queries improved (>10%) 45 > Number of queries degraded (>10%) 6 > Number of queries unchanged 48 > Top 10 queries improved (%) -20% > {code} > Cluster: 20-node cluster with each node having: > * 10 2TB hard disks in a JBOD configuration, 2 Intel(R) Xeon(R) CPU E5-2680 > v2 @ 2.80GHz processors, 128 GB RAM, 10Gigabit Ethernet. > * Total memory for the cluster: 2.5TB > * Total storage: 400TB > * Total CPU cores: 480 > Hadoop stack: IBM Open Platform with Apache Hadoop v4.2. Apache Spark 2.0 GA > Database info: > * Schema: TPCDS > * Scale factor: 1TB total space > * Storage format: Parquet with Snappy compression > Our investigation and results are included in the attached document. > There are two parts to this improvement: > # Join reordering using star schema detection > # New selectivity hint to specify the selectivity of the predicates over base > tables. Selectivity hint is optional and it was not used in the above TPC-DS > tests. > \\ -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org