[ 
https://issues.apache.org/jira/browse/SPARK-17626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15556305#comment-15556305
 ] 

Ioana Delaney commented on SPARK-17626:
---------------------------------------

[~rxin] Hi Reynold, Thank you for your reply. My argument for why they 
complement each other is the following. A star join has a clear join pattern 
established based on referential integrity constraints among the fact and the 
dimension tables. It is desirable to identify and preserve such patterns in a 
query as they offer optimal join executions among those particular set of 
tables. Cost based optimizer has a generic purpose of finding the best 
execution plan based on tables’ statistics and a set of cost formulas. Without 
additional star schema knowledge, CBO may destroy the fact-dimension join 
patterns by mixing in tables outside the RI relationship, for example. 
Moreover, a query may have multiple fact tables with more than one star-join 
plan. Ideally, the star-joins should be planned in the form of a bushy-join 
plan that preserves each individual star plan. Without additional knowledge, it 
might be hard for CBO to observe such join patterns. Therefore, star schema 
detection may provide hints to CBO to consider certain planning patterns. Then, 
within an identified star join plan, the cost based optimizer will offer the 
best join order among the dimension tables based on their cardinality, 
predicate selectivity, etc.

Our current implementation uses a very simple algorithm, that applies generally 
accepted heuristics, and shows good performance results. More improvements can 
be made with support for plan cardinality, predicate selectivity estimates, 
and, in the near future, maybe support for informational RI constraints.

Regarding the need for a specialized operator, it might not be necessary given 
the current Spark runtime implementation. I didn’t get a chance to explore 
spark code gen and runtime areas. But I think that in general, knowing that a 
certain sequence of joins is a star-plan, with or without a specialized 
operator, might open up runtime for various optimizations.




> TPC-DS performance improvements using star-schema heuristics
> ------------------------------------------------------------
>
>                 Key: SPARK-17626
>                 URL: https://issues.apache.org/jira/browse/SPARK-17626
>             Project: Spark
>          Issue Type: Umbrella
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Ioana Delaney
>            Priority: Critical
>         Attachments: StarSchemaJoinReordering.pptx
>
>
> *TPC-DS performance improvements using star-schema heuristics*
> \\
> \\
> TPC-DS consists of multiple snowflake schema, which are multiple star schema 
> with dimensions linking to dimensions. A star schema consists of a fact table 
> referencing a number of dimension tables. Fact table holds the main data 
> about a business. Dimension table, a usually smaller table, describes data 
> reflecting the dimension/attribute of a business.
> \\
> \\
> As part of the benchmark performance investigation, we observed a pattern of 
> sub-optimal execution plans of large fact tables joins. Manual rewrite of 
> some of the queries into selective fact-dimensions joins resulted in 
> significant performance improvement. This prompted us to develop a simple 
> join reordering algorithm based on star schema detection. The performance 
> testing using *1TB TPC-DS workload* shows an overall improvement of *19%*. 
> \\
> \\
> *Summary of the results:*
> {code}
> Passed                 99
> Failed                  0
> Total q time (s)   14,962
> Max time            1,467
> Min time                3
> Mean time             145
> Geomean                44
> {code}
> *Compared to baseline* (Negative = improvement; Positive = Degradation):
> {code}
> End to end improved (%)              -19%     
> Mean time improved (%)               -19%
> Geomean improved (%)                 -24%
> End to end improved (seconds)      -3,603
> Number of queries improved (>10%)      45
> Number of queries degraded (>10%)       6
> Number of queries unchanged            48
> Top 10 queries improved (%)          -20%
> {code}
> Cluster: 20-node cluster with each node having:
> * 10 2TB hard disks in a JBOD configuration, 2 Intel(R) Xeon(R) CPU E5-2680 
> v2 @ 2.80GHz processors, 128 GB RAM, 10Gigabit Ethernet.
> * Total memory for the cluster: 2.5TB
> * Total storage: 400TB
> * Total CPU cores: 480
> Hadoop stack: IBM Open Platform with Apache Hadoop v4.2. Apache Spark 2.0 GA
> Database info:
> * Schema: TPCDS 
> * Scale factor: 1TB total space
> * Storage format: Parquet with Snappy compression
> Our investigation and results are included in the attached document.
> There are two parts to this improvement:
> # Join reordering using star schema detection
> # New selectivity hint to specify the selectivity of the predicates over base 
> tables. Selectivity hint is optional and it was not used in the above TPC-DS 
> tests. 
> \\



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to