[ 
https://issues.apache.org/jira/browse/SPARK-17626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15559044#comment-15559044
 ] 

Ron Hu commented on SPARK-17626:
--------------------------------

In the CBO design spec we posted in 
https://issues.apache.org/jira/browse/SPARK-16026,
we illustrated a Multi-way Join Ordering Optimization algorithm using dynamic 
programming technique.  This algorithm should be able to pick up the best join 
re-ordering plan. It is possible that the search space is big.  We need some 
heuristics to reduce the search space. 

As Zhenhua pointed out, we can identify all the primary-key/foreign-key joins 
as we collect number of distinct values to infer whether or not a join column 
is a primary key.  If a join relation has primary key join column, then it is a 
dimension table.  If a join relation has foreign key columns, then it is a fact 
table.  Once a fact table is identified, we form a star schema by finding out 
all the dimension tables that have join conditions with the given fact table.

As for the selectivity hint, we do not need selectivity hint to deal with 
comparison expression like:
  column_name  <Comparison_operator>  constant_value
where a comparison operator is =, <, <=, >, >=, etc. 
This is because, with the histogram we are implementing now in CBO, we can find 
the filtering selectivity properly.  However, for the following cases, a 
selectivity hint will be helpful.

Case 1:
  WHERE o_comment not like '%special%request%'  /* TPC-H Q13 */
Histogram cannot provide such detailed statistics information for a string 
pattern which can a complex regular expression.

Case 2:
  WHERE l_commitdate < l_receiptdate /* TPC-H Q4 */
Today we define one-dimensional histogram to keep track the data distribution 
of a single column.  We do not handle the non-equal relationship between two 
columns.


> TPC-DS performance improvements using star-schema heuristics
> ------------------------------------------------------------
>
>                 Key: SPARK-17626
>                 URL: https://issues.apache.org/jira/browse/SPARK-17626
>             Project: Spark
>          Issue Type: Umbrella
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Ioana Delaney
>            Priority: Critical
>         Attachments: StarSchemaJoinReordering.pptx
>
>
> *TPC-DS performance improvements using star-schema heuristics*
> \\
> \\
> TPC-DS consists of multiple snowflake schema, which are multiple star schema 
> with dimensions linking to dimensions. A star schema consists of a fact table 
> referencing a number of dimension tables. Fact table holds the main data 
> about a business. Dimension table, a usually smaller table, describes data 
> reflecting the dimension/attribute of a business.
> \\
> \\
> As part of the benchmark performance investigation, we observed a pattern of 
> sub-optimal execution plans of large fact tables joins. Manual rewrite of 
> some of the queries into selective fact-dimensions joins resulted in 
> significant performance improvement. This prompted us to develop a simple 
> join reordering algorithm based on star schema detection. The performance 
> testing using *1TB TPC-DS workload* shows an overall improvement of *19%*. 
> \\
> \\
> *Summary of the results:*
> {code}
> Passed                 99
> Failed                  0
> Total q time (s)   14,962
> Max time            1,467
> Min time                3
> Mean time             145
> Geomean                44
> {code}
> *Compared to baseline* (Negative = improvement; Positive = Degradation):
> {code}
> End to end improved (%)              -19%     
> Mean time improved (%)               -19%
> Geomean improved (%)                 -24%
> End to end improved (seconds)      -3,603
> Number of queries improved (>10%)      45
> Number of queries degraded (>10%)       6
> Number of queries unchanged            48
> Top 10 queries improved (%)          -20%
> {code}
> Cluster: 20-node cluster with each node having:
> * 10 2TB hard disks in a JBOD configuration, 2 Intel(R) Xeon(R) CPU E5-2680 
> v2 @ 2.80GHz processors, 128 GB RAM, 10Gigabit Ethernet.
> * Total memory for the cluster: 2.5TB
> * Total storage: 400TB
> * Total CPU cores: 480
> Hadoop stack: IBM Open Platform with Apache Hadoop v4.2. Apache Spark 2.0 GA
> Database info:
> * Schema: TPCDS 
> * Scale factor: 1TB total space
> * Storage format: Parquet with Snappy compression
> Our investigation and results are included in the attached document.
> There are two parts to this improvement:
> # Join reordering using star schema detection
> # New selectivity hint to specify the selectivity of the predicates over base 
> tables. Selectivity hint is optional and it was not used in the above TPC-DS 
> tests. 
> \\



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to