[ 
https://issues.apache.org/jira/browse/SPARK-17791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15749888#comment-15749888
 ] 

Ioana Delaney commented on SPARK-17791:
---------------------------------------

I’ve incorporated table and column statistics into the star join detection 
algorithm. The fact table is chosen based on table cardinality, and the 
dimensions are chosen based on the referential integrity (RI) constraints. To 
infer column uniqueness, the algorithm uses table and column statistics: it 
compares the number of distinct values in the column with the total number of 
rows in the table. If their relative difference is within certain limits, the 
column is assumed to be unique. The updated design document is uploaded to 
https://issues.apache.org/jira/secure/attachment/12843316/StarJoinReordering1214.doc.
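
As a rough illustration of the uniqueness check described above, here is a 
minimal Python sketch. It is not the code from the patch: the function name 
is_likely_unique, the parameter max_relative_diff, and its default of 0.05 are 
all assumptions made for this example, since the comment only says the 
difference must be "within certain limits".

```python
def is_likely_unique(distinct_count: int, row_count: int,
                     max_relative_diff: float = 0.05) -> bool:
    """Guess whether a column is unique from table and column statistics.

    distinct_count    -- number of distinct values recorded for the column
    row_count         -- total number of rows recorded for the table
    max_relative_diff -- hypothetical tolerance; 0.05 is an assumed value
    """
    # Without usable statistics nothing can be inferred, so assume non-unique.
    if row_count <= 0 or distinct_count <= 0:
        return False
    # Relative difference between the distinct count and the row count.
    # abs() is used because an estimated distinct count may overshoot.
    relative_diff = abs(row_count - distinct_count) / row_count
    return relative_diff <= max_relative_diff


if __name__ == "__main__":
    print(is_likely_unique(970, 1000))   # 3% difference
    print(is_likely_unique(500, 1000))   # 50% difference
```

A tolerance is needed rather than an exact equality test because the 
statistics may be estimates or may be slightly stale relative to the data.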

> Join reordering using star schema detection
> -------------------------------------------
>
>                 Key: SPARK-17791
>                 URL: https://issues.apache.org/jira/browse/SPARK-17791
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Ioana Delaney
>            Assignee: Ioana Delaney
>            Priority: Critical
>         Attachments: StarJoinReordering1214.doc
>
>
> This JIRA is a sub-task of SPARK-17626.
> The objective is to provide a consistent performance improvement for star 
> schema queries. A star schema consists of one or more fact tables referencing 
> a number of dimension tables. In general, queries against a star schema are 
> expected to run fast because of the established RI constraints among the 
> tables. This design proposes a join reordering based on natural, generally 
> accepted heuristics for star schema queries:
> * Finds the star join with the largest fact table and places it on the 
> driving arm of the left-deep join. This plan avoids large tables on the 
> inner side of the join, and thus favors hash joins. 
> * Applies the most selective dimensions early in the plan to reduce the 
> amount of data flow.
> The design description is included in the attached document.
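
The two heuristics in the description can be sketched in a few lines of Python. 
This is only an illustration under stated assumptions, not the design in the 
attached document: TableStats, order_star_join, and the selectivity field (the 
estimated fraction of rows a dimension keeps after its local predicates) are 
hypothetical names, and the sketch skips the RI-constraint check that decides 
which tables actually qualify as dimensions.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TableStats:
    name: str
    row_count: int      # table cardinality from statistics
    selectivity: float  # hypothetical: fraction of rows kept by local predicates


def order_star_join(tables: List[TableStats]) -> List[str]:
    """Return table names in left-deep join order for one star join."""
    if not tables:
        return []
    # Heuristic 1: the largest table is treated as the fact table and
    # placed first, on the driving arm of the left-deep join.
    fact = max(tables, key=lambda t: t.row_count)
    dimensions = [t for t in tables if t is not fact]
    # Heuristic 2: the most selective dimensions (smallest fraction of
    # rows kept) are joined first to reduce the data flowing upward.
    dimensions.sort(key=lambda t: t.selectivity)
    return [fact.name] + [d.name for d in dimensions]


if __name__ == "__main__":
    star = [
        TableStats("date_dim", 3_650, 0.10),
        TableStats("sales", 1_000_000, 1.00),
        TableStats("store", 500, 0.50),
        TableStats("item", 20_000, 0.02),
    ]
    print(order_star_join(star))
```

With the made-up statistics above, sales is picked as the fact table and the 
dimensions follow in order of increasing selectivity fraction: item, date_dim, 
store.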



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

