[ 
https://issues.apache.org/jira/browse/SPARK-23752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23752:
----------------------------------
    Affects Version/s:     (was: 3.0.0)
                       3.1.0

> [Performance] Existential Subquery to Inner Join
> ------------------------------------------------
>
>                 Key: SPARK-23752
>                 URL: https://issues.apache.org/jira/browse/SPARK-23752
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Ioana Delaney
>            Priority: Major
>
> *+Existential Subquery to Inner Join+*
> Another enhancement that uses Informational Constraints is existential 
> subquery to inner join. This rewrite converts an existential subquery to an 
> inner join, and thus provides alternative join choices for the Optimizer 
> based on the selectivity of the tables. 
> An example using TPC-DS schema is shown below.
> {code}
> select c_first_name, c_last_name, c_email_address
> from customer c
> where EXISTS (select * 
>               from store_sales, date_dim
>               where c.c_customer_sk = ss_customer_sk and
>                     ss_sold_date_sk = d_date_sk and
>                     d_year = 2002 and
>                     d_moy between 4 and 4+3)
> {code}
> Spark uses left semi-join to evaluated existential subqueries. A left 
> semi-join will return a row from the outer table if there is at least one 
> match in the inner. Semi-join is a general used technique to rewrite 
> existential subqueries, but it has some limitations as it imposes a certain 
> order of the joined table. In this case the large fact table _store_sales_ 
> has to be on the inner of the join. A more efficient execution can be 
> obtained if the subquery is converted to a regular Inner join. This will 
> allow the Optimizer to choose better join orders.
> Converting a subquery to inner join is possible if either the subquery 
> produces at most one row or, by introducing a _Distinct_ on the outer table’s 
> row key in order to remove the duplicate rows that will result after the 
> inner join and thus to enforce the semantics of the subquery. As a key for 
> the outer, we can use the primary key of the _customer_ table.
> *Internal query after rewrite:*
> {code}
> select distinct c_customer_sk /*PK */, c_first_name, c_last_name, 
> c_email_address
> from customer c, store_sales, date_dim
> where c.c_customer_sk = ss_customer_sk and
>        ss_sold_date_sk = d_date_sk and
>        d_year = 2002 and
>        d_moy between 4 and 4+3
> {code}
> \\
> *Example performance results using 1TB TPC-DS benchmark:*
> \\
> ||TPC-DS Query||spark-2.2||spark-2.2 w/ sub2join||Query speedup||
> ||                        ||(secs)||(secs)                        ||          
>               ||
> |Q10|355|190|2x|
> |Q16|1394|706|2x|
> |Q35|462|285|1.5x|
> |Q69|327|173|1.5x|
> |Q94|603|307|2x|



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to