Ioana Delaney created SPARK-23752:
-------------------------------------

             Summary: [Performance] Existential Subquery to Inner Join
                 Key: SPARK-23752
                 URL: https://issues.apache.org/jira/browse/SPARK-23752
             Project: Spark
          Issue Type: Sub-task
          Components: SQL
    Affects Versions: 3.0.0
            Reporter: Ioana Delaney


*+Existential Subquery to Inner Join+*

Another enhancement that uses Informational Constraints is the existential subquery 
to inner join rewrite. It converts an existential subquery into an inner join, 
and thus gives the Optimizer alternative join choices based on the 
selectivity of the tables. 

An example using TPC-DS schema is shown below.

{code}
select c_first_name, c_last_name, c_email_address
from customer c
where EXISTS (select * 
              from store_sales, date_dim
              where c.c_customer_sk = ss_customer_sk and
                    ss_sold_date_sk = d_date_sk and
                    d_year = 2002 and
                    d_moy between 4 and 4+3)
{code}

Spark uses a left semi-join to evaluate existential subqueries. A left semi-join 
returns a row from the outer table if there is at least one match in the 
inner table. Semi-join is a commonly used technique to rewrite existential subqueries, 
but it has a limitation: it imposes a certain order on the joined tables. 
In this case the large fact table _store_sales_ has to be on the inner side of the 
join. A more efficient execution can be obtained if the subquery is converted 
to a regular inner join, which allows the Optimizer to choose better join 
orders.
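
For illustration, a minimal Scala sketch that shows the LeftSemi join Spark currently 
picks for the EXISTS query. It assumes the TPC-DS tables are already registered as 
temporary views in the current SparkSession; the table and column names are the ones 
used above.

{code}
import org.apache.spark.sql.SparkSession

// Assumes an existing SparkSession with customer, store_sales and date_dim
// registered as temporary views (e.g. loaded from the TPC-DS data set).
val spark = SparkSession.builder().appName("exists-to-join").getOrCreate()

val existsQuery = spark.sql(
  """select c_first_name, c_last_name, c_email_address
    |from customer c
    |where EXISTS (select *
    |              from store_sales, date_dim
    |              where c.c_customer_sk = ss_customer_sk and
    |                    ss_sold_date_sk = d_date_sk and
    |                    d_year = 2002 and
    |                    d_moy between 4 and 4+3)""".stripMargin)

// The optimized and physical plans show a LeftSemi join with store_sales
// forced to the inner side, which is the ordering limitation described above.
existsQuery.explain(true)
{code}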

Converting a subquery to an inner join is possible if either the subquery produces 
at most one row, or a _Distinct_ is introduced on the outer table’s row key to 
remove the duplicate rows that result from the inner join and thus preserve the 
semantics of the subquery. As a key for the outer table, we can use the primary 
key of the _customer_ table.

*Internal query after rewrite:*

{code}
select distinct c_customer_sk /* PK */, c_first_name, c_last_name, c_email_address
from customer c, store_sales, date_dim
where c.c_customer_sk = ss_customer_sk and
      ss_sold_date_sk = d_date_sk and
      d_year = 2002 and
      d_moy between 4 and 4+3
{code}
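
A quick way to sanity-check that the two forms are equivalent on a given data set is 
to compare their results. A minimal sketch, reusing the existsQuery Dataset from the 
sketch above; the rewritten query projects away the extra key column before comparing:

{code}
val rewritten = spark.sql(
  """select distinct c_customer_sk /* PK */, c_first_name, c_last_name, c_email_address
    |from customer c, store_sales, date_dim
    |where c.c_customer_sk = ss_customer_sk and
    |      ss_sold_date_sk = d_date_sk and
    |      d_year = 2002 and
    |      d_moy between 4 and 4+3""".stripMargin)

// Drop the c_customer_sk key that was only added to de-duplicate, then check
// that neither form returns rows the other does not. EXCEPT is a set
// difference, so this compares the distinct result rows of the two queries.
val projected = rewritten.select("c_first_name", "c_last_name", "c_email_address")
assert(projected.except(existsQuery).count() == 0)
assert(existsQuery.except(projected).count() == 0)
{code}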

\\

*Example performance results using 1TB TPC-DS benchmark:*

\\

||TPC-DS Query||spark-2.2 (secs)||spark-2.2 w/ sub2join (secs)||Query speedup||
|Q10|355|190|2x|
|Q16|1394|706|2x|
|Q35|462|285|1.5x|
|Q69|327|173|1.5x|
|Q94|603|307|2x|







