Ioana Delaney created SPARK-23752: ------------------------------------- Summary: [Performance] Existential Subquery to Inner Join Key: SPARK-23752 URL: https://issues.apache.org/jira/browse/SPARK-23752 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Ioana Delaney
*+Existential Subquery to Inner Join+* Another enhancement that uses Informational Constraints is existential subquery to inner join. This rewrite converts an existential subquery to an inner join, and thus provides alternative join choices for the Optimizer based on the selectivity of the tables. An example using TPC-DS schema is shown below. {code} select c_first_name, c_last_name, c_email_address from customer c where EXISTS (select * from store_sales, date_dim where c.c_customer_sk = ss_customer_sk and ss_sold_date_sk = d_date_sk and d_year = 2002 and d_moy between 4 and 4+3) {code} Spark uses left semi-join to evaluated existential subqueries. A left semi-join will return a row from the outer table if there is at least one match in the inner. Semi-join is a general used technique to rewrite existential subqueries, but it has some limitations as it imposes a certain order of the joined table. In this case the large fact table _store_sales_ has to be on the inner of the join. A more efficient execution can be obtained if the subquery is converted to a regular Inner join. This will allow the Optimizer to choose better join orders. Converting a subquery to inner join is possible if either the subquery produces at most one row or, by introducing a _Distinct_ on the outer table’s row key in order to remove the duplicate rows that will result after the inner join and thus to enforce the semantics of the subquery. As a key for the outer, we can use the primary key of the _customer_ table. *Internal query after rewrite:* {code} select distinct c_customer_sk /*PK */, c_first_name, c_last_name, c_email_address from customer c, store_sales, date_dim where c.c_customer_sk = ss_customer_sk and ss_sold_date_sk = d_date_sk and d_year = 2002 and d_moy between 4 and 4+3 {code} \\ *Example performance results using 1TB TPC-DS benchmark:* \\ ||TPC-DS Query||spark-2.2||spark-2.2 w/ sub2join||Query speedup|| || ||(secs)||(secs) || || |Q10|355|190|2x| |Q16|1394|706|2x| |Q35|462|285|1.5x| |Q69|327|173|1.5x| |Q94|603|307|2x| -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org