[ https://issues.apache.org/jira/browse/SPARK-23752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dongjoon Hyun updated SPARK-23752:
----------------------------------
    Affects Version/s:     (was: 3.0.0)
                           3.1.0

> [Performance] Existential Subquery to Inner Join
> ------------------------------------------------
>
>                 Key: SPARK-23752
>                 URL: https://issues.apache.org/jira/browse/SPARK-23752
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Ioana Delaney
>            Priority: Major
>
> *+Existential Subquery to Inner Join+*
> Another enhancement that uses Informational Constraints is the rewrite of an existential subquery to an inner join. This rewrite converts an existential subquery to an inner join and thus provides alternative join choices for the Optimizer based on the selectivity of the tables.
> An example using the TPC-DS schema is shown below.
> {code}
> select c_first_name, c_last_name, c_email_address
> from customer c
> where EXISTS (select *
>               from store_sales, date_dim
>               where c.c_customer_sk = ss_customer_sk and
>                     ss_sold_date_sk = d_date_sk and
>                     d_year = 2002 and
>                     d_moy between 4 and 4+3)
> {code}
> Spark uses a left semi-join to evaluate existential subqueries. A left semi-join returns a row from the outer table if there is at least one match in the inner table. Semi-join is a commonly used technique for rewriting existential subqueries, but it has limitations because it imposes a fixed order on the joined tables. In this case the large fact table _store_sales_ has to be on the inner side of the join. A more efficient execution can be obtained if the subquery is converted to a regular inner join, which allows the Optimizer to choose better join orders.
> Converting a subquery to an inner join is possible if either the subquery produces at most one row, or if a _Distinct_ is introduced on the outer table's row key to remove the duplicate rows that result from the inner join, thereby preserving the semantics of the subquery. As a key for the outer table, we can use the primary key of the _customer_ table.
> *Internal query after rewrite:*
> {code}
> select distinct c_customer_sk /* PK */, c_first_name, c_last_name,
>        c_email_address
> from customer c, store_sales, date_dim
> where c.c_customer_sk = ss_customer_sk and
>       ss_sold_date_sk = d_date_sk and
>       d_year = 2002 and
>       d_moy between 4 and 4+3
> {code}
>
> *Example performance results using the 1TB TPC-DS benchmark:*
>
> ||TPC-DS Query||spark-2.2 (secs)||spark-2.2 w/ sub2join (secs)||Query speedup||
> |Q10|355|190|2x|
> |Q16|1394|706|2x|
> |Q35|462|285|1.5x|
> |Q69|327|173|1.5x|
> |Q94|603|307|2x|
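
Below is a minimal, standalone Scala sketch (not part of the original JIRA description) illustrating the equivalence the rewrite relies on: an EXISTS subquery and its inner-join-plus-DISTINCT form return the same rows when the de-duplication is done on the outer table's key. The table and column names (customer, sales, cust_sk, name, amount) are illustrative stand-ins, not the TPC-DS schema used above.

{code}
// Hypothetical demo data; shows the EXISTS form and the distinct-join rewrite
// producing the same result set on a local SparkSession.
import org.apache.spark.sql.SparkSession

object ExistsToJoinDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("exists-to-inner-join-demo")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Outer table with a unique key (plays the role of customer / c_customer_sk).
    Seq((1, "alice"), (2, "bob"), (3, "carol"))
      .toDF("cust_sk", "name")
      .createOrReplaceTempView("customer")

    // Inner fact table with duplicate matches (plays the role of store_sales).
    Seq((1, 10.0), (1, 20.0), (3, 5.0))
      .toDF("cust_sk", "amount")
      .createOrReplaceTempView("sales")

    // Original form: correlated existential subquery, planned as a left semi-join.
    val existsForm = spark.sql(
      """SELECT c.cust_sk, c.name
        |FROM customer c
        |WHERE EXISTS (SELECT * FROM sales s WHERE s.cust_sk = c.cust_sk)""".stripMargin)

    // Rewritten form: inner join plus DISTINCT on the outer table's key,
    // which removes the duplicates introduced by the join.
    val joinForm = spark.sql(
      """SELECT DISTINCT c.cust_sk, c.name
        |FROM customer c, sales s
        |WHERE s.cust_sk = c.cust_sk""".stripMargin)

    existsForm.show()
    joinForm.show()
    // Both return exactly one row each for cust_sk 1 and 3.

    spark.stop()
  }
}
{code}

The DISTINCT on cust_sk is what keeps customer 1 from appearing twice even though it has two matching sales rows; this is the role played by the primary key of the _customer_ table in the rewrite described above.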