[ https://issues.apache.org/jira/browse/SPARK-23750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16407248#comment-16407248 ]
Apache Spark commented on SPARK-23750: -------------------------------------- User 'ioana-delaney' has created a pull request for this issue: https://github.com/apache/spark/pull/20868 > [Performance] Inner Join Elimination based on Informational RI constraints > -------------------------------------------------------------------------- > > Key: SPARK-23750 > URL: https://issues.apache.org/jira/browse/SPARK-23750 > Project: Spark > Issue Type: Sub-task > Components: SQL > Affects Versions: 3.0.0 > Reporter: Ioana Delaney > Priority: Major > > +*Inner Join Elimination based on Informational RI constraints*+ > This transformation detects RI joins and eliminates the parent/PK table if > none of its columns, other than the PK columns, are referenced in the query. > Typical examples that benefit from this rewrite are queries over complex > views. > *View using TPC-DS schema:* > {code} > create view customer_purchases_2002 (id, last, first, product, store_id, > month, quantity) as > select c_customer_id, c_last_name, c_first_name, i_product_name, s_store_id, > d_moy, ss_quantity > from store_sales, date_dim, customer, item, store > where d_date_sk = ss_sold_date_sk and > c_customer_sk = ss_customer_sk and > i_item_sk = ss_item_sk and > s_store_sk = ss_store_sk and > d_year = 2002 > {code} > The view returns customer purchases made in year 2002. It is a join between > fact table _store_sales_ and dimensions _customer_, _item,_ _store_, and > _date_. The tables are joined using RI predicates. > If we write a query that only selects a subset of columns from the view, for > example, we are only interested in the items bought and not the stores, > internally, the Optimizer, will first merge the view into the query, and > then, based on the _primary key – foreign key_ join predicate analysis, it > will decide that the join with the _store_ table is not needed, and therefore > the _store_ table is removed. > *Query:* > {code} > select id, first, last, product, quantity > from customer_purchases_2002 > where product like ‘bicycle%’ and > month between 1 and 2 > {code} > *Internal query after view expansion:* > {code} > select c_customer_id as id, c_first_name as first, c_last_name as last, > i_product_name as product,ss_quantity as quantity > from store_sales, date_dim, customer, item, store > where d_date_sk = ss_sold_date_sk and > c_customer_sk = ss_customer_sk and > i_item_sk = ss_item_sk and > s_store_sk = ss_store_sk and > d_year = 2002 and > month between 1 and 2 and > product like ‘bicycle%’ > {code} > *Internal optimized query after join elimination:* > {code:java} > select c_customer_id as id, c_first_name as first, c_last_name as last, > i_product_name as product,ss_quantity as quantity > from store_sales, date_dim, customer, item > where d_date_sk = ss_sold_date_sk and > c_customer_sk = ss_customer_sk and > i_item_sk = ss_item_sk and > d_year = 2002 and > month between 1 and 2 and > product like ‘bicycle%’ > {code} > The join with _store_ table can be removed since no columns are retrieved > from the table, and every row from the _store_sales_ fact table will find a > match in _store_ based on the RI relationship. -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org