[ 
https://issues.apache.org/jira/browse/SPARK-23750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16407248#comment-16407248
 ] 

Apache Spark commented on SPARK-23750:
--------------------------------------

User 'ioana-delaney' has created a pull request for this issue:
https://github.com/apache/spark/pull/20868

> [Performance] Inner Join Elimination based on Informational RI constraints
> --------------------------------------------------------------------------
>
>                 Key: SPARK-23750
>                 URL: https://issues.apache.org/jira/browse/SPARK-23750
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Ioana Delaney
>            Priority: Major
>
> +*Inner Join Elimination based on Informational RI constraints*+
> This transformation detects RI joins and eliminates the parent/PK table if 
> none of its columns, other than the PK columns, are referenced in the query.
> Typical examples that benefit from this rewrite are queries over complex 
> views.
> *View using TPC-DS schema:*
> {code}
> create view customer_purchases_2002 (id, last, first, product, store_id, 
> month, quantity) as
> select c_customer_id, c_last_name, c_first_name, i_product_name, s_store_id, 
> d_moy, ss_quantity 
> from store_sales, date_dim, customer, item, store
> where d_date_sk = ss_sold_date_sk and
>       c_customer_sk = ss_customer_sk and 
>       i_item_sk = ss_item_sk and
>       s_store_sk = ss_store_sk and
>       d_year = 2002
> {code}
> The view returns customer purchases made in year 2002. It is a join between 
> fact table _store_sales_ and dimensions _customer_, _item,_ _store_, and 
> _date_. The tables are joined using RI predicates.
> If we write a query that only selects a subset of columns from the view, for 
> example, we are only interested in the items bought and not the stores, 
> internally, the Optimizer, will first merge the view into the query, and 
> then, based on the _primary key – foreign key_ join predicate analysis, it 
> will decide that the join with the _store_ table is not needed, and therefore 
> the _store_ table is removed.
> *Query:*
> {code}
> select id, first, last, product, quantity 
> from customer_purchases_2002
> where product like ‘bicycle%’ and
>       month between 1 and 2
> {code}
> *Internal query after view expansion:*
> {code}
> select c_customer_id as id, c_first_name as first, c_last_name as last,
>            i_product_name as product,ss_quantity as quantity 
> from store_sales, date_dim, customer, item, store
> where d_date_sk = ss_sold_date_sk and
>       c_customer_sk = ss_customer_sk and 
>       i_item_sk = ss_item_sk and
>       s_store_sk = ss_store_sk and
>       d_year = 2002 and
>       month between 1 and 2 and
>       product like ‘bicycle%’
> {code}
> *Internal optimized query after join elimination:*
> {code:java}
> select c_customer_id as id, c_first_name as first, c_last_name as last,
>            i_product_name as product,ss_quantity as quantity 
> from store_sales, date_dim, customer, item
> where d_date_sk = ss_sold_date_sk and
>       c_customer_sk = ss_customer_sk and 
>       i_item_sk = ss_item_sk and
>       d_year = 2002 and
>       month between 1 and 2 and
>       product like ‘bicycle%’
> {code}
> The join with _store_ table can be removed since no columns are retrieved 
> from the table, and every row from the _store_sales_ fact table will find a 
> match in _store_ based on the RI relationship.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to