Ioana Delaney created SPARK-23753:
-------------------------------------

             Summary: [Performance] Group By Push Down through Join
                 Key: SPARK-23753
                 URL: https://issues.apache.org/jira/browse/SPARK-23753
             Project: Spark
          Issue Type: Sub-task
          Components: SQL
    Affects Versions: 3.0.0
            Reporter: Ioana Delaney
*Group By push down through Join*

Another transformation that benefits from RI constraints is group-by push down through joins. The transformation interchanges the order of the group-by and join operations. The benefit of pushing down a group-by is that it may reduce the number of input rows to the join. On the other hand, if the join is very selective, it may be better to execute the group-by after the join. That is why this transformation is, in general, applied based on cost or selectivity estimates. However, if the join is an RI join, then under certain conditions it is safe to push the group-by operation below the join. An example is shown below.

{code}
select c_customer_sk, c_first_name, c_last_name, s_store_sk, s_store_name,
       sum(ss.ss_quantity) as store_sales_quantity
from store_sales ss, date_dim, customer, store
where d_date_sk = ss_sold_date_sk and
      c_customer_sk = ss_customer_sk and
      s_store_sk = ss_store_sk and
      d_year between 2000 and 2002
group by c_customer_sk, c_first_name, c_last_name, s_store_sk, s_store_name
{code}

The query computes the quantities sold, grouped by columns of the _customer_ and _store_ tables. The tables are joined in a _star schema_. The grouping columns are a superset of the join keys, and the aggregated columns come from the fact table _store_sales_. The group-by can therefore be pushed down to the fact table _store_sales_, below its joins with the _customer_ and _store_ tables. The join affects neither the groups nor the aggregates computed by the pushed-down group-by, since every tuple in _store_sales_ joins with exactly one tuple in each of _customer_ and _store_.

{code}
select c_customer_sk, c_first_name, c_last_name, s_store_sk, s_store_name,
       v1.store_sales_quantity
from customer, store,
     (select ss_customer_sk, ss_store_sk, sum(ss_quantity) as store_sales_quantity
      from store_sales, date_dim
      where d_date_sk = ss_sold_date_sk and
            d_year between 2000 and 2002
      group by ss_customer_sk, ss_store_sk) v1
where c_customer_sk = v1.ss_customer_sk and
      s_store_sk = v1.ss_store_sk
{code}

When the query is run on a 1TB TPC-DS setup, the pushed-down group-by reduces the number of rows flowing into the join from 1.5 billion to 100 million, and query execution time drops from about 70 seconds to about 30 seconds, roughly a 2x improvement.
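For illustration only, and not part of the proposed optimizer rule: the rewritten plan above can be produced by hand today with the DataFrame API. This is a minimal Scala sketch; it assumes a SparkSession named `spark`, the TPC-DS tables registered in the catalog, and the same RI property (every _store_sales_ foreign key matches exactly one dimension row) that makes the push down safe.

{code}
// Manual group-by push down: aggregate the fact table first, then join the
// pre-aggregated result v1 with the dimension tables.
import org.apache.spark.sql.functions.sum
import spark.implicits._ // for the $"col" column syntax

// Pushed-down group-by over the fact table, restricted by date_dim.
val v1 = spark.table("store_sales")
  .join(spark.table("date_dim"), $"d_date_sk" === $"ss_sold_date_sk")
  .where($"d_year".between(2000, 2002))
  .groupBy($"ss_customer_sk", $"ss_store_sk")
  .agg(sum($"ss_quantity").as("store_sales_quantity"))

// RI joins with the dimension tables preserve the groups and aggregates.
val result = spark.table("customer")
  .join(v1, $"c_customer_sk" === $"ss_customer_sk")
  .join(spark.table("store"), $"s_store_sk" === $"ss_store_sk")
  .select($"c_customer_sk", $"c_first_name", $"c_last_name",
          $"s_store_sk", $"s_store_name", $"store_sales_quantity")
{code}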