[ https://issues.apache.org/jira/browse/SPARK-41416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-41416: ------------------------------------ Assignee: (was: Apache Spark) > Rewrite self join in in predicate to aggregate > ---------------------------------------------- > > Key: SPARK-41416 > URL: https://issues.apache.org/jira/browse/SPARK-41416 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 3.4.0 > Reporter: Wan Kun > Priority: Major > > Transforms the SelfJoin resulting in duplicate rows used for IN predicate to > aggregation. > For IN predicate, duplicate rows does not have any value. It will be overhead. > Ex: TPCDS Q95: following CTE is used only in IN predicates for only one > column comparison ({@code ws_order_number}). > This results in exponential increase in Joined rows with too many duplicate > rows. > {code:java} > WITH ws_wh AS > ( > SELECT ws1.ws_order_number, > ws1.ws_warehouse_sk wh1, > ws2.ws_warehouse_sk wh2 > FROM web_sales ws1, > web_sales ws2 > WHERE ws1.ws_order_number = ws2.ws_order_number > AND ws1.ws_warehouse_sk <> ws2.ws_warehouse_sk) > {code} > Could be optimized as below: > {code:java} > WITH ws_wh AS > (SELECT ws_order_number > FROM web_sales > GROUP BY ws_order_number > HAVING COUNT(DISTINCT ws_warehouse_sk) > 1) > {code} > Optimized CTE scans table only once and results in unique rows. -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org