[ https://issues.apache.org/jira/browse/SPARK-41162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648751#comment-17648751 ]
Shardul Mahadik edited comment on SPARK-41162 at 12/16/22 6:27 PM:
-------------------------------------------------------------------

[~cloud_fan] Can you help take a look at this? This is a correctness issue and affects not just master but also 3.1+ if I am not wrong. We hit this issue in production with one of our user jobs.


was (Author: shardulm):
[~cloud_fan] Can you help take a look at this? This is a correctness issue affects not just master but also 3.1+ if I am not wrong. We hit this issue in production with one of our user jobs.

> Anti-join must not be pushed below aggregation with ambiguous predicates
> ------------------------------------------------------------------------
>
>                 Key: SPARK-41162
>                 URL: https://issues.apache.org/jira/browse/SPARK-41162
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.4.0
>            Reporter: Enrico Minack
>            Priority: Major
>
> The following query should return a single row as all values for {{id}}
> except for the largest will be eliminated by the anti-join:
> {code}
> val ids = Seq(1, 2, 3).toDF("id").distinct()
> val result = ids.withColumn("id", $"id" + 1).join(ids, "id", "left_anti").collect()
> assert(result.length == 1)
> {code}
> Without the {{distinct()}}, the assertion is true. With {{distinct()}}, the
> assertion should still hold but is false.
> Rule {{PushDownLeftSemiAntiJoin}} pushes the {{Join}} below the left
> {{Aggregate}} with join condition {{(id#750 + 1) = id#750}}, which can never
> be true.
> {code}
> === Applying Rule org.apache.spark.sql.catalyst.optimizer.PushDownLeftSemiAntiJoin ===
> !Join LeftAnti, (id#752 = id#750)                  'Aggregate [id#750], [(id#750 + 1) AS id#752]
> !:- Aggregate [id#750], [(id#750 + 1) AS id#752]   +- 'Join LeftAnti, ((id#750 + 1) = id#750)
> !:  +- LocalRelation [id#750]                         :- LocalRelation [id#750]
> !+- Aggregate [id#750], [id#750]                      +- Aggregate [id#750], [id#750]
> !   +- LocalRelation [id#750]                            +- LocalRelation [id#750]
> {code}
> The optimizer then rightly removes the left-anti join altogether, returning
> the left child only.
> Rule {{PushDownLeftSemiAntiJoin}} should not push down predicates that
> reference left *and* right child.
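
To make the ambiguity in the quoted plan concrete, here is a minimal sketch in plain Scala (no Spark dependency). The {{Expr}} classes, {{substitute}} and {{canPushDown}} are hypothetical simplifications made up for illustration; they are not Catalyst's classes and not necessarily how the rule should be fixed. The sketch only assumes what the plan above shows: the alias {{id#752}} is replaced by {{(id#750 + 1)}}, and the right child also outputs {{id#750}}.

```scala
// Simplified stand-ins for Catalyst expressions (hypothetical, not Spark's API).
sealed trait Expr
final case class Attr(name: String, exprId: Int) extends Expr
final case class Lit(value: Int) extends Expr
final case class Add(left: Expr, right: Expr) extends Expr
final case class EqualTo(left: Expr, right: Expr) extends Expr

object PushDownSketch {
  // Expression IDs of all attributes referenced by an expression.
  def references(e: Expr): Set[Int] = e match {
    case Attr(_, id)   => Set(id)
    case Lit(_)        => Set.empty[Int]
    case Add(l, r)     => references(l) ++ references(r)
    case EqualTo(l, r) => references(l) ++ references(r)
  }

  // Replace aggregate output aliases by the expressions they stand for,
  // which is what pushing the join below the Aggregate has to do.
  def substitute(e: Expr, aliases: Map[Int, Expr]): Expr = e match {
    case a @ Attr(_, id) => aliases.getOrElse(id, a)
    case l: Lit          => l
    case Add(l, r)       => Add(substitute(l, aliases), substitute(r, aliases))
    case EqualTo(l, r)   => EqualTo(substitute(l, aliases), substitute(r, aliases))
  }

  // Hypothetical guard: refuse the push-down when substitution would introduce
  // left-side attributes whose IDs also appear in the right child's output.
  def canPushDown(condition: Expr, aliases: Map[Int, Expr], rightOutput: Set[Int]): Boolean = {
    val introduced: Set[Int] =
      references(condition).flatMap(id => aliases.get(id)).flatMap(e => references(e))
    introduced.intersect(rightOutput).isEmpty
  }

  // Values taken from the plan in the issue: id#752 is an alias for (id#750 + 1).
  val aliases: Map[Int, Expr] = Map(752 -> Add(Attr("id", 750), Lit(1)))
  val joinCondition: Expr = EqualTo(Attr("id", 752), Attr("id", 750))
}
```

In this model, {{substitute(joinCondition, aliases)}} yields {{(id#750 + 1) = id#750}}, where both sides now carry the same expression ID although they came from different children. {{canPushDown(joinCondition, aliases, Set(750))}} therefore rejects the push-down, while the same condition against a right child with a distinct ID (say 760, an invented value) would be allowed.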