[ https://issues.apache.org/jira/browse/SPARK-38868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17523197#comment-17523197 ]
Bruce Robbins commented on SPARK-38868:
---------------------------------------

The issue is in {{EliminateOuterJoin}}. That rule checks whether it can convert the outer join to an inner join based on the where clause, which in this case is
{noformat}
where assert_true(outcome_value > 10) is null
{noformat}
So the rule evaluates that expression, forcing {{outcome_value}} to be {{null}}, to see whether the result of the expression is {{null}} or {{false}}. The rule never expects evaluation to throw a {{RuntimeException}}, which is what happens in this case. I think the easy fix is to add this to {{EliminateOuterJoin#canFilterOutNull}}:
{noformat}
if (boundE.exists(_.isInstanceOf[RaiseError])) return false
{noformat}

> `assert_true` fails unconditionally after `left_outer` joins
> ------------------------------------------------------------
>
>                 Key: SPARK-38868
>                 URL: https://issues.apache.org/jira/browse/SPARK-38868
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.1.1, 3.1.2, 3.2.0, 3.2.1
>            Reporter: Fabien Dubosson
>            Priority: Major
>
> When `assert_true` is used after a `left_outer` join, the assert exception is
> raised even though all the rows meet the condition. Using an `inner` join
> does not expose this issue.
>
> {code:java}
> from pyspark.sql import SparkSession
> from pyspark.sql import functions as sf
>
> session = SparkSession.builder.getOrCreate()
>
> entries = session.createDataFrame(
>     [
>         ("a", 1),
>         ("b", 2),
>         ("c", 3),
>     ],
>     ["id", "outcome_id"],
> )
>
> outcomes = session.createDataFrame(
>     [
>         (1, 12),
>         (2, 34),
>         (3, 32),
>     ],
>     ["outcome_id", "outcome_value"],
> )
>
> # Inner join works as expected
> (
>     entries.join(outcomes, on="outcome_id", how="inner")
>     .withColumn("valid", sf.assert_true(sf.col("outcome_value") > 10))
>     .filter(sf.col("valid").isNull())
>     .show()
> )
>
> # Left join fails with «'('outcome_value > 10)' is not true!» even though
> # all rows meet the condition
> (
>     entries.join(outcomes, on="outcome_id", how="left_outer")
>     .withColumn("valid", sf.assert_true(sf.col("outcome_value") > 10))
>     .filter(sf.col("valid").isNull())
>     .show()
> )
> {code}
> Reproduced on `pyspark` versions: `3.2.1`, `3.2.0`, `3.1.2` and `3.1.1`. I am
> not sure whether "native" Spark exposes this issue as well; I don't have
> the knowledge/setup to test that.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
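The null-intolerance check described in the comment above can be sketched in plain Python. This is an illustrative model only, not Spark's actual Scala code: the function name `can_filter_out_null` mirrors `EliminateOuterJoin#canFilterOutNull`, and the predicates stand in for Catalyst expressions, using `None` for SQL `null`. It shows why a predicate that raises at evaluation time (like `assert_true`, which maps to `RaiseError`) breaks the rule's assumption that evaluation always returns a value.

```python
def can_filter_out_null(predicate):
    """Model of the optimizer check: evaluate the where-clause predicate
    with the outer-side column forced to null (None). If the result is
    False or null, rows that the outer join would null-extend are filtered
    out anyway, so the outer join can be rewritten as an inner join."""
    result = predicate(None)  # force outcome_value to null
    return result is False or result is None

# A plain comparison is null-intolerant: in SQL semantics, null > 10 is null.
def plain(v):
    return (v > 10) if v is not None else None

print(can_filter_out_null(plain))  # True: safe to rewrite as inner join

# assert_true raises at evaluation time instead of returning null/False,
# which the rule did not anticipate before the proposed RaiseError guard.
def assert_true_is_null(v):
    cond = (v > 10) if v is not None else None
    if cond is not True:
        raise RuntimeError("'(outcome_value > 10)' is not true!")
    return None  # assert_true returns null when the condition holds

try:
    can_filter_out_null(assert_true_is_null)
except RuntimeError as e:
    print("rule evaluation failed:", e)
```

The proposed fix corresponds to bailing out (returning `False`) before evaluation whenever the bound expression contains a `RaiseError` node, so the optimizer never triggers the assertion while merely probing the predicate.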