[ https://issues.apache.org/jira/browse/SPARK-38868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17523197#comment-17523197 ]

Bruce Robbins commented on SPARK-38868:
---------------------------------------

The issue is in {{EliminateOuterJoin}}. That rule checks whether it can 
convert the outer join to an inner join based on the where clause, which in 
this case is
{noformat}
where assert_true(outcome_value > 10) is null
{noformat}
To decide, the rule evaluates that expression with {{outcome_value}} forced to 
{{null}}: if the result is {{null}} or {{false}}, the filter would discard the 
null-extended rows anyway, so the conversion is safe. The rule never expects 
that evaluation to throw, but here it raises a {{RuntimeException}}, because 
{{null > 10}} evaluates to {{null}} and {{assert_true(null)}} fails.
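To make the failure mode concrete, here is a minimal plain-Scala model of that 
probe (my own sketch, not actual Catalyst code; {{gt}} and {{assertTrue}} just 
mimic the three-valued SQL semantics):
{noformat}
// Three-valued comparison: null > 10 evaluates to null, modeled as None.
def gt(a: Option[Int], b: Int): Option[Boolean] = a.map(_ > b)

// assert_true returns NULL on success and throws otherwise,
// including when its input is null.
def assertTrue(v: Option[Boolean]): Option[Boolean] = v match {
  case Some(true) => None
  case _ => throw new RuntimeException("'(outcome_value > 10)' is not true!")
}

// The rule's probe: bind outcome_value to null and evaluate, expecting
// null or false. Instead, the evaluation throws.
assertTrue(gt(None, 10))
{noformat}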

I think the easy fix is to add this check to {{EliminateOuterJoin#canFilterOutNull}}:
{noformat}
if (boundE.exists(_.isInstanceOf[RaiseError])) return false
{noformat}
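For context, here is roughly where that guard would sit. This is a sketch 
based on my reading of {{canFilterOutNull}} in the optimizer's joins.scala, so 
the surrounding lines are approximate:
{noformat}
private def canFilterOutNull(e: Expression): Boolean = {
  if (!e.deterministic || SubqueryExpression.hasCorrelatedSubquery(e)) return false
  val attributes = e.references.toSeq
  val emptyRow = new GenericInternalRow(attributes.length)
  val boundE = BindReferences.bindReference(e, attributes)
  if (boundE.exists(_.isInstanceOf[Unevaluable])) return false
  // Proposed guard: assert_true and raise_error both rewrite to RaiseError,
  // whose eval throws instead of returning null or false, so skip the probe.
  if (boundE.exists(_.isInstanceOf[RaiseError])) return false
  val v = boundE.eval(emptyRow)
  v == null || v == false
}
{noformat}
This also covers a bare {{raise_error}} in the where clause, since 
{{assert_true}} is rewritten into an {{If}} around {{RaiseError}} during 
analysis.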

> `assert_true` fails unconditionally after `left_outer` joins
> ------------------------------------------------------------
>
>                 Key: SPARK-38868
>                 URL: https://issues.apache.org/jira/browse/SPARK-38868
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.1.1, 3.1.2, 3.2.0, 3.2.1
>            Reporter: Fabien Dubosson
>            Priority: Major
>
> When `assert_true` is used after a `left_outer` join, the assertion 
> exception is raised even though all the rows meet the condition. Using an 
> `inner` join does not expose this issue.
>  
> {code:python}
> from pyspark.sql import SparkSession
> from pyspark.sql import functions as sf
> session = SparkSession.builder.getOrCreate()
> entries = session.createDataFrame(
>     [
>         ("a", 1),
>         ("b", 2),
>         ("c", 3),
>     ],
>     ["id", "outcome_id"],
> )
> outcomes = session.createDataFrame(
>     [
>         (1, 12),
>         (2, 34),
>         (3, 32),
>     ],
>     ["outcome_id", "outcome_value"],
> )
> # Inner join works as expected
> (
>     entries.join(outcomes, on="outcome_id", how="inner")
>     .withColumn("valid", sf.assert_true(sf.col("outcome_value") > 10))
>     .filter(sf.col("valid").isNull())
>     .show()
> )
> # Left join fails with «'('outcome_value > 10)' is not true!» even though
> # the condition holds for every row
> (
>     entries.join(outcomes, on="outcome_id", how="left_outer")
>     .withColumn("valid", sf.assert_true(sf.col("outcome_value") > 10))
>     .filter(sf.col("valid").isNull())
>     .show()
> ){code}
> Reproduced on `pyspark` versions `3.2.1`, `3.2.0`, `3.1.2`, and `3.1.1`. I am 
> not sure whether "native" Spark exposes this issue as well; I don't have the 
> knowledge/setup to test that.


