[ https://issues.apache.org/jira/browse/SPARK-37270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17442433#comment-17442433 ]

Bruce Robbins commented on SPARK-37270:
---------------------------------------

I can reproduce locally. In 3.1, the above snippet returns a single row, but in 
3.2 and master, it returns nothing.

You don't need checkpoint or cache to reproduce: simply reading from a table 
will do it.
{noformat}
drop table if exists base1;

create table base1 stored as parquet as
select * from values (false) tbl(a);

select count(*) from (
  select case when a then 'I am not null' end condition
  from base1
) s
where s.condition is null;
{noformat}
In 3.1, the above query returns 1. In 3.2 and master, it returns 0.

The issue seems to be 
[here|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala#L680].
 Because {{elseValue}} is None, the code doesn't push the {{isnull}} into the 
else branch, which results in:
{noformat}
Filter CASE WHEN a#12 THEN isnull(I am not null) END
Relation default.base1[a#12] parquet
{noformat}
This gets simplified to:
{noformat}
Filter (a#12 AND isnull(I am not null))
Relation default.base1[a#12] parquet
{noformat}
This gets simplified to:
{noformat}
Filter false
Relation default.base1[a#12] parquet
{noformat}
And finally, this gets simplified to:
{noformat}
LocalRelation <empty>, [a#12]
{noformat}
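The chain above hinges on SQL's three-valued logic: with {{a = false}}, {{CASE WHEN a THEN isnull(...) END}} falls through to the implicit ELSE and evaluates to NULL, and a Filter treats NULL like false, so the row is dropped. A correct pushdown would also have rewritten the implicit else branch to {{isnull(null)}}, i.e. true. A minimal Python model of the difference (using {{None}} for SQL NULL; the helper names are illustrative, not Spark code):

```python
def case_when(cond, then_val, else_val=None):
    """Model SQL CASE WHEN: a missing ELSE branch yields NULL (None)."""
    return then_val if cond else else_val

def sql_filter(rows, predicate):
    """Model a SQL Filter: a row passes only when the predicate is true.
    Both NULL (None) and False reject the row."""
    return [r for r in rows if predicate(r) is True]

def isnull(v):
    return v is None

rows = [{"a": False}]  # the single row in base1

# Buggy rewrite: isnull pushed only into the THEN branch, no ELSE.
# With a = False the CASE yields NULL, so the row is filtered out.
buggy = sql_filter(rows, lambda r: case_when(r["a"], isnull("I am not null")))

# Correct rewrite: isnull pushed into the implicit NULL else branch as well,
# so the else branch becomes isnull(NULL) = true and the row survives.
correct = sql_filter(rows, lambda r: case_when(r["a"], isnull("I am not null"),
                                               else_val=isnull(None)))

print(len(buggy), len(correct))  # 0 1
```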

> Incorrect result of filter using isNull condition
> -------------------------------------------------
>
>                 Key: SPARK-37270
>                 URL: https://issues.apache.org/jira/browse/SPARK-37270
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 3.2.0
>            Reporter: Tomasz Kus
>            Priority: Major
>              Labels: correctness
>
> Simple code that reproduces this issue:
> {code:java}
> val frame = Seq((false, 1)).toDF("bool", "number")
> frame
>   .checkpoint()
>   .withColumn("conditions", when(col("bool"), "I am not null"))
>   .filter(col("conditions").isNull)
>   .show(false){code}
> Although the "conditions" column is null
> {code:java}
> +-----+------+----------+
> |bool |number|conditions|
> +-----+------+----------+
> |false|1     |null      |
> +-----+------+----------+{code}
> an empty result is shown.
> Execution plans:
> {code:java}
> == Parsed Logical Plan ==
> 'Filter isnull('conditions)
> +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END 
> AS conditions#252]
>    +- LogicalRDD [bool#124, number#125], false
> == Analyzed Logical Plan ==
> bool: boolean, number: int, conditions: string
> Filter isnull(conditions#252)
> +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END 
> AS conditions#252]
>    +- LogicalRDD [bool#124, number#125], false
> == Optimized Logical Plan ==
> LocalRelation <empty>, [bool#124, number#125, conditions#252]
> == Physical Plan ==
> LocalTableScan <empty>, [bool#124, number#125, conditions#252]
>  {code}
> After removing the checkpoint, the correct result is returned and the 
> execution plans are as follows:
> {code:java}
> == Parsed Logical Plan ==
> 'Filter isnull('conditions)
> +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END 
> AS conditions#256]
>    +- Project [_1#119 AS bool#124, _2#120 AS number#125]
>       +- LocalRelation [_1#119, _2#120]
> == Analyzed Logical Plan ==
> bool: boolean, number: int, conditions: string
> Filter isnull(conditions#256)
> +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END 
> AS conditions#256]
>    +- Project [_1#119 AS bool#124, _2#120 AS number#125]
>       +- LocalRelation [_1#119, _2#120]
> == Optimized Logical Plan ==
> LocalRelation [bool#124, number#125, conditions#256]
> == Physical Plan ==
> LocalTableScan [bool#124, number#125, conditions#256]
>  {code}
> The most important difference seems to be LogicalRDD -> LocalRelation.
> The following workarounds return the correct result:
> 1) remove the checkpoint
> 2) add an explicit .otherwise(null) to the when
> 3) add checkpoint() or cache() just before the filter
> 4) downgrade to Spark 3.1.2
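Workaround 2 works because an explicit {{.otherwise(null)}} makes the CASE expression's {{elseValue}} a {{Some}} rather than {{None}}, so the optimizer's pushdown rewrites both branches; semantically the query is unchanged, since {{CASE WHEN a THEN x END}} is equivalent to {{CASE WHEN a THEN x ELSE NULL END}}. A plain-Python sketch of the two pushdown paths (an illustrative model of the behavior described above, not the actual Catalyst code):

```python
def push_isnull_into_case(then_val, else_val, has_else):
    """Model pushing isnull(...) into a CASE expression's branches.

    Mirrors the buggy behavior: when there is no explicit ELSE
    (has_else=False), the implicit NULL else branch is left as-is
    instead of being rewritten to isnull(NULL) = true.
    """
    isnull = lambda v: v is None
    new_then = isnull(then_val)
    if has_else:
        new_else = isnull(else_val)  # ELSE branch rewritten too (correct)
    else:
        new_else = None              # bug: implicit else stays NULL
    return new_then, new_else

def eval_case(cond, then_val, else_val):
    """Evaluate the rewritten CASE for a given condition value."""
    return then_val if cond else else_val

# No .otherwise(null): elseValue is None, pushdown skips the else branch,
# and for a = false the filter predicate evaluates to NULL (row dropped).
t, e = push_isnull_into_case("I am not null", None, has_else=False)
print(eval_case(False, t, e))  # None

# Workaround 2: the explicit .otherwise(null) gives the pushdown an ELSE
# branch to rewrite, so the predicate evaluates to true (row kept).
t, e = push_isnull_into_case("I am not null", None, has_else=True)
print(eval_case(False, t, e))  # True
```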


