[ https://issues.apache.org/jira/browse/SPARK-37270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17442433#comment-17442433 ]
Bruce Robbins commented on SPARK-37270:
---------------------------------------

I can reproduce this locally. In 3.1, the snippet above returns a single row, but on 3.2 and master it returns nothing. You don't need checkpoint or cache to reproduce it; simply reading from a table will do:
{noformat}
drop table if exists base1;
create table base1 stored as parquet as select * from values (false) tbl(a);

select count(*) from (
  select case when a then 'I am not null' end condition
  from base1
) s
where s.condition is null;
{noformat}
In 3.1, the above query returns 1. In 3.2 and master, it returns 0.

The issue seems to be [here|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala#L680]. Because {{elseValue}} is None, the code doesn't push the {{isnull}} into the else branch, which results in:
{noformat}
Filter CASE WHEN a#12 THEN isnull(I am not null) END
Relation default.base1[a#12] parquet
{noformat}
This gets simplified to:
{noformat}
Filter (a#12 AND isnull(I am not null))
Relation default.base1[a#12] parquet
{noformat}
which in turn gets simplified to:
{noformat}
Filter false
Relation default.base1[a#12] parquet
{noformat}
And finally, this gets simplified to:
{noformat}
LocalRelation <empty>, [a#12]
{noformat}

> Incorrect result of filter using isNull condition
> -------------------------------------------------
>
>                 Key: SPARK-37270
>                 URL: https://issues.apache.org/jira/browse/SPARK-37270
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 3.2.0
>            Reporter: Tomasz Kus
>            Priority: Major
>              Labels: correctness
>
> Simple code that reproduces this issue:
> {code:java}
> val frame = Seq((false, 1)).toDF("bool", "number")
> frame
>   .checkpoint()
>   .withColumn("conditions", when(col("bool"), "I am not null"))
>   .filter(col("conditions").isNull)
>   .show(false)
> {code}
> Although the "conditions" column is null:
> {code:java}
> +-----+------+----------+
> |bool |number|conditions|
> +-----+------+----------+
> |false|1     |null      |
> +-----+------+----------+
> {code}
> an empty result is shown. Execution plans:
> {code:java}
> == Parsed Logical Plan ==
> 'Filter isnull('conditions)
> +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END AS conditions#252]
>    +- LogicalRDD [bool#124, number#125], false
>
> == Analyzed Logical Plan ==
> bool: boolean, number: int, conditions: string
> Filter isnull(conditions#252)
> +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END AS conditions#252]
>    +- LogicalRDD [bool#124, number#125], false
>
> == Optimized Logical Plan ==
> LocalRelation <empty>, [bool#124, number#125, conditions#252]
>
> == Physical Plan ==
> LocalTableScan <empty>, [bool#124, number#125, conditions#252]
> {code}
> After removing the checkpoint, the correct result is returned and the execution plans are as follows:
> {code:java}
> == Parsed Logical Plan ==
> 'Filter isnull('conditions)
> +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END AS conditions#256]
>    +- Project [_1#119 AS bool#124, _2#120 AS number#125]
>       +- LocalRelation [_1#119, _2#120]
>
> == Analyzed Logical Plan ==
> bool: boolean, number: int, conditions: string
> Filter isnull(conditions#256)
> +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END AS conditions#256]
>    +- Project [_1#119 AS bool#124, _2#120 AS number#125]
>       +- LocalRelation [_1#119, _2#120]
>
> == Optimized Logical Plan ==
> LocalRelation [bool#124, number#125, conditions#256]
>
> == Physical Plan ==
> LocalTableScan [bool#124, number#125, conditions#256]
> {code}
> The most important difference seems to be LogicalRDD -> LocalRelation.
> There are the following workarounds to retrieve the correct result:
> 1) remove the checkpoint
> 2) add an explicit .otherwise(null) to the when
> 3) add checkpoint() or cache() just before the filter
> 4) downgrade to Spark 3.1.2

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
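To see why workaround 2 (an explicit {{.otherwise(null)}}) helps, here is a minimal, self-contained Scala sketch of the rewrite described in the comment above. This is not Spark code; all names are illustrative. It models a single-branch CASE WHEN where {{Option[String]}} stands in for a nullable SQL value ({{None}} = NULL): a correct push-down of {{isnull}} must treat the missing ELSE branch as {{isnull(NULL) = true}}, whereas skipping the ELSE branch (as in the buggy plan {{CASE WHEN a THEN isnull(...) END}}, later simplified to {{a AND isnull(...)}}) silently drops rows where the condition is false.

{code:java}
// Hypothetical model of CASE WHEN cond THEN thenValue [ELSE elseValue] END.
// elseValue = None models a CASE with no ELSE branch at all.
case class CaseWhen(cond: Boolean,
                    thenValue: Option[String],
                    elseValue: Option[Option[String]])

// SQL semantics: when no branch matches and there is no ELSE, the result is NULL.
def eval(c: CaseWhen): Option[String] =
  if (c.cond) c.thenValue else c.elseValue.getOrElse(None)

// Correct push-down of isnull into the branches: the implicit ELSE is NULL,
// so its branch must contribute isnull(NULL) = true.
def isNullCorrect(c: CaseWhen): Boolean =
  if (c.cond) c.thenValue.isEmpty else c.elseValue.getOrElse(None).isEmpty

// The problematic shape: the null test is pushed only into the THEN branch,
// so the whole predicate reduces to (cond AND isnull(thenValue)).
def isNullBuggy(c: CaseWhen): Boolean =
  c.cond && c.thenValue.isEmpty

// The row from the report: bool = false, so the CASE yields NULL.
val row = CaseWhen(cond = false, thenValue = Some("I am not null"), elseValue = None)
println(eval(row))          // None: the "conditions" column really is NULL
println(isNullCorrect(row)) // true: the row should survive the isNull filter
println(isNullBuggy(row))   // false: the row is incorrectly filtered out
{code}

With an explicit {{.otherwise(null)}}, {{elseValue}} is present, so even the buggy code path has an ELSE branch to push the {{isnull}} into, which is consistent with workaround 2 above.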