Tomasz Kus created SPARK-37270:
----------------------------------

             Summary: Incorrect result of filter using isNull condition
                 Key: SPARK-37270
                 URL: https://issues.apache.org/jira/browse/SPARK-37270
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 3.2.0
            Reporter: Tomasz Kus


Simple code that reproduces this issue:
{code:java}
val frame = Seq((false, 1)).toDF("bool", "number")
frame
  .checkpoint()
  .withColumn("conditions", when(col("bool"), "I am not null"))
  .filter(col("conditions").isNull)
  .show(false){code}
Although the "conditions" column is null:
{code:java}
+-----+------+----------+
|bool |number|conditions|
+-----+------+----------+
|false|1     |null      |
+-----+------+----------+{code}
an empty result is returned.
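For context (assuming the standard org.apache.spark.sql.functions API), when without an otherwise is a CASE WHEN with no ELSE, which evaluates to null for unmatched rows, so the null above is the expected value:

```scala
import org.apache.spark.sql.functions.expr

// when(col("bool"), "I am not null") with no otherwise() is equivalent to
// the SQL expression below; CASE WHEN without an ELSE yields NULL whenever
// the condition is false, which is why "conditions" is null for this row.
frame
  .select(expr("CASE WHEN bool THEN 'I am not null' END").as("conditions"))
  .show(false)
```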

Execution plans:
{code:java}
== Parsed Logical Plan ==
'Filter isnull('conditions)
+- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END AS conditions#252]
   +- LogicalRDD [bool#124, number#125], false

== Analyzed Logical Plan ==
bool: boolean, number: int, conditions: string
Filter isnull(conditions#252)
+- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END AS conditions#252]
   +- LogicalRDD [bool#124, number#125], false

== Optimized Logical Plan ==
LocalRelation <empty>, [bool#124, number#125, conditions#252]

== Physical Plan ==
LocalTableScan <empty>, [bool#124, number#125, conditions#252]
 {code}
After removing the checkpoint, the correct result is returned and the execution plans are as follows:
{code:java}
== Parsed Logical Plan ==
'Filter isnull('conditions)
+- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END AS conditions#256]
   +- Project [_1#119 AS bool#124, _2#120 AS number#125]
      +- LocalRelation [_1#119, _2#120]

== Analyzed Logical Plan ==
bool: boolean, number: int, conditions: string
Filter isnull(conditions#256)
+- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END AS conditions#256]
   +- Project [_1#119 AS bool#124, _2#120 AS number#125]
      +- LocalRelation [_1#119, _2#120]

== Optimized Logical Plan ==
LocalRelation [bool#124, number#125, conditions#256]

== Physical Plan ==
LocalTableScan [bool#124, number#125, conditions#256]
 {code}
The most important difference seems to be the leaf node: LogicalRDD with the checkpoint vs. LocalRelation without it. With the checkpoint, the optimizer incorrectly folds the whole plan into an empty LocalRelation.

The following workarounds return the correct result:

1) remove the checkpoint

2) add an explicit .otherwise(null) to the when expression

3) add checkpoint() or cache() just before the filter

4) downgrade to Spark 3.1.2
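As an illustration of workaround 2, here is a sketch against the reproduction code above (not re-verified on every 3.2.x build): spelling out the null branch keeps the filter working after the checkpoint.

```scala
import org.apache.spark.sql.functions.{col, when, lit}

// Workaround 2: make the ELSE branch explicit. The resulting CASE WHEN
// carries an explicit NULL instead of an implicit one, and the isNull
// filter is no longer optimized away after the checkpoint.
val fixed = frame
  .checkpoint()
  .withColumn("conditions", when(col("bool"), "I am not null").otherwise(lit(null)))
  .filter(col("conditions").isNull)

fixed.show(false)  // per the report, this now shows the row: false, 1, null
```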



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
