Tomasz Kus created SPARK-37270: ---------------------------------- Summary: Incorect result of filter using isNull condition Key: SPARK-37270 URL: https://issues.apache.org/jira/browse/SPARK-37270 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.2.0 Reporter: Tomasz Kus
Simple code that allows to reproduce this issue: {code:java} val frame = Seq((false, 1)).toDF("bool", "number") frame .checkpoint() .withColumn("conditions", when(col("bool"), "I am not null")) .filter(col("conditions").isNull) .show(false){code} Although "conditions" column is null {code:java} +-----+------+----------+ |bool |number|conditions| +-----+------+----------+ |false|1 |null | +-----+------+----------+{code} empty result is shown. Execution plans: {code:java} == Parsed Logical Plan == 'Filter isnull('conditions) +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END AS conditions#252] +- LogicalRDD [bool#124, number#125], false == Analyzed Logical Plan == bool: boolean, number: int, conditions: string Filter isnull(conditions#252) +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END AS conditions#252] +- LogicalRDD [bool#124, number#125], false == Optimized Logical Plan == LocalRelation <empty>, [bool#124, number#125, conditions#252] == Physical Plan == LocalTableScan <empty>, [bool#124, number#125, conditions#252] {code} After removing checkpoint proper result is returned and execution plans are as follow: {code:java} == Parsed Logical Plan == 'Filter isnull('conditions) +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END AS conditions#256] +- Project [_1#119 AS bool#124, _2#120 AS number#125] +- LocalRelation [_1#119, _2#120] == Analyzed Logical Plan == bool: boolean, number: int, conditions: string Filter isnull(conditions#256) +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END AS conditions#256] +- Project [_1#119 AS bool#124, _2#120 AS number#125] +- LocalRelation [_1#119, _2#120] == Optimized Logical Plan == LocalRelation [bool#124, number#125, conditions#256] == Physical Plan == LocalTableScan [bool#124, number#125, conditions#256] {code} It seems that the most important difference is LogicalRDD -> LocalRelation There are following ways (workarounds) to retrieve correct result: 1) remove checkpoint 2) add explicit .otherwise(null) to when 3) add checkpoint() or cache() just before filter 4) downgrade to Spark 3.1.2 -- This message was sent by Atlassian Jira (v8.20.1#820001) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org