[jira] [Updated] (SPARK-37270) Incorrect result of filter using isNull condition

2021-11-10 Thread Tomasz Kus (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomasz Kus updated SPARK-37270:
---
Component/s: SQL

> Incorrect result of filter using isNull condition
> 
>
> Key: SPARK-37270
> URL: https://issues.apache.org/jira/browse/SPARK-37270
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: Tomasz Kus
>Priority: Major
>
> Simple code that reproduces this issue:
> {code:scala}
> val frame = Seq((false, 1)).toDF("bool", "number")
> frame
>   .checkpoint()
>   .withColumn("conditions", when(col("bool"), "I am not null"))
>   .filter(col("conditions").isNull)
>   .show(false)
> {code}
> Although the "conditions" column is null:
> {code:java}
> +-----+------+----------+
> |bool |number|conditions|
> +-----+------+----------+
> |false|1     |null      |
> +-----+------+----------+
> {code}
> an empty result is returned by the filter.
> Execution plans with the checkpoint:
> {code:java}
> == Parsed Logical Plan ==
> 'Filter isnull('conditions)
> +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END AS conditions#252]
>    +- LogicalRDD [bool#124, number#125], false
> 
> == Analyzed Logical Plan ==
> bool: boolean, number: int, conditions: string
> Filter isnull(conditions#252)
> +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END AS conditions#252]
>    +- LogicalRDD [bool#124, number#125], false
> 
> == Optimized Logical Plan ==
> LocalRelation <empty>, [bool#124, number#125, conditions#252]
> 
> == Physical Plan ==
> LocalTableScan <empty>, [bool#124, number#125, conditions#252]
> {code}
> After removing the checkpoint, the correct result is returned and the execution plans are as follows:
> {code:java}
> == Parsed Logical Plan ==
> 'Filter isnull('conditions)
> +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END AS conditions#256]
>    +- Project [_1#119 AS bool#124, _2#120 AS number#125]
>       +- LocalRelation [_1#119, _2#120]
> 
> == Analyzed Logical Plan ==
> bool: boolean, number: int, conditions: string
> Filter isnull(conditions#256)
> +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END AS conditions#256]
>    +- Project [_1#119 AS bool#124, _2#120 AS number#125]
>       +- LocalRelation [_1#119, _2#120]
> 
> == Optimized Logical Plan ==
> LocalRelation [bool#124, number#125, conditions#256]
> 
> == Physical Plan ==
> LocalTableScan [bool#124, number#125, conditions#256]
> {code}
> The most significant difference seems to be the leaf node, LogicalRDD vs. LocalRelation: with the checkpoint, the Optimized Logical Plan collapses to an empty LocalRelation, so the row is dropped before the filter ever runs.
> The following workarounds retrieve the correct result:
> 1) remove the checkpoint
> 2) add an explicit .otherwise(null) to the when expression
> 3) add checkpoint() or cache() just before the filter
> 4) downgrade to Spark 3.1.2
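Workaround 2 above can be sketched as a self-contained job. This is a minimal sketch, not the reporter's exact setup: the local master, app name, checkpoint directory, and the use of `lit(null)` for the explicit null branch are assumptions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit, when}

object Spark37270Workaround {
  def main(args: Array[String]): Unit = {
    // Assumed local setup; checkpoint() requires a checkpoint directory to be set.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("SPARK-37270-workaround")
      .getOrCreate()
    spark.sparkContext.setCheckpointDir(
      java.nio.file.Files.createTempDirectory("ckpt").toString)
    import spark.implicits._

    val frame = Seq((false, 1)).toDF("bool", "number")

    // Workaround 2: spell out the null branch with .otherwise(lit(null))
    // instead of relying on the implicit NULL of a bare when().
    frame
      .checkpoint()
      .withColumn("conditions",
        when(col("bool"), "I am not null").otherwise(lit(null)))
      .filter(col("conditions").isNull)
      .show(false) // per the report, the row [false, 1, null] now survives the filter

    spark.stop()
  }
}
```

With the explicit `.otherwise(lit(null))`, the filter returns the expected single row even after the checkpoint; without it, the checkpointed plan is optimized to an empty LocalRelation as shown in the plans above.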



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37270) Incorrect result of filter using isNull condition

2021-11-10 Thread Tomasz Kus (Jira)
Tomasz Kus created SPARK-37270:
--

 Summary: Incorrect result of filter using isNull condition
 Key: SPARK-37270
 URL: https://issues.apache.org/jira/browse/SPARK-37270
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.2.0
Reporter: Tomasz Kus

