[ 
https://issues.apache.org/jira/browse/SPARK-18966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15767783#comment-15767783
 ] 

Nattavut Sutyanyong commented on SPARK-18966:
---------------------------------------------

{code}
== Analyzed Logical Plan ==
a1: int, b1: int
Project [a1#684, b1#685]
+- Filter NOT predicate-subquery#717 [(a1#684 = a2#695) && (b2#696 = b1#685)]
   :  +- Project [a2#695, b2#696]
   :     +- SubqueryAlias t2, `t2`
   :        +- Project [_1#692 AS a2#695, _2#693 AS b2#696]
   :           +- LocalRelation [_1#692, _2#693]
   +- SubqueryAlias t1, `t1`
      +- Project [_1#681 AS a1#684, _2#682 AS b1#685]
         +- LocalRelation [_1#681, _2#682]

== Optimized Logical Plan ==
Project [_1#681 AS a1#684, _2#682 AS b1#685]
+- Join LeftAnti, ((isnull((_1#681 = a2#695)) || isnull((b2#696 = _2#682))) || 
((_1#681 = a2#695) && (b2#696 = _2#682)))
   :- LocalRelation [_1#681, _2#682]
   +- LocalRelation [a2#695, b2#696]
{code}

When the comparison between b1 and b2 is unknown, the subquery returns no row 
hence the result is all the rows from the parent. The problem in the plan above 
is the expression
isnull((b2#696 = _2#682)) is evaluated to isnull( unknown ) which is true — 
that results in the entire expression of the LeftAnti is true. By the 
definition of LeftAnti, the rows are matched and should be filtered.

> NOT IN subquery with correlated expressions may return incorrect result
> -----------------------------------------------------------------------
>
>                 Key: SPARK-18966
>                 URL: https://issues.apache.org/jira/browse/SPARK-18966
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Nattavut Sutyanyong
>              Labels: correctness
>
> {code}
> Seq((1, 2)).toDF("a1", "b1").createOrReplaceTempView("t1")
> Seq[(java.lang.Integer, java.lang.Integer)]((1, null)).toDF("a2", 
> "b2").createOrReplaceTempView("t2")
> // The expected result is 1 row of (1,2) as shown in the next statement.
> sql("select * from t1 where a1 not in (select a2 from t2 where b2 = b1)").show
> +---+---+
> | a1| b1|
> +---+---+
> +---+---+
> sql("select * from t1 where a1 not in (select a2 from t2 where b2 = 2)").show
> +---+---+
> | a1| b1|
> +---+---+
> |  1|  2|
> +---+---+
> {code}
> The two SQL statements above should return the same result.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to