[ https://issues.apache.org/jira/browse/SPARK-18966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15767783#comment-15767783 ]
Nattavut Sutyanyong edited comment on SPARK-18966 at 12/21/16 6:34 PM: ----------------------------------------------------------------------- {code} == Analyzed Logical Plan == a1: int, b1: int Project [a1#684, b1#685] +- Filter NOT predicate-subquery#717 [(a1#684 = a2#695) && (b2#696 = b1#685)] : +- Project [a2#695, b2#696] : +- SubqueryAlias t2, `t2` : +- Project [_1#692 AS a2#695, _2#693 AS b2#696] : +- LocalRelation [_1#692, _2#693] +- SubqueryAlias t1, `t1` +- Project [_1#681 AS a1#684, _2#682 AS b1#685] +- LocalRelation [_1#681, _2#682] == Optimized Logical Plan == Project [_1#681 AS a1#684, _2#682 AS b1#685] +- Join LeftAnti, ((isnull((_1#681 = a2#695)) || isnull((b2#696 = _2#682))) || ((_1#681 = a2#695) && (b2#696 = _2#682))) :- LocalRelation [_1#681, _2#682] +- LocalRelation [a2#695, b2#696] {code} When the comparison between b1 and b2 is unknown, the subquery returns no row hence the result is all the rows from the parent. The problem in the plan above is the expression {{isnull((b2#696 = _2#682))}} is evaluated to isnull( unknown ) which is true — that results in the entire expression of the LeftAnti is true. By the definition of LeftAnti, the rows are matched and should be filtered. was (Author: nsyca): {code} == Analyzed Logical Plan == a1: int, b1: int Project [a1#684, b1#685] +- Filter NOT predicate-subquery#717 [(a1#684 = a2#695) && (b2#696 = b1#685)] : +- Project [a2#695, b2#696] : +- SubqueryAlias t2, `t2` : +- Project [_1#692 AS a2#695, _2#693 AS b2#696] : +- LocalRelation [_1#692, _2#693] +- SubqueryAlias t1, `t1` +- Project [_1#681 AS a1#684, _2#682 AS b1#685] +- LocalRelation [_1#681, _2#682] == Optimized Logical Plan == Project [_1#681 AS a1#684, _2#682 AS b1#685] +- Join LeftAnti, ((isnull((_1#681 = a2#695)) || isnull((b2#696 = _2#682))) || ((_1#681 = a2#695) && (b2#696 = _2#682))) :- LocalRelation [_1#681, _2#682] +- LocalRelation [a2#695, b2#696] {code} When the comparison between b1 and b2 is unknown, the subquery returns no row hence the result is all the rows from the parent. The problem in the plan above is the expression {{isnull((b2#696 = _2#682))}} is evaluated to isnull( unknown ) which is true — that results in the entire expression of the LeftAnti is true. By the definition of LeftAnti, the rows are matched and should be filtered. > NOT IN subquery with correlated expressions may return incorrect result > ----------------------------------------------------------------------- > > Key: SPARK-18966 > URL: https://issues.apache.org/jira/browse/SPARK-18966 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.0.0 > Reporter: Nattavut Sutyanyong > Labels: correctness > > {code} > Seq((1, 2)).toDF("a1", "b1").createOrReplaceTempView("t1") > Seq[(java.lang.Integer, java.lang.Integer)]((1, null)).toDF("a2", > "b2").createOrReplaceTempView("t2") > // The expected result is 1 row of (1,2) as shown in the next statement. > sql("select * from t1 where a1 not in (select a2 from t2 where b2 = b1)").show > +---+---+ > | a1| b1| > +---+---+ > +---+---+ > sql("select * from t1 where a1 not in (select a2 from t2 where b2 = 2)").show > +---+---+ > | a1| b1| > +---+---+ > | 1| 2| > +---+---+ > {code} > The two SQL statements above should return the same result. -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org