[ https://issues.apache.org/jira/browse/SPARK-26078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16689368#comment-16689368 ]
Apache Spark commented on SPARK-26078: -------------------------------------- User 'mgaido91' has created a pull request for this issue: https://github.com/apache/spark/pull/23057 > WHERE .. IN fails to filter rows when used in combination with UNION > -------------------------------------------------------------------- > > Key: SPARK-26078 > URL: https://issues.apache.org/jira/browse/SPARK-26078 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.3.1, 2.4.0 > Reporter: Arttu Voutilainen > Priority: Blocker > Labels: correctness > > Hey, > We encountered a case where Spark SQL does not seem to handle WHERE .. IN > correctly, when used in combination with UNION, but instead returns also rows > that do not fulfill the condition. Swapping the order of the datasets in the > UNION makes the problem go away. Repro below: > > {code} > sql = SQLContext(sc) > a = spark.createDataFrame([{'id': 'a', 'num': 2}, {'id':'b', 'num':1}]) > b = spark.createDataFrame([{'id': 'a', 'num': 2}, {'id':'b', 'num':1}]) > a.registerTempTable('a') > b.registerTempTable('b') > bug = sql.sql(""" > SELECT id,num,source FROM > ( > SELECT id, num, 'a' as source FROM a > UNION ALL > SELECT id, num, 'b' as source FROM b > ) AS c > WHERE c.id IN (SELECT id FROM b WHERE num = 2) > """) > no_bug = sql.sql(""" > SELECT id,num,source FROM > ( > SELECT id, num, 'b' as source FROM b > UNION ALL > SELECT id, num, 'a' as source FROM a > ) AS c > WHERE c.id IN (SELECT id FROM b WHERE num = 2) > """) > bug.show() > no_bug.show() > bug.explain(True) > no_bug.explain(True) > {code} > This results in one extra row in the "bug" DF coming from DF "b", that should > not be there as it > {code:java} > >>> bug.show() > +---+---+------+ > | id|num|source| > +---+---+------+ > | a| 2| a| > | a| 2| b| > | b| 1| b| > +---+---+------+ > >>> no_bug.show() > +---+---+------+ > | id|num|source| > +---+---+------+ > | a| 2| b| > | a| 2| a| > +---+---+------+ > {code} > The reason can be seen in the query plans: > {code:java} > >>> bug.explain(True) > ... > == Optimized Logical Plan == > Union > :- Project [id#0, num#1L, a AS source#136] > : +- Join LeftSemi, (id#0 = id#4) > : :- LogicalRDD [id#0, num#1L], false > : +- Project [id#4] > : +- Filter (isnotnull(num#5L) && (num#5L = 2)) > : +- LogicalRDD [id#4, num#5L], false > +- Join LeftSemi, (id#4#172 = id#4#172) > :- Project [id#4, num#5L, b AS source#137] > : +- LogicalRDD [id#4, num#5L], false > +- Project [id#4 AS id#4#172] > +- Filter (isnotnull(num#5L) && (num#5L = 2)) > +- LogicalRDD [id#4, num#5L], false > {code} > Note the line *+- Join LeftSemi, (id#4#172 = id#4#172)* - this condition > seems wrong, and I believe it causes the LeftSemi to return true for all rows > in the left-hand-side table, thus failing to filter as the WHERE .. IN > should. Compare with the non-buggy version, where both LeftSemi joins have > distinct #-things on both sides: > {code:java} > >>> no_bug.explain() > ... > == Optimized Logical Plan == > Union > :- Project [id#4, num#5L, b AS source#142] > : +- Join LeftSemi, (id#4 = id#4#173) > : :- LogicalRDD [id#4, num#5L], false > : +- Project [id#4 AS id#4#173] > : +- Filter (isnotnull(num#5L) && (num#5L = 2)) > : +- LogicalRDD [id#4, num#5L], false > +- Project [id#0, num#1L, a AS source#143] > +- Join LeftSemi, (id#0 = id#4#173) > :- LogicalRDD [id#0, num#1L], false > +- Project [id#4 AS id#4#173] > +- Filter (isnotnull(num#5L) && (num#5L = 2)) > +- LogicalRDD [id#4, num#5L], false > {code} > > Best, > -Arttu > -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org