[
https://issues.apache.org/jira/browse/SPARK-26078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Matt Cheah updated SPARK-26078:
-------------------------------
Priority: Blocker (was: Major)
> WHERE .. IN fails to filter rows when used in combination with UNION
> --------------------------------------------------------------------
>
> Key: SPARK-26078
> URL: https://issues.apache.org/jira/browse/SPARK-26078
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.1, 2.4.0
> Reporter: Arttu Voutilainen
> Priority: Blocker
>
> Hey,
> We encountered a case where Spark SQL does not seem to handle WHERE .. IN
> correctly, when used in combination with UNION, but instead returns also rows
> that do not fulfill the condition. Swapping the order of the datasets in the
> UNION makes the problem go away. Repro below:
>
> {code}
> sql = SQLContext(sc)
> a = spark.createDataFrame([{'id': 'a', 'num': 2}, {'id':'b', 'num':1}])
> b = spark.createDataFrame([{'id': 'a', 'num': 2}, {'id':'b', 'num':1}])
> a.registerTempTable('a')
> b.registerTempTable('b')
> bug = sql.sql("""
> SELECT id,num,source FROM
> (
> SELECT id, num, 'a' as source FROM a
> UNION ALL
> SELECT id, num, 'b' as source FROM b
> ) AS c
> WHERE c.id IN (SELECT id FROM b WHERE num = 2)
> """)
> no_bug = sql.sql("""
> SELECT id,num,source FROM
> (
> SELECT id, num, 'b' as source FROM b
> UNION ALL
> SELECT id, num, 'a' as source FROM a
> ) AS c
> WHERE c.id IN (SELECT id FROM b WHERE num = 2)
> """)
> bug.show()
> no_bug.show()
> bug.explain(True)
> no_bug.explain(True)
> {code}
> This results in one extra row in the "bug" DF coming from DF "b", that should
> not be there as it
> {code:java}
> >>> bug.show()
> +---+---+------+
> | id|num|source|
> +---+---+------+
> | a| 2| a|
> | a| 2| b|
> | b| 1| b|
> +---+---+------+
> >>> no_bug.show()
> +---+---+------+
> | id|num|source|
> +---+---+------+
> | a| 2| b|
> | a| 2| a|
> +---+---+------+
> {code}
> The reason can be seen in the query plans:
> {code:java}
> >>> bug.explain(True)
> ...
> == Optimized Logical Plan ==
> Union
> == Optimized Logical Plan ==
> Union
> :- Project [id#0, num#1L, a AS source#136]
> : +- Join LeftSemi, (id#0 = id#4)
> : :- LogicalRDD [id#0, num#1L], false
> : +- Project [id#4]
> : +- Filter (isnotnull(num#5L) && (num#5L = 2))
> : +- LogicalRDD [id#4, num#5L], false
> +- Join LeftSemi, (id#4#172 = id#4#172)
> :- Project [id#4, num#5L, b AS source#137]
> : +- LogicalRDD [id#4, num#5L], false
> +- Project [id#4 AS id#4#172]
> +- Filter (isnotnull(num#5L) && (num#5L = 2))
> +- LogicalRDD [id#4, num#5L], false
> {code}
> Note the line *+- Join LeftSemi, (id#4#172 = id#4#172)* - this condition
> seems wrong, and I believe it causes the LeftSemi to return true for all rows
> in the left-hand-side table, thus failing to filter as the WHERE .. IN
> should. Compare with the non-buggy version, where both LeftSemi joins have
> distinct #-things on both sides:
> {code:java}
> >>> no_bug.explain()
> ...
> == Optimized Logical Plan ==
> Union
> :- Project [id#4, num#5L, b AS source#142]
> : +- Join LeftSemi, (id#4 = id#4#173)
> : :- LogicalRDD [id#4, num#5L], false
> : +- Project [id#4 AS id#4#173]
> : +- Filter (isnotnull(num#5L) && (num#5L = 2))
> : +- LogicalRDD [id#4, num#5L], false
> +- Project [id#0, num#1L, a AS source#143]
> +- Join LeftSemi, (id#0 = id#4#173)
> :- LogicalRDD [id#0, num#1L], false
> +- Project [id#4 AS id#4#173]
> +- Filter (isnotnull(num#5L) && (num#5L = 2))
> +- LogicalRDD [id#4, num#5L], false
> {code}
>
> Best,
> -Arttu
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]