[ https://issues.apache.org/jira/browse/SPARK-29769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
angerszhu resolved SPARK-29769.
-------------------------------
    Resolution: Implemented

> Spark SQL cannot handle "exists/not exists" condition when using "JOIN"
> -----------------------------------------------------------------------
>
>                 Key: SPARK-29769
>                 URL: https://issues.apache.org/jira/browse/SPARK-29769
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0, 3.0.0
>            Reporter: angerszhu
>            Priority: Major
>
> On the original master, we can't run SQL that uses `EXISTS/NOT EXISTS` in a join's ON condition:
> {code}
> create temporary view s1 as select * from values
>     (1), (3), (5), (7), (9)
>     as s1(id);
>
> create temporary view s2 as select * from values
>     (1), (3), (4), (6), (9)
>     as s2(id);
>
> create temporary view s3 as select * from values
>     (3), (4), (6), (9)
>     as s3(id);
>
> explain extended SELECT s1.id, s2.id as id2 FROM s1
> LEFT OUTER JOIN s2 ON s1.id = s2.id
> AND EXISTS (SELECT * FROM s3 WHERE s3.id > 6);
> {code}
> we will get:
> {code}
> == Parsed Logical Plan ==
> 'Project ['s1.id, 's2.id AS id2#4]
> +- 'Join LeftOuter, (('s1.id = 's2.id) && exists#3 [])
>    :  +- 'Project [*]
>    :     +- 'Filter ('s3.id > 6)
>    :        +- 'UnresolvedRelation `s3`
>    :- 'UnresolvedRelation `s1`
>    +- 'UnresolvedRelation `s2`
>
> == Analyzed Logical Plan ==
> org.apache.spark.sql.AnalysisException: Table or view not found: `s3`; line 3 pos 27;
> 'Project ['s1.id, 's2.id AS id2#4]
> +- 'Join LeftOuter, ((id#0 = id#1) && exists#3 [])
>    :  +- 'Project [*]
>    :     +- 'Filter ('s3.id > 6)
>    :        +- 'UnresolvedRelation `s3`
>    :- SubqueryAlias `s1`
>    :  +- Project [id#0]
>    :     +- SubqueryAlias `s1`
>    :        +- LocalRelation [id#0]
>    +- SubqueryAlias `s2`
>       +- Project [id#1]
>          +- SubqueryAlias `s2`
>             +- LocalRelation [id#1]
>
> == Optimized Logical Plan ==
> org.apache.spark.sql.AnalysisException: Table or view not found: `s3`; line 3 pos 27;
> 'Project ['s1.id, 's2.id AS id2#4]
> +- 'Join LeftOuter, ((id#0 = id#1) && exists#3 [])
>    :  +- 'Project [*]
>    :     +- 'Filter ('s3.id > 6)
>    :        +- 'UnresolvedRelation `s3`
>    :- SubqueryAlias `s1`
>    :  +- Project [id#0]
>    :     +- SubqueryAlias `s1`
>    :        +- LocalRelation [id#0]
>    +- SubqueryAlias `s2`
>       +- Project [id#1]
>          +- SubqueryAlias `s2`
>             +- LocalRelation [id#1]
>
> == Physical Plan ==
> org.apache.spark.sql.AnalysisException: Table or view not found: `s3`; line 3 pos 27;
> 'Project ['s1.id, 's2.id AS id2#4]
> +- 'Join LeftOuter, ((id#0 = id#1) && exists#3 [])
>    :  +- 'Project [*]
>    :     +- 'Filter ('s3.id > 6)
>    :        +- 'UnresolvedRelation `s3`
>    :- SubqueryAlias `s1`
>    :  +- Project [id#0]
>    :     +- SubqueryAlias `s1`
>    :        +- LocalRelation [id#0]
>    +- SubqueryAlias `s2`
>       +- Project [id#1]
>          +- SubqueryAlias `s2`
>             +- LocalRelation [id#1]
> Time taken: 1.455 seconds, Fetched 1 row(s)
> {code}
> Since the analyzer does not resolve subqueries inside a join's condition in *Analyzer.ResolveSubquery*, table *s3* stays unresolved.
> After PR https://github.com/apache/spark/pull/25854/files we resolve subqueries in join conditions, so the query passes the analyzer stage.
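To make the query's intended semantics concrete, here is a plain-Python simulation (not Spark code) using the view contents from the repro above. Since the `EXISTS` subquery is uncorrelated, it evaluates to a single boolean that gates the whole join condition:

```python
# Plain-Python sketch of the issue's query (not Spark code):
#   SELECT s1.id, s2.id AS id2 FROM s1
#   LEFT OUTER JOIN s2 ON s1.id = s2.id
#     AND EXISTS (SELECT * FROM s3 WHERE s3.id > 6)

s1 = [1, 3, 5, 7, 9]
s2 = [1, 3, 4, 6, 9]
s3 = [3, 4, 6, 9]

# The EXISTS subquery is uncorrelated: one boolean for the whole join.
exists_cond = any(x > 6 for x in s3)  # True, because s3 contains 9

def left_outer_join(left, right, cond):
    """LEFT OUTER JOIN: keep every left row; unmatched rows pair with NULL (None)."""
    out = []
    for l in left:
        matches = [(l, r) for r in right if cond(l, r)]
        out.extend(matches if matches else [(l, None)])
    return out

rows = left_outer_join(s1, s2, lambda l, r: l == r and exists_cond)
print(rows)  # [(1, 1), (3, 3), (5, None), (7, None), (9, 9)]
```

With the uncorrelated `EXISTS` evaluating to true, the query degenerates to an ordinary left outer join on `id`.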
> In current master, if we run the SQL above, we get:
> {code}
> == Parsed Logical Plan ==
> 'Project ['s1.id, 's2.id AS id2#291]
> +- 'Join LeftOuter, (('s1.id = 's2.id) AND exists#290 [])
>    :  +- 'Project [*]
>    :     +- 'Filter ('s3.id > 6)
>    :        +- 'UnresolvedRelation [s3]
>    :- 'UnresolvedRelation [s1]
>    +- 'UnresolvedRelation [s2]
>
> == Analyzed Logical Plan ==
> id: int, id2: int
> Project [id#244, id#250 AS id2#291]
> +- Join LeftOuter, ((id#244 = id#250) AND exists#290 [])
>    :  +- Project [id#256]
>    :     +- Filter (id#256 > 6)
>    :        +- SubqueryAlias `s3`
>    :           +- Project [value#253 AS id#256]
>    :              +- LocalRelation [value#253]
>    :- SubqueryAlias `s1`
>    :  +- Project [value#241 AS id#244]
>    :     +- LocalRelation [value#241]
>    +- SubqueryAlias `s2`
>       +- Project [value#247 AS id#250]
>          +- LocalRelation [value#247]
>
> == Optimized Logical Plan ==
> Project [id#244, id#250 AS id2#291]
> +- Join LeftOuter, (exists#290 [] AND (id#244 = id#250))
>    :  +- Project [value#253 AS id#256]
>    :     +- Filter (value#253 > 6)
>    :        +- LocalRelation [value#253]
>    :- Project [value#241 AS id#244]
>    :  +- LocalRelation [value#241]
>    +- Project [value#247 AS id#250]
>       +- LocalRelation [value#247]
>
> == Physical Plan ==
> *(2) Project [id#244, id#250 AS id2#291]
> +- *(2) BroadcastHashJoin [id#244], [id#250], LeftOuter, BuildRight, exists#290 []
>    :  +- Project [value#253 AS id#256]
>    :     +- Filter (value#253 > 6)
>    :        +- LocalRelation [value#253]
>    :- *(2) Project [value#241 AS id#244]
>    :  +- *(2) LocalTableScan [value#241]
>    +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint))), [id=#670]
>       +- *(1) Project [value#247 AS id#250]
>          +- *(1) LocalTableScan [value#247]
> {code}
> We can see that the exists subquery was resolved, but it has not been converted into a SparkPlan, so the physical plan cannot execute.
> You can still see a logical LocalRelation inside the physical plan:
> {code}
> +- *(2) BroadcastHashJoin [id#244], [id#250], LeftOuter, BuildRight, exists#290 []
>    :  +- Project [value#253 AS id#256]
>    :     +- Filter (value#253 > 6)
>    :        +- LocalRelation [value#253]
> {code}
> In the end I found that we can't use {color:red}EXISTS/NOT EXISTS{color} as an *ON* condition in a *{color:red}LEFT OUTER JOIN / FULL OUTER JOIN / LEFT ANTI JOIN{color}*, because the optimizer's *PushPredicateThroughJoin* rule doesn't cover these cases.
> I have made a PR to support these join types.
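The reason such a predicate cannot simply be pushed down like a WHERE filter is that, for a left outer join, a failing ON condition must still emit the left rows padded with NULLs. A plain-Python sketch of the difference, using made-up data (not Spark code):

```python
# Why an uncorrelated EXISTS in a LEFT OUTER JOIN's ON clause is not equivalent
# to a WHERE filter: if the condition is false, left rows must survive with
# NULLs (None).  Plain-Python sketch with hypothetical data, not Spark code.

s1 = [1, 3, 5]
s2 = [1, 3]
exists_cond = False  # suppose EXISTS (...) evaluated to false

def left_outer_join(left, right, cond):
    """LEFT OUTER JOIN: keep every left row; unmatched rows pair with None."""
    out = []
    for l in left:
        matches = [(l, r) for r in right if cond(l, r)]
        out.extend(matches if matches else [(l, None)])
    return out

# ON semantics: the failing condition only suppresses matches.
on_result = left_outer_join(s1, s2, lambda l, r: l == r and exists_cond)
print(on_result)  # [(1, None), (3, None), (5, None)]

# Naive WHERE push-down: filtering after the join wrongly drops every row.
where_result = [row for row in left_outer_join(s1, s2, lambda l, r: l == r)
                if exists_cond]
print(where_result)  # []
```

The two results differ, which is why a push-down rule must treat outer and anti join types specially rather than moving the predicate out of the ON clause.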