[ https://issues.apache.org/jira/browse/SPARK-16804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15400090#comment-15400090 ]
Nattavut Sutyanyong commented on SPARK-16804: --------------------------------------------- scala> sql("select c1 from t1 where exists (select 1 from t2 where t1.c1=t2.c2 LIMIT 1)").explain(true) == Parsed Logical Plan == 'Project ['c1] +- 'Filter exists#21 : +- 'SubqueryAlias exists#21 : +- 'GlobalLimit 1 : +- 'LocalLimit 1 : +- 'Project [unresolvedalias(1, None)] : +- 'Filter ('t1.c1 = 't2.c2) : +- 'UnresolvedRelation `t2` +- 'UnresolvedRelation `t1` == Analyzed Logical Plan == c1: int Project [c1#17] +- Filter predicate-subquery#21 [(c1#17 = c2#10)] : +- SubqueryAlias predicate-subquery#21 [(c1#17 = c2#10)] <== This correlated predicate is incorrectly moved above the LIMIT : +- GlobalLimit 1 : +- LocalLimit 1 : +- Project [1 AS 1#26, c2#10] : +- SubqueryAlias t2 : +- Project [value#8 AS c2#10] : +- LocalRelation [value#8] +- SubqueryAlias t1 +- Project [value#15 AS c1#17] +- LocalRelation [value#15] By rewriting the correlated predicate in the subquery in Analysis phase from below the LIMIT 1 operation to above it causing the scan of the subquery table to return only 1 row. The correct semantic is the LIMIT 1 must be applied on the subquery for each input value from the parent table. > Correlated subqueries containing LIMIT return incorrect results > --------------------------------------------------------------- > > Key: SPARK-16804 > URL: https://issues.apache.org/jira/browse/SPARK-16804 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.0.0 > Reporter: Nattavut Sutyanyong > Original Estimate: 72h > Remaining Estimate: 72h > > Correlated subqueries with LIMIT could return incorrect results. The rule > ResolveSubquery in the Analysis phase moves correlated predicates to a join > predicates and neglect the semantic of the LIMIT. > Example: > {noformat} > Seq(1, 2).toDF("c1").createOrReplaceTempView("t1") > Seq(1, 2).toDF("c2").createOrReplaceTempView("t2") > sql("select c1 from t1 where exists (select 1 from t2 where t1.c1=t2.c2 LIMIT > 1)").show > +---+ > > | c1| > +---+ > | 1| > +---+ > {noformat} > The correct result contains both rows from T1. -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org