[ https://issues.apache.org/jira/browse/SPARK-16804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15400105#comment-15400105 ]
Nattavut Sutyanyong commented on SPARK-16804: --------------------------------------------- This fix blocks any correlated subquery when there is a LIMIT operation on the path from the parent table to the correlated predicate. We may consider relaxing this restriction once we have a better support on processing correlated subquery in run-time. SPARK-13417 is an umbrella task to track this effort. Note that if the LIMIT is not in the correlated path, Spark returns correct result. Example: sql("select c1 from t1 where exists (select 1 from t2 where t1.c1=t2.c2) and exists (select 1 from t2 LIMIT 1)").show will return both rows from T1, which is correctly handled with and without this proposed fix. This fix will change the behaviour of the query sql("select c1 from t1 where exists (select 1 from t2 where t1.c1=t2.c2 LIMIT 1)").show to return an error from the Analysis phase as shown below: org.apache.spark.sql.AnalysisException: Accessing outer query column is not allowed in a LIMIT: LocalLimit 1 ... > Correlated subqueries containing LIMIT return incorrect results > --------------------------------------------------------------- > > Key: SPARK-16804 > URL: https://issues.apache.org/jira/browse/SPARK-16804 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.0.0 > Reporter: Nattavut Sutyanyong > Original Estimate: 72h > Remaining Estimate: 72h > > Correlated subqueries with LIMIT could return incorrect results. The rule > ResolveSubquery in the Analysis phase moves correlated predicates to a join > predicates and neglect the semantic of the LIMIT. > Example: > {noformat} > Seq(1, 2).toDF("c1").createOrReplaceTempView("t1") > Seq(1, 2).toDF("c2").createOrReplaceTempView("t2") > sql("select c1 from t1 where exists (select 1 from t2 where t1.c1=t2.c2 LIMIT > 1)").show > +---+ > > | c1| > +---+ > | 1| > +---+ > {noformat} > The correct result contains both rows from T1. -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org