[ 
https://issues.apache.org/jira/browse/SPARK-16804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15400105#comment-15400105
 ] 

Nattavut Sutyanyong commented on SPARK-16804:
---------------------------------------------

This fix blocks any correlated subquery when there is a LIMIT operation on the 
path from the parent table to the correlated predicate. We may consider 
relaxing this restriction once we have a better support on processing 
correlated subquery in run-time. SPARK-13417 is an umbrella task to track this 
effort.

Note that if the LIMIT is not in the correlated path, Spark returns correct 
result.

Example:

sql("select c1 from t1 where exists (select 1 from t2 where t1.c1=t2.c2) and 
exists (select 1 from t2 LIMIT 1)").show

will return both rows from T1, which is correctly handled with and without this 
proposed fix.

This fix will change the behaviour of the query

sql("select c1 from t1 where exists (select 1 from t2 where t1.c1=t2.c2 LIMIT 
1)").show

to return an error from the Analysis phase as shown below:

org.apache.spark.sql.AnalysisException: Accessing outer query column is not 
allowed in a LIMIT: LocalLimit 1
...

> Correlated subqueries containing LIMIT return incorrect results
> ---------------------------------------------------------------
>
>                 Key: SPARK-16804
>                 URL: https://issues.apache.org/jira/browse/SPARK-16804
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Nattavut Sutyanyong
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Correlated subqueries with LIMIT could return incorrect results. The rule 
> ResolveSubquery in the Analysis phase moves correlated predicates to a join 
> predicates and neglect the semantic of the LIMIT.
> Example:
> {noformat}
> Seq(1, 2).toDF("c1").createOrReplaceTempView("t1")
> Seq(1, 2).toDF("c2").createOrReplaceTempView("t2")
> sql("select c1 from t1 where exists (select 1 from t2 where t1.c1=t2.c2 LIMIT 
> 1)").show
> +---+                                                                         
>   
> | c1|
> +---+
> |  1|
> +---+
> {noformat}
> The correct result contains both rows from T1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to