[ 
https://issues.apache.org/jira/browse/SPARK-16804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15400090#comment-15400090
 ] 

Nattavut Sutyanyong commented on SPARK-16804:
---------------------------------------------

scala> sql("select c1 from t1 where exists (select 1 from t2 where t1.c1=t2.c2 
LIMIT 1)").explain(true)
== Parsed Logical Plan ==
'Project ['c1]
+- 'Filter exists#21
   :  +- 'SubqueryAlias exists#21
   :     +- 'GlobalLimit 1
   :        +- 'LocalLimit 1
   :           +- 'Project [unresolvedalias(1, None)]
   :              +- 'Filter ('t1.c1 = 't2.c2)
   :                 +- 'UnresolvedRelation `t2`
   +- 'UnresolvedRelation `t1`

== Analyzed Logical Plan ==
c1: int
Project [c1#17]
+- Filter predicate-subquery#21 [(c1#17 = c2#10)]
   :  +- SubqueryAlias predicate-subquery#21 [(c1#17 = c2#10)]   <== This 
correlated predicate is incorrectly moved above the LIMIT
   :     +- GlobalLimit 1
   :        +- LocalLimit 1
   :           +- Project [1 AS 1#26, c2#10]
   :              +- SubqueryAlias t2
   :                 +- Project [value#8 AS c2#10]
   :                    +- LocalRelation [value#8]
   +- SubqueryAlias t1
      +- Project [value#15 AS c1#17]
         +- LocalRelation [value#15]

By rewriting the correlated predicate in the subquery in Analysis phase from 
below the LIMIT 1 operation to above it causing the scan of the subquery table 
to return only 1 row. The correct semantic is the LIMIT 1 must be applied on 
the subquery for each input value from the parent table.

> Correlated subqueries containing LIMIT return incorrect results
> ---------------------------------------------------------------
>
>                 Key: SPARK-16804
>                 URL: https://issues.apache.org/jira/browse/SPARK-16804
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Nattavut Sutyanyong
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Correlated subqueries with LIMIT could return incorrect results. The rule 
> ResolveSubquery in the Analysis phase moves correlated predicates to a join 
> predicates and neglect the semantic of the LIMIT.
> Example:
> {noformat}
> Seq(1, 2).toDF("c1").createOrReplaceTempView("t1")
> Seq(1, 2).toDF("c2").createOrReplaceTempView("t2")
> sql("select c1 from t1 where exists (select 1 from t2 where t1.c1=t2.c2 LIMIT 
> 1)").show
> +---+                                                                         
>   
> | c1|
> +---+
> |  1|
> +---+
> {noformat}
> The correct result contains both rows from T1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to