[jira] [Commented] (SPARK-16804) Correlated subqueries containing LIMIT return incorrect results

2016-07-29 Thread Nattavut Sutyanyong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15400150#comment-15400150
 ] 

Nattavut Sutyanyong commented on SPARK-16804:
-

To demonstrate that this fix does not unnecessarily block the "good" cases 
(where LIMIT is present but NOT on the correlated path), here is an example, 
which produce the same result set in both with and without this proposed fix.

scala> sql("select c1 from t1 where exists (select 1 from (select 1 from t2 
limit 1) where t1.c1=t2.c2)").show 
+---+   
| c1|
+---+
|  1|
+---+


> Correlated subqueries containing LIMIT return incorrect results
> ---
>
> Key: SPARK-16804
> URL: https://issues.apache.org/jira/browse/SPARK-16804
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Nattavut Sutyanyong
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Correlated subqueries with LIMIT could return incorrect results. The rule 
> ResolveSubquery in the Analysis phase moves correlated predicates to a join 
> predicates and neglect the semantic of the LIMIT.
> Example:
> {noformat}
> Seq(1, 2).toDF("c1").createOrReplaceTempView("t1")
> Seq(1, 2).toDF("c2").createOrReplaceTempView("t2")
> sql("select c1 from t1 where exists (select 1 from t2 where t1.c1=t2.c2 LIMIT 
> 1)").show
> +---+ 
>   
> | c1|
> +---+
> |  1|
> +---+
> {noformat}
> The correct result contains both rows from T1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16804) Correlated subqueries containing LIMIT return incorrect results

2016-07-29 Thread Nattavut Sutyanyong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15400105#comment-15400105
 ] 

Nattavut Sutyanyong commented on SPARK-16804:
-

This fix blocks any correlated subquery when there is a LIMIT operation on the 
path from the parent table to the correlated predicate. We may consider 
relaxing this restriction once we have a better support on processing 
correlated subquery in run-time. SPARK-13417 is an umbrella task to track this 
effort.

Note that if the LIMIT is not in the correlated path, Spark returns correct 
result.

Example:

sql("select c1 from t1 where exists (select 1 from t2 where t1.c1=t2.c2) and 
exists (select 1 from t2 LIMIT 1)").show

will return both rows from T1, which is correctly handled with and without this 
proposed fix.

This fix will change the behaviour of the query

sql("select c1 from t1 where exists (select 1 from t2 where t1.c1=t2.c2 LIMIT 
1)").show

to return an error from the Analysis phase as shown below:

org.apache.spark.sql.AnalysisException: Accessing outer query column is not 
allowed in a LIMIT: LocalLimit 1
...

> Correlated subqueries containing LIMIT return incorrect results
> ---
>
> Key: SPARK-16804
> URL: https://issues.apache.org/jira/browse/SPARK-16804
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Nattavut Sutyanyong
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Correlated subqueries with LIMIT could return incorrect results. The rule 
> ResolveSubquery in the Analysis phase moves correlated predicates to a join 
> predicates and neglect the semantic of the LIMIT.
> Example:
> {noformat}
> Seq(1, 2).toDF("c1").createOrReplaceTempView("t1")
> Seq(1, 2).toDF("c2").createOrReplaceTempView("t2")
> sql("select c1 from t1 where exists (select 1 from t2 where t1.c1=t2.c2 LIMIT 
> 1)").show
> +---+ 
>   
> | c1|
> +---+
> |  1|
> +---+
> {noformat}
> The correct result contains both rows from T1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16804) Correlated subqueries containing LIMIT return incorrect results

2016-07-29 Thread Nattavut Sutyanyong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15400090#comment-15400090
 ] 

Nattavut Sutyanyong commented on SPARK-16804:
-

scala> sql("select c1 from t1 where exists (select 1 from t2 where t1.c1=t2.c2 
LIMIT 1)").explain(true)
== Parsed Logical Plan ==
'Project ['c1]
+- 'Filter exists#21
   :  +- 'SubqueryAlias exists#21
   : +- 'GlobalLimit 1
   :+- 'LocalLimit 1
   :   +- 'Project [unresolvedalias(1, None)]
   :  +- 'Filter ('t1.c1 = 't2.c2)
   : +- 'UnresolvedRelation `t2`
   +- 'UnresolvedRelation `t1`

== Analyzed Logical Plan ==
c1: int
Project [c1#17]
+- Filter predicate-subquery#21 [(c1#17 = c2#10)]
   :  +- SubqueryAlias predicate-subquery#21 [(c1#17 = c2#10)]   <== This 
correlated predicate is incorrectly moved above the LIMIT
   : +- GlobalLimit 1
   :+- LocalLimit 1
   :   +- Project [1 AS 1#26, c2#10]
   :  +- SubqueryAlias t2
   : +- Project [value#8 AS c2#10]
   :+- LocalRelation [value#8]
   +- SubqueryAlias t1
  +- Project [value#15 AS c1#17]
 +- LocalRelation [value#15]

By rewriting the correlated predicate in the subquery in Analysis phase from 
below the LIMIT 1 operation to above it causing the scan of the subquery table 
to return only 1 row. The correct semantic is the LIMIT 1 must be applied on 
the subquery for each input value from the parent table.

> Correlated subqueries containing LIMIT return incorrect results
> ---
>
> Key: SPARK-16804
> URL: https://issues.apache.org/jira/browse/SPARK-16804
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Nattavut Sutyanyong
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Correlated subqueries with LIMIT could return incorrect results. The rule 
> ResolveSubquery in the Analysis phase moves correlated predicates to a join 
> predicates and neglect the semantic of the LIMIT.
> Example:
> {noformat}
> Seq(1, 2).toDF("c1").createOrReplaceTempView("t1")
> Seq(1, 2).toDF("c2").createOrReplaceTempView("t2")
> sql("select c1 from t1 where exists (select 1 from t2 where t1.c1=t2.c2 LIMIT 
> 1)").show
> +---+ 
>   
> | c1|
> +---+
> |  1|
> +---+
> {noformat}
> The correct result contains both rows from T1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16804) Correlated subqueries containing LIMIT return incorrect results

2016-07-29 Thread Nattavut Sutyanyong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15400075#comment-15400075
 ] 

Nattavut Sutyanyong commented on SPARK-16804:
-

I am working on a fix. Could you please assign this JIRA to nsyca?

> Correlated subqueries containing LIMIT return incorrect results
> ---
>
> Key: SPARK-16804
> URL: https://issues.apache.org/jira/browse/SPARK-16804
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Nattavut Sutyanyong
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Correlated subqueries with LIMIT could return incorrect results. The rule 
> ResolveSubquery in the Analysis phase moves correlated predicates to a join 
> predicates and neglect the semantic of the LIMIT.
> Example:
> Seq(1, 2).toDF("c1").createOrReplaceTempView("t1")
> Seq(1, 2).toDF("c2").createOrReplaceTempView("t2")
> sql("select c1 from t1 where exists (select 1 from t2 where t1.c1=t2.c2 LIMIT 
> 1)").show
> +---+ 
>   
> | c1|
> +---+
> |  1|
> +---+
> The correct result contains both rows from T1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16804) Correlated subqueries containing LIMIT return incorrect results

2016-07-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15400068#comment-15400068
 ] 

Apache Spark commented on SPARK-16804:
--

User 'nsyca' has created a pull request for this issue:
https://github.com/apache/spark/pull/14411

> Correlated subqueries containing LIMIT return incorrect results
> ---
>
> Key: SPARK-16804
> URL: https://issues.apache.org/jira/browse/SPARK-16804
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Nattavut Sutyanyong
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Correlated subqueries with LIMIT could return incorrect results. The rule 
> ResolveSubquery in the Analysis phase moves correlated predicates to a join 
> predicates and neglect the semantic of the LIMIT.
> Example:
> Seq(1, 2).toDF("c1").createOrReplaceTempView("t1")
> Seq(1, 2).toDF("c2").createOrReplaceTempView("t2")
> sql("select c1 from t1 where exists (select 1 from t2 where t1.c1=t2.c2 LIMIT 
> 1)").show
> +---+ 
>   
> | c1|
> +---+
> |  1|
> +---+
> The correct result contains both rows from T1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org