[jira] [Commented] (SPARK-16804) Correlated subqueries containing LIMIT return incorrect results
[ https://issues.apache.org/jira/browse/SPARK-16804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15400150#comment-15400150 ] Nattavut Sutyanyong commented on SPARK-16804: - To demonstrate that this fix does not unnecessarily block the "good" cases (where LIMIT is present but NOT on the correlated path), here is an example, which produce the same result set in both with and without this proposed fix. scala> sql("select c1 from t1 where exists (select 1 from (select 1 from t2 limit 1) where t1.c1=t2.c2)").show +---+ | c1| +---+ | 1| +---+ > Correlated subqueries containing LIMIT return incorrect results > --- > > Key: SPARK-16804 > URL: https://issues.apache.org/jira/browse/SPARK-16804 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Nattavut Sutyanyong > Original Estimate: 72h > Remaining Estimate: 72h > > Correlated subqueries with LIMIT could return incorrect results. The rule > ResolveSubquery in the Analysis phase moves correlated predicates to a join > predicates and neglect the semantic of the LIMIT. > Example: > {noformat} > Seq(1, 2).toDF("c1").createOrReplaceTempView("t1") > Seq(1, 2).toDF("c2").createOrReplaceTempView("t2") > sql("select c1 from t1 where exists (select 1 from t2 where t1.c1=t2.c2 LIMIT > 1)").show > +---+ > > | c1| > +---+ > | 1| > +---+ > {noformat} > The correct result contains both rows from T1. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16804) Correlated subqueries containing LIMIT return incorrect results
[ https://issues.apache.org/jira/browse/SPARK-16804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15400105#comment-15400105 ] Nattavut Sutyanyong commented on SPARK-16804: - This fix blocks any correlated subquery when there is a LIMIT operation on the path from the parent table to the correlated predicate. We may consider relaxing this restriction once we have a better support on processing correlated subquery in run-time. SPARK-13417 is an umbrella task to track this effort. Note that if the LIMIT is not in the correlated path, Spark returns correct result. Example: sql("select c1 from t1 where exists (select 1 from t2 where t1.c1=t2.c2) and exists (select 1 from t2 LIMIT 1)").show will return both rows from T1, which is correctly handled with and without this proposed fix. This fix will change the behaviour of the query sql("select c1 from t1 where exists (select 1 from t2 where t1.c1=t2.c2 LIMIT 1)").show to return an error from the Analysis phase as shown below: org.apache.spark.sql.AnalysisException: Accessing outer query column is not allowed in a LIMIT: LocalLimit 1 ... > Correlated subqueries containing LIMIT return incorrect results > --- > > Key: SPARK-16804 > URL: https://issues.apache.org/jira/browse/SPARK-16804 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Nattavut Sutyanyong > Original Estimate: 72h > Remaining Estimate: 72h > > Correlated subqueries with LIMIT could return incorrect results. The rule > ResolveSubquery in the Analysis phase moves correlated predicates to a join > predicates and neglect the semantic of the LIMIT. > Example: > {noformat} > Seq(1, 2).toDF("c1").createOrReplaceTempView("t1") > Seq(1, 2).toDF("c2").createOrReplaceTempView("t2") > sql("select c1 from t1 where exists (select 1 from t2 where t1.c1=t2.c2 LIMIT > 1)").show > +---+ > > | c1| > +---+ > | 1| > +---+ > {noformat} > The correct result contains both rows from T1. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16804) Correlated subqueries containing LIMIT return incorrect results
[ https://issues.apache.org/jira/browse/SPARK-16804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15400090#comment-15400090 ] Nattavut Sutyanyong commented on SPARK-16804: - scala> sql("select c1 from t1 where exists (select 1 from t2 where t1.c1=t2.c2 LIMIT 1)").explain(true) == Parsed Logical Plan == 'Project ['c1] +- 'Filter exists#21 : +- 'SubqueryAlias exists#21 : +- 'GlobalLimit 1 :+- 'LocalLimit 1 : +- 'Project [unresolvedalias(1, None)] : +- 'Filter ('t1.c1 = 't2.c2) : +- 'UnresolvedRelation `t2` +- 'UnresolvedRelation `t1` == Analyzed Logical Plan == c1: int Project [c1#17] +- Filter predicate-subquery#21 [(c1#17 = c2#10)] : +- SubqueryAlias predicate-subquery#21 [(c1#17 = c2#10)] <== This correlated predicate is incorrectly moved above the LIMIT : +- GlobalLimit 1 :+- LocalLimit 1 : +- Project [1 AS 1#26, c2#10] : +- SubqueryAlias t2 : +- Project [value#8 AS c2#10] :+- LocalRelation [value#8] +- SubqueryAlias t1 +- Project [value#15 AS c1#17] +- LocalRelation [value#15] By rewriting the correlated predicate in the subquery in Analysis phase from below the LIMIT 1 operation to above it causing the scan of the subquery table to return only 1 row. The correct semantic is the LIMIT 1 must be applied on the subquery for each input value from the parent table. > Correlated subqueries containing LIMIT return incorrect results > --- > > Key: SPARK-16804 > URL: https://issues.apache.org/jira/browse/SPARK-16804 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Nattavut Sutyanyong > Original Estimate: 72h > Remaining Estimate: 72h > > Correlated subqueries with LIMIT could return incorrect results. The rule > ResolveSubquery in the Analysis phase moves correlated predicates to a join > predicates and neglect the semantic of the LIMIT. > Example: > {noformat} > Seq(1, 2).toDF("c1").createOrReplaceTempView("t1") > Seq(1, 2).toDF("c2").createOrReplaceTempView("t2") > sql("select c1 from t1 where exists (select 1 from t2 where t1.c1=t2.c2 LIMIT > 1)").show > +---+ > > | c1| > +---+ > | 1| > +---+ > {noformat} > The correct result contains both rows from T1. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16804) Correlated subqueries containing LIMIT return incorrect results
[ https://issues.apache.org/jira/browse/SPARK-16804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15400075#comment-15400075 ] Nattavut Sutyanyong commented on SPARK-16804: - I am working on a fix. Could you please assign this JIRA to nsyca? > Correlated subqueries containing LIMIT return incorrect results > --- > > Key: SPARK-16804 > URL: https://issues.apache.org/jira/browse/SPARK-16804 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Nattavut Sutyanyong > Original Estimate: 72h > Remaining Estimate: 72h > > Correlated subqueries with LIMIT could return incorrect results. The rule > ResolveSubquery in the Analysis phase moves correlated predicates to a join > predicates and neglect the semantic of the LIMIT. > Example: > Seq(1, 2).toDF("c1").createOrReplaceTempView("t1") > Seq(1, 2).toDF("c2").createOrReplaceTempView("t2") > sql("select c1 from t1 where exists (select 1 from t2 where t1.c1=t2.c2 LIMIT > 1)").show > +---+ > > | c1| > +---+ > | 1| > +---+ > The correct result contains both rows from T1. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16804) Correlated subqueries containing LIMIT return incorrect results
[ https://issues.apache.org/jira/browse/SPARK-16804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15400068#comment-15400068 ] Apache Spark commented on SPARK-16804: -- User 'nsyca' has created a pull request for this issue: https://github.com/apache/spark/pull/14411 > Correlated subqueries containing LIMIT return incorrect results > --- > > Key: SPARK-16804 > URL: https://issues.apache.org/jira/browse/SPARK-16804 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Nattavut Sutyanyong > Original Estimate: 72h > Remaining Estimate: 72h > > Correlated subqueries with LIMIT could return incorrect results. The rule > ResolveSubquery in the Analysis phase moves correlated predicates to a join > predicates and neglect the semantic of the LIMIT. > Example: > Seq(1, 2).toDF("c1").createOrReplaceTempView("t1") > Seq(1, 2).toDF("c2").createOrReplaceTempView("t2") > sql("select c1 from t1 where exists (select 1 from t2 where t1.c1=t2.c2 LIMIT > 1)").show > +---+ > > | c1| > +---+ > | 1| > +---+ > The correct result contains both rows from T1. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org