[jira] [Created] (SPARK-48557) Support scalar subquery with group-by on column equal to constant
Jack Chen created SPARK-48557:
----------------------------------

             Summary: Support scalar subquery with group-by on column equal to constant
                 Key: SPARK-48557
                 URL: https://issues.apache.org/jira/browse/SPARK-48557
             Project: Spark
          Issue Type: Sub-task
          Components: SQL
    Affects Versions: 4.0.0
            Reporter: Jack Chen

For example, this query is currently unsupported but can be allowed, because the filter y2 = 1 guarantees that the group-by has only one value.

{code:java}
select *, (select count(*) from y where x1 = y1 and y2 = 1 group by y2) from x;
{code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
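Why the filter makes this safe can be sketched in plain Python (the table contents here are made up for illustration; the names mirror the query above):

```python
# Hedged sketch: a correlation-independent filter like `y2 = 1` bounds the
# group-by output. Rows of y are (y1, y2) tuples; sample data is hypothetical.
y = [(1, 1), (2, 1), (3, 2)]

def groups_after_filter(rows, y2_const):
    # Apply the filter y2 = <const>, then collect the distinct group-by keys.
    filtered = [r for r in rows if r[1] == y2_const]
    return {r[1] for r in filtered}

# Every surviving row has y2 = 1, so there is at most one group, hence the
# scalar subquery yields at most one row per outer row.
assert len(groups_after_filter(y, 1)) <= 1
```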
[jira] [Updated] (SPARK-48503) Scalar subquery with group-by and non-equality predicate incorrectly allowed, wrong results
[ https://issues.apache.org/jira/browse/SPARK-48503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jack Chen updated SPARK-48503:
------------------------------
        Parent: SPARK-35553
    Issue Type: Sub-task  (was: Bug)

> Scalar subquery with group-by and non-equality predicate incorrectly allowed, wrong results
> -------------------------------------------------------------------------------------------
>
>                 Key: SPARK-48503
>                 URL: https://issues.apache.org/jira/browse/SPARK-48503
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 4.0.0
>            Reporter: Jack Chen
>            Priority: Major
>
> This query is not legal and should give an error, but instead we incorrectly allow it and it returns wrong results.
> {code:java}
> create table x(x1 int, x2 int);
> insert into x values (1, 1);
> create table y(y1 int, y2 int);
> insert into y values (2, 2), (3, 3);
> select *, (select count(*) from y where y1 > x1 group by y1) from x;
> {code}
> It returns two rows, even though there is only one row of x.
> The correct result is an error: more than one row returned by a subquery used as an expression (as seen in Postgres, for example).
>
> This is a longstanding bug. The bug is in CheckAnalysis, in {{checkAggregateInScalarSubquery}}. It allows grouping columns that are present in correlation predicates, but doesn't check whether those predicates are equalities, because when that code was written, non-equality correlation wasn't allowed. It therefore looks like this bug has existed since non-equality correlation was added (~2 years ago).
>
> Various other expressions that are not equi-joins between the inner and outer fields hit this too, e.g. {{where y1 + y2 = x1 group by y1}}.
> Another affected case is when the correlation condition is an equality but it sits under another operator such as an OUTER JOIN or UNION.
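The semantics the analyzer should enforce can be sketched in plain Python, using the data from the repro above (a hedged sketch, not Spark's actual evaluation path):

```python
from collections import Counter

# A scalar subquery must produce at most one row per outer row; otherwise it
# is an error, as in Postgres. Tables x and y match the repro above.
x = [(1, 1)]
y = [(2, 2), (3, 3)]

def scalar_subquery_rows(x1):
    # select count(*) from y where y1 > x1 group by y1
    counts = Counter(y1 for (y1, _y2) in y if y1 > x1)
    return sorted(counts.values())

def evaluate_scalar(x1):
    rows = scalar_subquery_rows(x1)
    if len(rows) > 1:
        raise ValueError(
            "more than one row returned by a subquery used as an expression")
    return rows[0] if rows else None

# For the single row of x, the non-equality correlation y1 > x1 leaves two
# groups (y1 = 2 and y1 = 3), so the correct outcome is an error, not two
# output rows.
```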
[jira] [Updated] (SPARK-48501) Loosen `correlated scalar subqueries must be aggregated` error by doing runtime check for scalar subqueries output rowcount
[ https://issues.apache.org/jira/browse/SPARK-48501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jack Chen updated SPARK-48501:
------------------------------
    Description:

Currently, if a scalar subquery's result isn't aggregated or limited to one row, we throw the error {{Correlated scalar subqueries must be aggregated}}.

This check is often too restrictive; there are many cases where the query should actually be runnable even though we don't know it statically - e.g. unique keys or functional dependencies might ensure that there's only one row.

To handle these cases, it's better to do the check at runtime instead of statically. This could be implemented as a special aggregate operator that throws an exception on input of two or more rows, a "single join" operator that throws an exception when two or more rows match, or something similar.

There are also cases where we were incorrectly allowing queries before that returned wrong results, and should have been rejected as invalid (e.g. SPARK-48503, SPARK-18504). Doing the check at runtime would help avoid those bugs.

Current workarounds: users can add an aggregate like {{any_value()}} or {{first()}} to the output of the subquery, or add {{limit 1}}.

  was:

Currently, if a scalar subquery's result isn't aggregated or limited to one row, we throw the error {{Correlated scalar subqueries must be aggregated}}.

This check is often too restrictive; there are many cases where the query should actually be runnable even though we don't know it statically - e.g. unique keys or functional dependencies might ensure that there's only one row.

To handle these cases, it's better to do the check at runtime instead of statically. This could be implemented as a special aggregate operator that throws an exception on input of two or more rows, a "single join" operator that throws an exception when two or more rows match, or something similar.

There are also cases where we were incorrectly allowing queries before that returned wrong results, and should have been rejected as invalid. Doing the check at runtime would help avoid those bugs.

Current workarounds: users can add an aggregate like {{any_value()}} or {{first()}} to the output of the subquery, or add {{limit 1}}.

> Loosen `correlated scalar subqueries must be aggregated` error by doing runtime check for scalar subqueries output rowcount
> ---------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-48501
>                 URL: https://issues.apache.org/jira/browse/SPARK-48501
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 4.0.0
>            Reporter: Jack Chen
>            Priority: Major
>
> Currently, if a scalar subquery's result isn't aggregated or limited to one row, we throw the error {{Correlated scalar subqueries must be aggregated}}.
> This check is often too restrictive; there are many cases where the query should actually be runnable even though we don't know it statically - e.g. unique keys or functional dependencies might ensure that there's only one row.
> To handle these cases, it's better to do the check at runtime instead of statically. This could be implemented as a special aggregate operator that throws an exception on input of two or more rows, a "single join" operator that throws an exception when two or more rows match, or something similar.
> There are also cases where we were incorrectly allowing queries before that returned wrong results, and should have been rejected as invalid (e.g. SPARK-48503, SPARK-18504). Doing the check at runtime would help avoid those bugs.
> Current workarounds: users can add an aggregate like {{any_value()}} or {{first()}} to the output of the subquery, or add {{limit 1}}.
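The proposed runtime check can be sketched in a few lines of plain Python (a hedged illustration of the idea, not the operator Spark would actually add; the function name is made up):

```python
# Instead of statically rejecting unaggregated correlated scalar subqueries,
# a runtime operator raises only if the subquery actually produces two or
# more rows for a given outer row.
def check_single_row(rows):
    if len(rows) > 1:
        raise RuntimeError(
            "more than one row returned by a subquery used as an expression")
    return rows[0] if rows else None  # NULL when the subquery is empty

# A subquery that happens to produce one row (e.g. because of a unique key
# the analyzer cannot see) now runs fine; a multi-row result fails at runtime.
assert check_single_row([42]) == 42
assert check_single_row([]) is None
```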
[jira] [Created] (SPARK-48503) Scalar subquery with group-by and non-equality predicate incorrectly allowed, wrong results
Jack Chen created SPARK-48503:
----------------------------------

             Summary: Scalar subquery with group-by and non-equality predicate incorrectly allowed, wrong results
                 Key: SPARK-48503
                 URL: https://issues.apache.org/jira/browse/SPARK-48503
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 4.0.0
            Reporter: Jack Chen

This query is not legal and should give an error, but instead we incorrectly allow it and it returns wrong results.

{code:java}
create table x(x1 int, x2 int);
insert into x values (1, 1);
create table y(y1 int, y2 int);
insert into y values (2, 2), (3, 3);
select *, (select count(*) from y where y1 > x1 group by y1) from x;
{code}

It returns two rows, even though there is only one row of x.

The correct result is an error: more than one row returned by a subquery used as an expression (as seen in Postgres, for example).

This is a longstanding bug. The bug is in CheckAnalysis, in {{checkAggregateInScalarSubquery}}. It allows grouping columns that are present in correlation predicates, but doesn't check whether those predicates are equalities, because when that code was written, non-equality correlation wasn't allowed. It therefore looks like this bug has existed since non-equality correlation was added (~2 years ago).

Various other expressions that are not equi-joins between the inner and outer fields hit this too, e.g. {{where y1 + y2 = x1 group by y1}}.

Another affected case is when the correlation condition is an equality but it sits under another operator such as an OUTER JOIN or UNION.
[jira] [Created] (SPARK-48501) Loosen `correlated scalar subqueries must be aggregated` error by doing runtime check for scalar subqueries output rowcount
Jack Chen created SPARK-48501:
----------------------------------

             Summary: Loosen `correlated scalar subqueries must be aggregated` error by doing runtime check for scalar subqueries output rowcount
                 Key: SPARK-48501
                 URL: https://issues.apache.org/jira/browse/SPARK-48501
             Project: Spark
          Issue Type: Sub-task
          Components: SQL
    Affects Versions: 4.0.0
            Reporter: Jack Chen

Currently, if a scalar subquery's result isn't aggregated or limited to one row, we throw the error {{Correlated scalar subqueries must be aggregated}}.

This check is often too restrictive; there are many cases where the query should actually be runnable even though we don't know it statically - e.g. unique keys or functional dependencies might ensure that there's only one row.

To handle these cases, it's better to do the check at runtime instead of statically. This could be implemented as a special aggregate operator that throws an exception on input of two or more rows, a "single join" operator that throws an exception when two or more rows match, or something similar.

There are also cases where we were incorrectly allowing queries before that returned wrong results, and should have been rejected as invalid. Doing the check at runtime would help avoid those bugs.

Current workarounds: users can add an aggregate like {{any_value()}} or {{first()}} to the output of the subquery, or add {{limit 1}}.
[jira] [Resolved] (SPARK-36115) Handle the COUNT bug for correlated IN/EXISTS subquery
[ https://issues.apache.org/jira/browse/SPARK-36115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jack Chen resolved SPARK-36115.
-------------------------------
    Resolution: Fixed

This was fixed by https://github.com/apache/spark/pull/43111

> Handle the COUNT bug for correlated IN/EXISTS subquery
> ------------------------------------------------------
>
>                 Key: SPARK-36115
>                 URL: https://issues.apache.org/jira/browse/SPARK-36115
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: Allison Wang
>            Priority: Major
>
> Correlated IN/EXISTS subqueries are also subject to the COUNT bug, which is not handled.
> {code:sql}
> create view t1(c1, c2) as values (0, 1), (1, 2)
> create view t2(c1, c2) as values (0, 2), (0, 3)
> -- Example 1: IN subquery
> select * from t1 where c1 in (select count(*) + 1 from t2 where t1.c1 = t2.c1)
> -- Correct answer: (1, 2)
> +---+---+
> |c1 |c2 |
> +---+---+
> +---+---+
> -- Example 2: EXISTS subquery
> select * from t1 where exists (select count(*) from t2 where t1.c1 = t2.c1)
> -- Correct answer: [(0, 1), (1, 2)]
> +---+---+
> |c1 |c2 |
> +---+---+
> |0  |1  |
> +---+---+
> {code}
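The COUNT-bug semantics in the two examples above can be checked with a plain-Python sketch (a hedged illustration of the expected results, not Spark's rewrite):

```python
# A correlated count(*) must produce a row (count = 0) even for outer rows
# with no matches; a naive join-based rewrite drops those rows, which is the
# COUNT bug. Data matches the views above.
t1 = [(0, 1), (1, 2)]
t2 = [(0, 2), (0, 3)]

def count_matches(c1):
    # select count(*) from t2 where t1.c1 = t2.c1 -- always exactly one row
    return sum(1 for (d1, _d2) in t2 if d1 == c1)

# Example 1: IN subquery over count(*) + 1. Row (1, 2) matches because the
# empty group still yields count 0, so 1 in (0 + 1).
in_result = [row for row in t1 if row[0] == count_matches(row[0]) + 1]
assert in_result == [(1, 2)]

# Example 2: EXISTS over count(*). The aggregate result set is never empty,
# so EXISTS is true for every outer row.
exists_result = [row for row in t1 if len([count_matches(row[0])]) > 0]
assert exists_result == [(0, 1), (1, 2)]
```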
[jira] [Updated] (SPARK-47716) SQLQueryTestSuite flaky case due to view name conflict
[ https://issues.apache.org/jira/browse/SPARK-47716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jack Chen updated SPARK-47716:
------------------------------
    Description:

In SQLQueryTestSuite, the test case "Test logic for determining whether a query is semantically sorted" can sometimes fail with the error {{Cannot create table or view `main`.`default`.`t1` because it already exists.}} if run concurrently with other SQL test cases that also create tables with the same name.

  was:

In SQLQueryTestSuite, the test case "Test logic for determining whether a query is semantically sorted" sometimes has a flaky failure with the error {{Cannot create table or view `main`.`default`.`t1` because it already exists.}} This seems to be due to SQL test cases that also create tables with the same name.

> SQLQueryTestSuite flaky case due to view name conflict
> ------------------------------------------------------
>
>                 Key: SPARK-47716
>                 URL: https://issues.apache.org/jira/browse/SPARK-47716
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 4.0.0
>            Reporter: Jack Chen
>            Priority: Major
>              Labels: pull-request-available
>
> In SQLQueryTestSuite, the test case "Test logic for determining whether a query is semantically sorted" can sometimes fail with the error {{Cannot create table or view `main`.`default`.`t1` because it already exists.}} if run concurrently with other SQL test cases that also create tables with the same name.
[jira] [Created] (SPARK-47716) SQLQueryTestSuite flaky case due to view name conflict
Jack Chen created SPARK-47716:
----------------------------------

             Summary: SQLQueryTestSuite flaky case due to view name conflict
                 Key: SPARK-47716
                 URL: https://issues.apache.org/jira/browse/SPARK-47716
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 4.0.0
            Reporter: Jack Chen

In SQLQueryTestSuite, the test case "Test logic for determining whether a query is semantically sorted" sometimes has a flaky failure with the error {{Cannot create table or view `main`.`default`.`t1` because it already exists.}} This seems to be due to other SQL test cases that also create tables with the same name.
[jira] [Created] (SPARK-47700) Fix formatting of error messages with treeNode
Jack Chen created SPARK-47700:
----------------------------------

             Summary: Fix formatting of error messages with treeNode
                 Key: SPARK-47700
                 URL: https://issues.apache.org/jira/browse/SPARK-47700
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 4.0.0
            Reporter: Jack Chen

Example: we are concatenating the plan onto the message without any separation, as in {{in this locationFilter (dept_id#652}}. We should add a colon and a space or newline in between:

{code:java}
[UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY.ACCESSING_OUTER_QUERY_COLUMN_IS_NOT_ALLOWED] Unsupported subquery expression: Accessing outer query column is not allowed in this locationFilter (dept_id#652 = outer(dept_id#646))
+- SubqueryAlias dept
   +- View (`DEPT`, [dept_id#652, dept_name#653, state#654])
      +- Project [cast(dept_id#655 as int) AS dept_id#652, cast(dept_name#656 as string) AS dept_name#653, cast(state#657 as string) AS state#654]
         +- Project [dept_id#655, dept_name#656, state#657]
            +- SubqueryAlias DEPT
               +- LocalRelation [dept_id#655, dept_name#656, state#657]
. SQLSTATE: 0A000
{code}
[jira] [Created] (SPARK-47525) Support subquery correlation joining on map attributes
Jack Chen created SPARK-47525:
----------------------------------

             Summary: Support subquery correlation joining on map attributes
                 Key: SPARK-47525
                 URL: https://issues.apache.org/jira/browse/SPARK-47525
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.5.0
            Reporter: Jack Chen

Currently, when a subquery is correlated on a condition like `outer_map[1] = inner_map[1]`, DecorrelateInnerQuery generates a join on the map itself, which is unsupported, so the query cannot run - for example:

{code:java}
scala> Seq(Map(0 -> 0)).toDF.createOrReplaceTempView("v")

scala> sql("select v1.value[0] from v v1 where v1.value[0] > (select avg(v2.value[0]) from v v2 where v1.value[1] = v2.value[1])").explain
org.apache.spark.sql.AnalysisException: [UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY.UNSUPPORTED_CORRELATED_REFERENCE_DATA_TYPE] Unsupported subquery expression: Correlated column reference 'v1.value' cannot be map type. SQLSTATE: 0A000; line 1 pos 49
  at org.apache.spark.sql.errors.QueryCompilationErrors$.unsupportedCorrelatedReferenceDataTypeError(QueryCompilationErrors.scala:2463)
  ...
{code}

However, if we rewrite the query to pull out the map access `outer_map[1]` into the outer plan, it succeeds:

{code:java}
scala> sql("""with tmp as (
  select value[0] as value0, value[1] as value1 from v
)
select v1.value0 from tmp v1 where v1.value0 > (select avg(v2.value0) from tmp v2 where v1.value1 = v2.value1)""").explain
{code}

Another point that can be improved is that, even if the data type supports join, we still don't need to join on the full attribute, and we can get a better plan by doing the same rewrite to pull out the extract expression.
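A plain-Python analogy for why the rewrite helps (a hedged sketch; the sample rows are hypothetical and only mirror the map-valued column):

```python
# A map (here a dict) cannot serve as a hash-join key, but the element
# extracted from it can - mirroring the rewrite that pulls `outer_map[1]`
# into the outer plan.
rows = [({0: 0, 1: 7}, "a"), ({0: 1, 1: 7}, "b")]

def is_hashable(key):
    try:
        hash(key)
        return True
    except TypeError:
        return False

# Correlating on the whole map fails, much like the map-typed correlated
# reference rejected in the error above: dicts are unhashable.
assert not is_hashable(rows[0][0])

# Extracting value[1] first (the `with tmp as (...)` rewrite) leaves a
# scalar that is a perfectly good join key.
extracted = [(m[1], tag) for (m, tag) in rows]
assert all(is_hashable(k) for (k, _tag) in extracted)
```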
[jira] [Updated] (SPARK-47525) Support subquery correlation joining on map attributes
[ https://issues.apache.org/jira/browse/SPARK-47525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jack Chen updated SPARK-47525:
------------------------------
    Description:

Currently, when a subquery is correlated on a condition like `outer_map[1] = inner_map[1]`, DecorrelateInnerQuery generates a join on the map itself, which is unsupported, so the query cannot run - for example:

{code:java}
scala> Seq(Map(0 -> 0)).toDF.createOrReplaceTempView("v")

scala> sql("select v1.value[0] from v v1 where v1.value[0] > (select avg(v2.value[0]) from v v2 where v1.value[1] = v2.value[1])").explain
org.apache.spark.sql.AnalysisException: [UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY.UNSUPPORTED_CORRELATED_REFERENCE_DATA_TYPE] Unsupported subquery expression: Correlated column reference 'v1.value' cannot be map type. SQLSTATE: 0A000; line 1 pos 49
  at org.apache.spark.sql.errors.QueryCompilationErrors$.unsupportedCorrelatedReferenceDataTypeError(QueryCompilationErrors.scala:2463)
  ...
{code}

However, if we rewrite the query to pull out the map access `outer_map[1]` into the outer plan, it succeeds:

{code:java}
scala> sql("""with tmp as (
  select value[0] as value0, value[1] as value1 from v
)
select v1.value0 from tmp v1 where v1.value0 > (select avg(v2.value0) from tmp v2 where v1.value1 = v2.value1)""").explain
{code}

Another point that can be improved is that, even if the data type supports join, we still don't need to join on the full attribute, and we can get a better plan by doing the same rewrite to pull out the extract expression.

  was:

Currently, when a subquery is correlated on a condition like `outer_map[1] = inner_map[1]`, DecorrelateInnerQuery generates a join on the map itself, which is unsupported, so the query cannot run - for example:

{{scala> Seq(Map(0 -> 0)).toDF.createOrReplaceTempView("v")}}
{{scala> sql("select v1.value[0] from v v1 where v1.value[0] > (select avg(v2.value[0]) from v v2 where v1.value[1] = v2.value[1])").explain}}
{{org.apache.spark.sql.AnalysisException: [UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY.UNSUPPORTED_CORRELATED_REFERENCE_DATA_TYPE] Unsupported subquery expression: Correlated column reference 'v1.value' cannot be map type. SQLSTATE: 0A000; line 1 pos 49 at org.apache.spark.sql.errors.QueryCompilationErrors$.unsupportedCorrelatedReferenceDataTypeError(QueryCompilationErrors.scala:2463) ...}}

However, if we rewrite the query to pull out the map access `outer_map[1]` into the outer plan, it succeeds:

{{scala> sql("""with tmp as ( select value[0] as value0, value[1] as value1 from v ) select v1.value0 from tmp v1 where v1.value0 > (select avg(v2.value0) from tmp v2 where v1.value1 = v2.value1)""").explain}}

Another point that can be improved is that, even if the data type supports join, we still don't need to join on the full attribute, and we can get a better plan by doing the same rewrite to pull out the extract expression.

> Support subquery correlation joining on map attributes
> ------------------------------------------------------
>
>                 Key: SPARK-47525
>                 URL: https://issues.apache.org/jira/browse/SPARK-47525
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.5.0
>            Reporter: Jack Chen
>            Priority: Major
>
> Currently, when a subquery is correlated on a condition like `outer_map[1] = inner_map[1]`, DecorrelateInnerQuery generates a join on the map itself, which is unsupported, so the query cannot run - for example:
> {code:java}
> scala> Seq(Map(0 -> 0)).toDF.createOrReplaceTempView("v")
> scala> sql("select v1.value[0] from v v1 where v1.value[0] > (select avg(v2.value[0]) from v v2 where v1.value[1] = v2.value[1])").explain
> org.apache.spark.sql.AnalysisException: [UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY.UNSUPPORTED_CORRELATED_REFERENCE_DATA_TYPE] Unsupported subquery expression: Correlated column reference 'v1.value' cannot be map type. SQLSTATE: 0A000; line 1 pos 49
>   at org.apache.spark.sql.errors.QueryCompilationErrors$.unsupportedCorrelatedReferenceDataTypeError(QueryCompilationErrors.scala:2463)
>   ...
> {code}
> However, if we rewrite the query to pull out the map access `outer_map[1]` into the outer plan, it succeeds:
> {code:java}
> scala> sql("""with tmp as (
>   select value[0] as value0, value[1] as value1 from v
> )
> select v1.value0 from tmp v1 where v1.value0 > (select avg(v2.value0) from tmp v2 where v1.value1 = v2.value1)""").explain
> {code}
> Another point that can be improved is that, even if the data type supports join, we still don't need to join on the full attribute, and we can get a better plan by doing the same rewrite to pull out the extract expression.
[jira] [Comment Edited] (SPARK-36113) Unify the logic to handle COUNT bug for scalar and lateral subqueries
[ https://issues.apache.org/jira/browse/SPARK-36113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17824521#comment-17824521 ]

Jack Chen edited comment on SPARK-36113 at 3/7/24 8:52 PM:
----------------------------------------------------------

The tricky aspect of this is that if we do the count-bug handling in DecorrelateInnerQuery, we end up adding an extra outer join for scalar subqueries with count. This is because scalar subqueries always require a left join anyway (even without count): if no data comes out of the subquery, we still need to keep the outer row - that join is added in constructLeftJoins. And if we handle the count bug in DecorrelateInnerQuery using the existing logic, it would create another left join, which may be in the middle of the subquery plan.

I think to solve this we want to check whether there is a count aggregate at the top of the scalar subquery, and specially handle that case to generate a single outer join. Perhaps constructLeftJoins can check if there's a left DomainJoin at the top of the subquery plan and flip it to inner - but then we also need to make sure the expression for count-bug handling, {{if(isnull(...), 0, cnt)}}, is in the right place, in the newly built outer join.

In general, I think we want to move the logic for determining whether there's a count bug earlier, to DecorrelateInnerQuery. That unifies the logic between scalar and non-scalar subqueries, and allows us to avoid a whole class of bugs with doing the scalar-subquery count-bug handling later - see e.g. https://github.com/apache/spark/pull/40811 and https://github.com/apache/spark/pull/45125. And see https://docs.google.com/document/d/1YCce0DtJ6NkMP1QM1nnfYvkiH1CkHoGyMjKQ8MX5bUM/edit for more on why this decorrelation happens in two steps.

was (Author: JIRAUSER299035):

The tricky aspect of this is that if we do the count bug handling in DecorrelateInnerQuery, we end up adding an extra outer join for scalar subqueries with count.

This is because scalar subqueries always require a left join anyway (even without count) because if there's no data coming out of the subquery we still need to keep the row - that's added in constructLeftJoins. And if we handle count bug in DecorrelateInnerQuery using the existing logic, it would create another left join, which may be in the middle of the subquery plan.

I think to solve this we want to check whether this is a count agg at the top of the scalar subquery, and specially handle that case to generate a single outer join. Perhaps constructLeftJoins can check if there's a left DomainJoin at the top of the subquery plan and flip it to inner. This way all the logic for determining whether there's a count bug happens earlier in DecorrelateInnerQuery, which unifies the logic and allows us to avoid a whole class of bugs with doing the scalar subquery count bug handling later - see e.g. https://github.com/apache/spark/pull/40811 and https://github.com/apache/spark/pull/45125. And see https://docs.google.com/document/d/1YCce0DtJ6NkMP1QM1nnfYvkiH1CkHoGyMjKQ8MX5bUM/edit for more on why this decorrelation happens in two steps.

> Unify the logic to handle COUNT bug for scalar and lateral subqueries
> ---------------------------------------------------------------------
>
>                 Key: SPARK-36113
>                 URL: https://issues.apache.org/jira/browse/SPARK-36113
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: Allison Wang
>            Priority: Major
>
> Currently, for scalar subqueries, the logic to handle the count bug is in `RewriteCorrelatedScalarSubquery` (constructLeftJoins) while lateral subqueries rely on `DecorrelateInnerQuery` to handle the COUNT bug.
> We should unify them without introducing additional left outer joins for scalar subqueries.
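The count-bug patch expression discussed above can be illustrated with a plain-Python left-join sketch (a hedged illustration; the data is made up and `None` stands in for SQL NULL):

```python
# After the left outer join built in constructLeftJoins, outer rows with no
# matching subquery row see NULL for the subquery's count, which the patch
# expression if(isnull(cnt), 0, cnt) must read as 0.
outer_keys = [1, 2]
inner_keys = [1, 1]

def left_join_count(key):
    # Aggregated subquery joined with a left outer join: None models the
    # NULL produced when no inner row matches the outer key.
    cnt = sum(1 for k in inner_keys if k == key)
    return cnt if cnt > 0 else None

def patch_count(cnt):
    # if(isnull(cnt), 0, cnt)
    return 0 if cnt is None else cnt

# Key 2 has no matches: the raw join yields NULL, the patched value is 0,
# which is the answer count(*) should give.
results = [(k, patch_count(left_join_count(k))) for k in outer_keys]
assert results == [(1, 2), (2, 0)]
```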
[jira] [Comment Edited] (SPARK-36113) Unify the logic to handle COUNT bug for scalar and lateral subqueries
[ https://issues.apache.org/jira/browse/SPARK-36113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17824521#comment-17824521 ]

Jack Chen edited comment on SPARK-36113 at 3/7/24 8:47 PM:
----------------------------------------------------------

The tricky aspect of this is that if we do the count-bug handling in DecorrelateInnerQuery, we end up adding an extra outer join for scalar subqueries with count. This is because scalar subqueries always require a left join anyway (even without count): if no data comes out of the subquery, we still need to keep the outer row - that join is added in constructLeftJoins. And if we handle the count bug in DecorrelateInnerQuery using the existing logic, it would create another left join, which may be in the middle of the subquery plan.

I think to solve this we want to check whether there is a count aggregate at the top of the scalar subquery, and specially handle that case to generate a single outer join. Perhaps constructLeftJoins can check if there's a left DomainJoin at the top of the subquery plan and flip it to inner. This way all the logic for determining whether there's a count bug happens earlier in DecorrelateInnerQuery, which unifies the logic and allows us to avoid a whole class of bugs with doing the scalar-subquery count-bug handling later - see e.g. https://github.com/apache/spark/pull/40811 and https://github.com/apache/spark/pull/45125. And see https://docs.google.com/document/d/1YCce0DtJ6NkMP1QM1nnfYvkiH1CkHoGyMjKQ8MX5bUM/edit for more on why this decorrelation happens in two steps.

was (Author: JIRAUSER299035):

The tricky aspect of this is that if we do the count bug handling in DecorrelateInnerQuery, we end up adding an extra outer join for scalar subqueries with count. This is because scalar subqueries always require a left join anyway (even without count) because if there's no data coming out of the subquery we still need to keep the row - that's added in constructLeftJoins.

And if we handle count bug in DecorrelateInnerQuery using the existing logic, it would create another left join, which may be in the middle of the subquery plan. I think to solve this we want to check whether this is a count agg at the top of the scalar subquery, and specially handle that case to generate a single outer join. Adding that outer join still needs to happen in the same place (rewriteDomainJoins), but perhaps we can move the logic of checking whether the subquery is affected by count bug all to DecorrelateInnerQuery, and that just leaves a flag for rewriteDomainJoins/constructLeftJoins later on.

> Unify the logic to handle COUNT bug for scalar and lateral subqueries
> ---------------------------------------------------------------------
>
>                 Key: SPARK-36113
>                 URL: https://issues.apache.org/jira/browse/SPARK-36113
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: Allison Wang
>            Priority: Major
>
> Currently, for scalar subqueries, the logic to handle the count bug is in `RewriteCorrelatedScalarSubquery` (constructLeftJoins) while lateral subqueries rely on `DecorrelateInnerQuery` to handle the COUNT bug.
> We should unify them without introducing additional left outer joins for scalar subqueries.
[jira] [Comment Edited] (SPARK-36113) Unify the logic to handle COUNT bug for scalar and lateral subqueries
[ https://issues.apache.org/jira/browse/SPARK-36113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17824521#comment-17824521 ]

Jack Chen edited comment on SPARK-36113 at 3/7/24 8:26 PM:
----------------------------------------------------------

The tricky aspect of this is that if we do the count bug handling in DecorrelateInnerQuery, we end up adding an extra outer join for scalar subqueries with count. This is because scalar subqueries always require a left join anyway (even without count) because if there's no data coming out of the subquery we still need to keep the row - that's added in constructLeftJoins. And if we handle count bug in DecorrelateInnerQuery using the existing logic, it would create another left join, which may be in the middle of the subquery plan.

I think to solve this we want to check whether this is a count agg at the top of the scalar subquery, and specially handle that case to generate a single outer join. Adding that outer join still needs to happen in the same place (rewriteDomainJoins), but perhaps we can move the logic of checking whether the subquery is affected by count bug all to DecorrelateInnerQuery, and that just leaves a flag for rewriteDomainJoins/constructLeftJoins later on.

was (Author: JIRAUSER299035):

The tricky aspect of this is that if we do the count bug handling in DecorrelateInnerQuery, we end up adding an extra outer join for scalar subqueries with count. This is because scalar subqueries always require a left join anyway (even without count) because if there's no data coming out of the subquery we still need to keep the row. And if we handle count bug in DecorrelateInnerQuery using the existing logic, it would create another left join, which may be in the middle of the subquery plan.

I think to solve this we want to check whether this is a count agg at the top of the scalar subquery, and specially handle that case to generate a single outer join. Adding that outer join still needs to happen in the same place (rewriteDomainJoins), but perhaps we can move the logic of checking whether the subquery is affected by count bug all to DecorrelateInnerQuery, and that just leaves a flag for rewriteDomainJoins later on.

> Unify the logic to handle COUNT bug for scalar and lateral subqueries
> ---------------------------------------------------------------------
>
>                 Key: SPARK-36113
>                 URL: https://issues.apache.org/jira/browse/SPARK-36113
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: Allison Wang
>            Priority: Major
>
> Currently, for scalar subqueries, the logic to handle the count bug is in `RewriteCorrelatedScalarSubquery` (constructLeftJoins) while lateral subqueries rely on `DecorrelateInnerQuery` to handle the COUNT bug.
> We should unify them without introducing additional left outer joins for scalar subqueries.
[jira] [Comment Edited] (SPARK-36113) Unify the logic to handle COUNT bug for scalar and lateral subqueries
[ https://issues.apache.org/jira/browse/SPARK-36113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17824521#comment-17824521 ] Jack Chen edited comment on SPARK-36113 at 3/7/24 8:09 PM: --- The tricky aspect of this is that if we do the count bug handling in DecorrelateInnerQuery, we end up adding an extra outer join for scalar subqueries with count. This is because scalar subqueries always require a left join anyway (even without count) because if there's no data coming out of the subquery we still need to keep the row. And if we handle count bug in DecorrelateInnerQuery using the existing logic, it would create another left join, which may be in the middle of the subquery plan. I think to solve this we want to check whether this is a count agg at the top of the scalar subquery, and specially handle that case to generate a single outer join. Adding that outer join still needs to happen in the same place (rewriteDomainJoins), but perhaps we can move the logic of checking whether the subquery is affected by count bug all to DecorrelateInnerQuery, and that just leaves a flag for rewriteDomainJoins later on. 
> Unify the logic to handle COUNT bug for scalar and lateral subqueries > - > > Key: SPARK-36113 > URL: https://issues.apache.org/jira/browse/SPARK-36113 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Allison Wang >Priority: Major > > Currently, for scalar subqueries, the logic to handle the count bug is in > `RewriteCorrelatedScalarSubquery` (constructLeftJoins) while lateral > subqueries rely on `DecorrelateInnerQuery` to handle the COUNT bug. > We should unify them without introducing additional left outer joins for > scalar subqueries. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36113) Unify the logic to handle COUNT bug for scalar and lateral subqueries
[ https://issues.apache.org/jira/browse/SPARK-36113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17824521#comment-17824521 ] Jack Chen commented on SPARK-36113: --- The tricky aspect of this is that if we do the count bug handling in DecorrelateInnerQuery, we end up adding an extra outer join for scalar subqueries with count. This is because scalar subqueries always require a left join anyway (even without count) because if there's no data coming out of the subquery we still need to keep the row. And if we handle count bug in DecorrelateInnerQuery using the existing logic, it would create another left join, which may be in the middle of the subquery plan. I think to solve this we want to check whether this is a count agg at the top of the scalar subquery, and specially handle that case to generate a single outer join. > Unify the logic to handle COUNT bug for scalar and lateral subqueries > - > > Key: SPARK-36113 > URL: https://issues.apache.org/jira/browse/SPARK-36113 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Allison Wang >Priority: Major > > Currently, for scalar subqueries, the logic to handle the count bug is in > `RewriteCorrelatedScalarSubquery` (constructLeftJoins) while lateral > subqueries rely on `DecorrelateInnerQuery` to handle the COUNT bug. > We should unify them without introducing additional left outer joins for > scalar subqueries. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
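The "COUNT bug" discussed in this thread is the reason decorrelation cannot simply turn a correlated scalar {{count(*)}} subquery into an inner join plus an aggregate: outer rows with no matching inner rows must still produce a count of 0, not disappear and not produce NULL. The following toy Python sketch (hypothetical tables, not Spark code) contrasts the naive join-then-aggregate rewrite with the outer-join-plus-zero-fixup that constructLeftJoins effectively performs:

```python
# Toy illustration of the COUNT bug in scalar-subquery decorrelation.
# Intended query: select x1, (select count(*) from y where y1 = x1) from x
from collections import defaultdict

x = [(1,), (2,)]          # outer table: x1 values (no y row matches x1 = 2)
y = [(1,), (1,)]          # inner table: y1 values

# Reference semantics: evaluate the subquery per outer row.
expected = [(x1, sum(1 for (y1,) in y if y1 == x1)) for (x1,) in x]

# Naive decorrelation: inner join then aggregate -> the x1 = 2 row vanishes.
counts = defaultdict(int)
for (x1,) in x:
    for (y1,) in y:
        if y1 == x1:
            counts[x1] += 1
naive = [(x1, counts[x1]) for (x1,) in x if x1 in counts]

# Correct decorrelation: LEFT OUTER join the aggregated counts back to x,
# substituting 0 where no group exists (the null-to-zero fixup).
fixed = [(x1, counts.get(x1, 0)) for (x1,) in x]

print(expected)  # [(1, 2), (2, 0)]
print(naive)     # [(1, 2)]  -- the row for x1 = 2 is lost
print(fixed)     # [(1, 2), (2, 0)]
```

This is why the comment above notes that handling the count bug inside DecorrelateInnerQuery with the existing logic would add a second left join on top of the one that constructLeftJoins already introduces for scalar subqueries.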
[jira] [Updated] (SPARK-46446) Correctness bug in correlated subquery with OFFSET
[ https://issues.apache.org/jira/browse/SPARK-46446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jack Chen updated SPARK-46446: -- Description: Subqueries with correlation under LIMIT with OFFSET have a correctness bug, introduced recently when support for correlation under OFFSET was enabled but were not handled correctly. (So we went from unsupported, query throws error -> wrong results.) It’s a bug in all types of correlated subqueries: scalar, lateral, IN, EXISTS It's easy to repro with a query like {code:java} create table x(x1 int, x2 int); insert into x values (1, 1), (2, 2); create table y(y1 int, y2 int); insert into y values (1, 1), (1, 2), (2, 4); select * from x where exists (select * from y where x1 = y1 limit 1 offset 2){code} Correct result: empty set, see postgres: [https://www.db-fiddle.com/f/dtXNn7hwDnemiCTUhvwgYM/0] Spark result: Array([2,2]) The [PR|https://github.com/apache/spark/pull/43111/files/324a106611e6d62c31535cfc43863fdaa16e5dda#diff-583171e935b2dc349378063a5841c5b98b30a2d57ac3743a9eccfe7bffcb8f2aR1403] where it was introduced added a test for it, but the golden file results for the test actually were incorrect and we didn't notice. (The bug was initially found by https://github.com/apache/spark/pull/44084) I'll work on both: * Adding support for offset in DecorrelateInnerQuery (the transformation is into a filter on row_number window function, similar to limit). * Adding a feature flag to enable/disable offset in subquery support was: Subqueries with correlation under LIMIT with OFFSET have a correctness bug, introduced recently when support for correlation under OFFSET was enabled but were not handled correctly. (So we went from unsupported, query throws error -> wrong results.) 
It’s a bug in all types of correlated subqueries: scalar, lateral, IN, EXISTS It's easy to repro with a query like {code:java} create table x(x1 int, x2 int); insert into x values (1, 1), (2, 2); create table y(y1 int, y2 int); insert into y values (1, 1), (1, 2), (2, 4); select * from x where exists (select * from y where x1 = y1 limit 1 offset 2){code} Correct result: empty set, see postgres: [https://www.db-fiddle.com/f/dtXNn7hwDnemiCTUhvwgYM/0] Spark result: Array([2,2]) The [PR|https://github.com/apache/spark/pull/43111/files/324a106611e6d62c31535cfc43863fdaa16e5dda#diff-583171e935b2dc349378063a5841c5b98b30a2d57ac3743a9eccfe7bffcb8f2aR1403] where it was introduced added a test for it, but the golden file results for the test actually were incorrect and we didn't notice. I'll work on both: * Adding support for offset in DecorrelateInnerQuery (the transformation is into a filter on row_number window function, similar to limit). * Adding a feature flag to enable/disable offset in subquery support > Correctness bug in correlated subquery with OFFSET > -- > > Key: SPARK-46446 > URL: https://issues.apache.org/jira/browse/SPARK-46446 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 >Reporter: Jack Chen >Priority: Major > > Subqueries with correlation under LIMIT with OFFSET have a correctness bug, > introduced recently when support for correlation under OFFSET was enabled but > were not handled correctly. (So we went from unsupported, query throws error > -> wrong results.) 
> It’s a bug in all types of correlated subqueries: scalar, lateral, IN, EXISTS > > It's easy to repro with a query like > {code:java} > create table x(x1 int, x2 int); > insert into x values (1, 1), (2, 2); > create table y(y1 int, y2 int); > insert into y values (1, 1), (1, 2), (2, 4); > select * from x where exists (select * from y where x1 = y1 limit 1 offset > 2){code} > Correct result: empty set, see postgres: > [https://www.db-fiddle.com/f/dtXNn7hwDnemiCTUhvwgYM/0] > Spark result: Array([2,2]) > > The > [PR|https://github.com/apache/spark/pull/43111/files/324a106611e6d62c31535cfc43863fdaa16e5dda#diff-583171e935b2dc349378063a5841c5b98b30a2d57ac3743a9eccfe7bffcb8f2aR1403] > where it was introduced added a test for it, but the golden file results for > the test actually were incorrect and we didn't notice. (The bug was initially > found by https://github.com/apache/spark/pull/44084) > I'll work on both: > * Adding support for offset in DecorrelateInnerQuery (the transformation is > into a filter on row_number window function, similar to limit). > * Adding a feature flag to enable/disable offset in subquery support -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46446) Correctness bug in correlated subquery with OFFSET
[ https://issues.apache.org/jira/browse/SPARK-46446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jack Chen updated SPARK-46446: -- Description: Subqueries with correlation under LIMIT with OFFSET have a correctness bug, introduced recently when support for correlation under OFFSET was enabled but were not handled correctly. (So we went from unsupported, query throws error -> wrong results.) It’s a bug in all types of correlated subqueries: scalar, lateral, IN, EXISTS It's easy to repro with a query like {code:java} create table x(x1 int, x2 int); insert into x values (1, 1), (2, 2); create table y(y1 int, y2 int); insert into y values (1, 1), (1, 2), (2, 4); select * from x where exists (select * from y where x1 = y1 limit 1 offset 2){code} Correct result: empty set, see postgres: [https://www.db-fiddle.com/f/dtXNn7hwDnemiCTUhvwgYM/0] Spark result: Array([2,2]) The [PR|https://github.com/apache/spark/pull/43111/files/324a106611e6d62c31535cfc43863fdaa16e5dda#diff-583171e935b2dc349378063a5841c5b98b30a2d57ac3743a9eccfe7bffcb8f2aR1403] where it was introduced added a test for it, but the golden file results for the test actually were incorrect and we didn't notice. I'll work on both: * Adding support for offset in DecorrelateInnerQuery (the transformation is into a filter on row_number window function, similar to limit). * Adding a feature flag to enable/disable offset in subquery support was: Subqueries with correlation under LIMIT with OFFSET have a correctness bug, introduced recently when support for correlation under OFFSET was enabled but were not handled correctly. (So we went from unsupported, query throws error -> wrong results.) 
It’s a bug in all types of correlated subqueries: scalar, lateral, IN, EXISTS It's easy to repro with a query like {code:java} SELECT * FROM emp join lateral (SELECT dept.dept_name FROM dept WHERE emp.dept_id = dept.dept_id LIMIT 5 OFFSET 3); {code} The [PR|https://github.com/apache/spark/pull/43111/files/324a106611e6d62c31535cfc43863fdaa16e5dda#diff-583171e935b2dc349378063a5841c5b98b30a2d57ac3743a9eccfe7bffcb8f2aR1403] where it was introduced added a test for it, but the golden file results for the test actually were incorrect and we didn't notice. I'll work on both: * Adding support for offset in DecorrelateInnerQuery (the transformation is into a filter on row_number window function, similar to limit). * Adding a feature flag to enable/disable offset in subquery support > Correctness bug in correlated subquery with OFFSET > -- > > Key: SPARK-46446 > URL: https://issues.apache.org/jira/browse/SPARK-46446 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 >Reporter: Jack Chen >Priority: Major > > Subqueries with correlation under LIMIT with OFFSET have a correctness bug, > introduced recently when support for correlation under OFFSET was enabled but > were not handled correctly. (So we went from unsupported, query throws error > -> wrong results.) 
> It’s a bug in all types of correlated subqueries: scalar, lateral, IN, EXISTS > > It's easy to repro with a query like > {code:java} > create table x(x1 int, x2 int); > insert into x values (1, 1), (2, 2); > create table y(y1 int, y2 int); > insert into y values (1, 1), (1, 2), (2, 4); > select * from x where exists (select * from y where x1 = y1 limit 1 offset > 2){code} > Correct result: empty set, see postgres: > [https://www.db-fiddle.com/f/dtXNn7hwDnemiCTUhvwgYM/0] > Spark result: Array([2,2]) > > The > [PR|https://github.com/apache/spark/pull/43111/files/324a106611e6d62c31535cfc43863fdaa16e5dda#diff-583171e935b2dc349378063a5841c5b98b30a2d57ac3743a9eccfe7bffcb8f2aR1403] > where it was introduced added a test for it, but the golden file results for > the test actually were incorrect and we didn't notice. > I'll work on both: > * Adding support for offset in DecorrelateInnerQuery (the transformation is > into a filter on row_number window function, similar to limit). > * Adding a feature flag to enable/disable offset in subquery support -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46446) Correctness bug in correlated subquery with OFFSET
Jack Chen created SPARK-46446: - Summary: Correctness bug in correlated subquery with OFFSET Key: SPARK-46446 URL: https://issues.apache.org/jira/browse/SPARK-46446 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 4.0.0 Reporter: Jack Chen Subqueries with correlation under LIMIT with OFFSET have a correctness bug, introduced recently when support for correlation under OFFSET was enabled but were not handled correctly. (So we went from unsupported, query throws error -> wrong results.) It’s a bug in all types of correlated subqueries: scalar, lateral, IN, EXISTS It's easy to repro with a query like {code:java} SELECT * FROM emp join lateral (SELECT dept.dept_name FROM dept WHERE emp.dept_id = dept.dept_id LIMIT 5 OFFSET 3); {code} The [PR|https://github.com/apache/spark/pull/43111/files/324a106611e6d62c31535cfc43863fdaa16e5dda#diff-583171e935b2dc349378063a5841c5b98b30a2d57ac3743a9eccfe7bffcb8f2aR1403] where it was introduced added a test for it, but the golden file results for the test actually were incorrect and we didn't notice. I'll work on both: * Adding support for offset in DecorrelateInnerQuery (the transformation is into a filter on row_number window function, similar to limit). * Adding a feature flag to enable/disable offset in subquery support -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
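The proposed fix rewrites a correlated LIMIT/OFFSET into a filter on a row_number window function partitioned by the correlation keys, so the offset and limit apply within each outer row's group of matches rather than once globally. A toy Python sketch of that per-group semantics (hypothetical helper, not Spark code; rows are sorted here only to make the otherwise nondeterministic LIMIT/OFFSET order reproducible):

```python
# Per-group LIMIT/OFFSET via row_number: emulates rewriting
#   ... where x1 = y1 limit L offset K
# into a filter  K < row_number <= K + L  over (partition by y1).
from itertools import groupby

def per_group_limit_offset(rows, key, limit, offset):
    out = []
    for _, group in groupby(sorted(rows), key=key):
        for row_number, row in enumerate(group, start=1):
            # keep rows with offset < row_number <= offset + limit
            if offset < row_number <= offset + limit:
                out.append(row)
    return out

y = [(1, 1), (1, 2), (2, 4)]  # (y1, y2) rows from the repro above

# limit 1 offset 2: no y1 group has a 3rd row, so EXISTS is false for every x row
print(per_group_limit_offset(y, lambda r: r[0], limit=1, offset=2))  # []

# limit 1 offset 1: only groups with a 2nd row survive
print(per_group_limit_offset(y, lambda r: r[0], limit=1, offset=1))  # [(1, 2)]
```

The empty result for limit 1 offset 2 matches the "correct result: empty set" in the repro; applying the offset globally instead of per group is what produced the wrong Array([2,2]) answer.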
[jira] [Commented] (SPARK-45009) Correlated EXISTS subqueries in join ON condition unsupported and fail with internal error
[ https://issues.apache.org/jira/browse/SPARK-45009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17761003#comment-17761003 ] Jack Chen commented on SPARK-45009: --- https://github.com/apache/spark/pull/42725 > Correlated EXISTS subqueries in join ON condition unsupported and fail with > internal error > -- > > Key: SPARK-45009 > URL: https://issues.apache.org/jira/browse/SPARK-45009 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Jack Chen >Priority: Major > > They are not handled in decorrelation, and we also don’t have any checks to > block them, so these queries have outer references in the query plan leading > to internal errors: > {code:java} > CREATE TEMP VIEW x(x1, x2) AS VALUES (0, 1), (1, 2); > CREATE TEMP VIEW y(y1, y2) AS VALUES (0, 2), (0, 3); > CREATE TEMP VIEW z(z1, z2) AS VALUES (0, 2), (0, 3); > select * from x left join y on x1 = y1 and exists (select * from z where z1 = > x1) > Error occurred during query planning: > org.apache.spark.sql.catalyst.plans.logical.Filter cannot be cast to > org.apache.spark.sql.execution.SparkPlan {code} > PullupCorrelatedPredicates#apply and RewritePredicateSubquery only handle > subqueries in UnaryNode, it seems to assume that they cannot occur elsewhere, > like in a join ON condition. > We will need to decide whether to block them properly in analysis (i.e. give > a proper error for them), or see if we can add support for them. > Also note, scalar subqueries in the ON condition are unsupported too but > return a proper error. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45009) Correlated EXISTS subqueries in join ON condition unsupported and fail with internal error
Jack Chen created SPARK-45009: - Summary: Correlated EXISTS subqueries in join ON condition unsupported and fail with internal error Key: SPARK-45009 URL: https://issues.apache.org/jira/browse/SPARK-45009 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.0 Reporter: Jack Chen They are not handled in decorrelation, and we also don’t have any checks to block them, so these queries have outer references in the query plan leading to internal errors: {code:java} CREATE TEMP VIEW x(x1, x2) AS VALUES (0, 1), (1, 2); CREATE TEMP VIEW y(y1, y2) AS VALUES (0, 2), (0, 3); CREATE TEMP VIEW z(z1, z2) AS VALUES (0, 2), (0, 3); select * from x left join y on x1 = y1 and exists (select * from z where z1 = x1) Error occurred during query planning: org.apache.spark.sql.catalyst.plans.logical.Filter cannot be cast to org.apache.spark.sql.execution.SparkPlan {code} PullupCorrelatedPredicates#apply and RewritePredicateSubquery only handle subqueries in UnaryNode, it seems to assume that they cannot occur elsewhere, like in a join ON condition. We will need to decide whether to block them properly in analysis (i.e. give a proper error for them), or see if we can add support for them. Also note, scalar subqueries in the ON condition are unsupported too but return a proper error. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44550) Wrong semantics for null IN (empty list)
[ https://issues.apache.org/jira/browse/SPARK-44550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jack Chen updated SPARK-44550: -- Fix Version/s: (was: 3.5.0) > Wrong semantics for null IN (empty list) > > > Key: SPARK-44550 > URL: https://issues.apache.org/jira/browse/SPARK-44550 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Jack Chen >Assignee: Jack Chen >Priority: Major > > {{null IN (empty list)}} incorrectly evaluates to null, when it should > evaluate to false. (The reason it should be false is because a IN (b1, b2) is > defined as a = b1 OR a = b2, and an empty IN list is treated as an empty OR > which is false. This is specified by ANSI SQL.) > Many places in Spark execution (In, InSet, InSubquery) and optimization > (OptimizeIn, NullPropagation) implemented this wrong behavior. Also note that > the Spark behavior for the null IN (empty list) is inconsistent in some > places - literal IN lists generally return null (incorrect), while IN/NOT IN > subqueries mostly return false/true, respectively (correct) in this case. > This is a longstanding correctness issue which has existed since null support > for IN expressions was first added to Spark. > Doc with more details: > [https://docs.google.com/document/d/1k8AY8oyT-GI04SnP7eXttPDnDj-Ek-c3luF2zL6DPNU/edit] > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44551) Wrong semantics for null IN (empty list) - IN expression execution
Jack Chen created SPARK-44551: - Summary: Wrong semantics for null IN (empty list) - IN expression execution Key: SPARK-44551 URL: https://issues.apache.org/jira/browse/SPARK-44551 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.0 Reporter: Jack Chen {{null IN (empty list)}} incorrectly evaluates to null, when it should evaluate to false. (The reason it should be false is because a IN (b1, b2) is defined as a = b1 OR a = b2, and an empty IN list is treated as an empty OR which is false. This is specified by ANSI SQL.) Many places in Spark execution (In, InSet, InSubquery) and optimization (OptimizeIn, NullPropagation) implemented this wrong behavior. Also note that the Spark behavior for the null IN (empty list) is inconsistent in some places - literal IN lists generally return null (incorrect), while IN/NOT IN subqueries mostly return false/true, respectively (correct) in this case. This is a longstanding correctness issue which has existed since null support for IN expressions was first added to Spark. Doc with more details: [https://docs.google.com/document/d/1k8AY8oyT-GI04SnP7eXttPDnDj-Ek-c3luF2zL6DPNU/edit] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44431) Wrong semantics for null IN (empty list) - optimization rules
[ https://issues.apache.org/jira/browse/SPARK-44431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jack Chen updated SPARK-44431: -- Parent: SPARK-44550 Issue Type: Sub-task (was: Bug) > Wrong semantics for null IN (empty list) - optimization rules > - > > Key: SPARK-44431 > URL: https://issues.apache.org/jira/browse/SPARK-44431 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Jack Chen >Assignee: Jack Chen >Priority: Major > Fix For: 3.5.0 > > > {{null IN (empty list)}} incorrectly evaluates to null, when it should > evaluate to false. (The reason it should be false is because a IN (b1, b2) is > defined as a = b1 OR a = b2, and an empty IN list is treated as an empty OR > which is false. This is specified by ANSI SQL.) > Many places in Spark execution (In, InSet, InSubquery) and optimization > (OptimizeIn, NullPropagation) implemented this wrong behavior. Also note that > the Spark behavior for the null IN (empty list) is inconsistent in some > places - literal IN lists generally return null (incorrect), while IN/NOT IN > subqueries mostly return false/true, respectively (correct) in this case. > This is a longstanding correctness issue which has existed since null support > for IN expressions was first added to Spark. > Doc with more details: > [https://docs.google.com/document/d/1k8AY8oyT-GI04SnP7eXttPDnDj-Ek-c3luF2zL6DPNU/edit] > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44431) Wrong semantics for null IN (empty list) - optimization rules
[ https://issues.apache.org/jira/browse/SPARK-44431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jack Chen updated SPARK-44431: -- Summary: Wrong semantics for null IN (empty list) - optimization rules (was: Wrong semantics for null IN (empty list)) > Wrong semantics for null IN (empty list) - optimization rules > - > > Key: SPARK-44431 > URL: https://issues.apache.org/jira/browse/SPARK-44431 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Jack Chen >Assignee: Jack Chen >Priority: Major > Fix For: 3.5.0 > > > {{null IN (empty list)}} incorrectly evaluates to null, when it should > evaluate to false. (The reason it should be false is because a IN (b1, b2) is > defined as a = b1 OR a = b2, and an empty IN list is treated as an empty OR > which is false. This is specified by ANSI SQL.) > Many places in Spark execution (In, InSet, InSubquery) and optimization > (OptimizeIn, NullPropagation) implemented this wrong behavior. Also note that > the Spark behavior for the null IN (empty list) is inconsistent in some > places - literal IN lists generally return null (incorrect), while IN/NOT IN > subqueries mostly return false/true, respectively (correct) in this case. > This is a longstanding correctness issue which has existed since null support > for IN expressions was first added to Spark. > Doc with more details: > [https://docs.google.com/document/d/1k8AY8oyT-GI04SnP7eXttPDnDj-Ek-c3luF2zL6DPNU/edit] > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44550) Wrong semantics for null IN (empty list)
Jack Chen created SPARK-44550: - Summary: Wrong semantics for null IN (empty list) Key: SPARK-44550 URL: https://issues.apache.org/jira/browse/SPARK-44550 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.4.0 Reporter: Jack Chen Assignee: Jack Chen Fix For: 3.5.0 {{null IN (empty list)}} incorrectly evaluates to null, when it should evaluate to false. (The reason it should be false is because a IN (b1, b2) is defined as a = b1 OR a = b2, and an empty IN list is treated as an empty OR which is false. This is specified by ANSI SQL.) Many places in Spark execution (In, InSet, InSubquery) and optimization (OptimizeIn, NullPropagation) implemented this wrong behavior. Also note that the Spark behavior for the null IN (empty list) is inconsistent in some places - literal IN lists generally return null (incorrect), while IN/NOT IN subqueries mostly return false/true, respectively (correct) in this case. This is a longstanding correctness issue which has existed since null support for IN expressions was first added to Spark. Doc with more details: [https://docs.google.com/document/d/1k8AY8oyT-GI04SnP7eXttPDnDj-Ek-c3luF2zL6DPNU/edit] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
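The ANSI expansion described above — {{a IN (b1, b2)}} is {{a = b1 OR a = b2}}, and the empty OR is false — can be checked with a small three-valued-logic evaluator. This is a sketch of the semantics only, not Spark's implementation:

```python
# Three-valued (True / False / None-as-unknown) evaluation of "a IN (list)":
# a IN (b1, ..., bn) == (a = b1) OR ... OR (a = bn); the empty OR is False.
def sql_eq(a, b):
    # NULL compared to anything is unknown
    return None if a is None or b is None else a == b

def sql_or(p, q):
    if p is True or q is True:
        return True
    if p is None or q is None:
        return None
    return False

def sql_in(a, in_list):
    result = False            # identity of OR: empty IN list yields False
    for b in in_list:
        result = sql_or(result, sql_eq(a, b))
    return result

print(sql_in(None, []))       # False -- not None; the behavior this ticket fixes
print(sql_in(None, [1, 2]))   # None  -- unknown, since NULL = anything is unknown
print(sql_in(1, [2, None]))   # None
print(sql_in(1, [1, None]))   # True  -- one definite match dominates the unknown
```

The last three cases show why literal IN lists returning NULL is correct in general; only the empty-list case was wrong, because the implementations treated it like a list of all-unknown comparisons instead of an empty OR.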
[jira] [Updated] (SPARK-44448) Wrong results for dense_rank() <= k from InferWindowGroupLimit and DenseRankLimitIterator
[ https://issues.apache.org/jira/browse/SPARK-44448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jack Chen updated SPARK-44448: -- Affects Version/s: 3.5.0 (was: 3.4.0) > Wrong results for dense_rank() <= k from InferWindowGroupLimit and > DenseRankLimitIterator > - > > Key: SPARK-44448 > URL: https://issues.apache.org/jira/browse/SPARK-44448 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: Jack Chen >Priority: Major > > Top-k filters on a dense_rank() window function return wrong results, due to > a bug in optimization InferWindowGroupLimit, specifically in the code for > DenseRankLimitIterator, introduced in > https://issues.apache.org/jira/browse/SPARK-37099. > Repro: > {code:java} > create or replace temp view t1 (p, o) as values (1, 1), (1, 1), (1, 2), (2, > 1), (2, 1), (2, 2); > select * from (select *, dense_rank() over (partition by p order by o) as rnk > from t1) where rnk = 1;{code} > Spark result: > {code:java} > [1,1,1] > [1,1,1] > [2,1,1]{code} > Correct result: > {code:java} > [1,1,1] > [1,1,1] > [2,1,1] > [2,1,1]{code} > > The bug is in {{DenseRankLimitIterator}}: it fails to reset state > properly when transitioning from one window partition to the next. {{reset}} > only resets {{rank = 0}}; what it is missing is to reset > {{currentRankRow = null}}. This means that when processing the second and > later window partitions, the rank incorrectly gets incremented based on > comparing the ordering of the last row of the previous partition to the first > row of the new partition. > This means that a dense_rank window func that has more than one window > partition and more than one row with dense_rank = 1 in the second or later > partitions can give wrong results when optimized. 
> ({{RankLimitIterator}} narrowly avoids this bug by happenstance: the > first row in the new partition will try to increment rank, but increments it > by the value of count, which is 0, so it happens to work by accident.) > Unfortunately, tests for the optimization only had a single row per rank, so > did not catch the bug, as the bug requires multiple rows per rank.
[jira] [Commented] (SPARK-44448) Wrong results for dense_rank() <= k from InferWindowGroupLimit and DenseRankLimitIterator
[ https://issues.apache.org/jira/browse/SPARK-44448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17743594#comment-17743594 ] Jack Chen commented on SPARK-44448: --- Fix PR: https://github.com/apache/spark/pull/42026 > Wrong results for dense_rank() <= k from InferWindowGroupLimit and > DenseRankLimitIterator > - > > Key: SPARK-44448 > URL: https://issues.apache.org/jira/browse/SPARK-44448 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Jack Chen >Priority: Major > > Top-k filters on a dense_rank() window function return wrong results, due to > a bug in optimization InferWindowGroupLimit, specifically in the code for > DenseRankLimitIterator, introduced in > https://issues.apache.org/jira/browse/SPARK-37099. > Repro: > {code:java} > create or replace temp view t1 (p, o) as values (1, 1), (1, 1), (1, 2), (2, > 1), (2, 1), (2, 2); > select * from (select *, dense_rank() over (partition by p order by o) as rnk > from t1) where rnk = 1;{code} > Spark result: > {code:java} > [1,1,1] > [1,1,1] > [2,1,1]{code} > Correct result: > {code:java} > [1,1,1] > [1,1,1] > [2,1,1] > [2,1,1]{code} > > The bug is in {{DenseRankLimitIterator}}: it fails to reset state > properly when transitioning from one window partition to the next. {{reset}} > only resets {{rank = 0}}; what it is missing is to reset > {{currentRankRow = null}}. This means that when processing the second and > later window partitions, the rank incorrectly gets incremented based on > comparing the ordering of the last row of the previous partition to the first > row of the new partition. > This means that a dense_rank window func that has more than one window > partition and more than one row with dense_rank = 1 in the second or later > partitions can give wrong results when optimized. 
> ({{RankLimitIterator}} narrowly avoids this bug by happenstance: the > first row in the new partition will try to increment rank, but increments it > by the value of count, which is 0, so it happens to work by accident.) > Unfortunately, tests for the optimization only had a single row per rank, so > did not catch the bug, as the bug requires multiple rows per rank.
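The missing reset can be reproduced in miniature. Below is a hypothetical Python simplification of the pruning pass (the real iterator is Scala operating on UnsafeRows with an ordering comparator; names and structure here are illustrative, not Spark's code): the pruning iterator passes each partition's rows through until the dense rank reaches the limit, and the exact Window + Filter still runs downstream. When the last-seen row is not cleared at a partition boundary, the first row of the next partition spuriously bumps the rank, so pruning cuts the partition short:

```python
# Miniature model of WindowGroupLimit pruning for a dense_rank() <= k filter.
def prune(partitions, limit, buggy=False):
    kept = []
    rank, current = 0, None
    for part in partitions:
        rank = 0                  # reset() between partitions
        if not buggy:
            current = None        # the missing line: also forget the last row
        pruned = []
        for row in part:
            if rank >= limit:     # later rows can only have a higher dense rank
                break
            if current is None:
                current = row
            elif row != current:  # ordering value changed -> rank advances
                rank += 1
                current = row
            pruned.append(row)
        kept.append(pruned)
    return kept

def final_filter(kept, k):
    # downstream Window + Filter: exact dense_rank over whatever survived pruning
    out = []
    for part in kept:
        rank, seen = 0, None
        for row in part:
            if row != seen:
                rank, seen = rank + 1, row
            if rank <= k:
                out.append(row)
    return out

parts = [[1, 1, 2], [1, 1, 2]]    # ordering values o for partitions p=1, p=2
print(final_filter(prune(parts, 1), 1))              # [1, 1, 1, 1]
print(final_filter(prune(parts, 1, buggy=True), 1))  # [1, 1, 1] -- a row is lost
```

In buggy mode the stale `current` (the o=2 row left over from the first partition) differs from the second partition's first o=1 row, so the rank jumps to 1 after a single row and the duplicate rank-1 row is pruned away, matching the three-row wrong result in the repro.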
[jira] [Updated] (SPARK-44448) Wrong results for dense_rank() <= k from InferWindowGroupLimit and DenseRankLimitIterator
[ https://issues.apache.org/jira/browse/SPARK-8?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jack Chen updated SPARK-8: -- Description: Top-k filters on a dense_rank() window function return wrong results, due to a bug in optimization InferWindowGroupLimit, specifically in the code for DenseRankLimitIterator, introduced in https://issues.apache.org/jira/browse/SPARK-37099. Repro: {code:java} create or replace temp view t1 (p, o) as values (1, 1), (1, 1), (1, 2), (2, 1), (2, 1), (2, 2); select * from (select *, dense_rank() over (partition by p order by o) as rnk from t1) where rnk = 1;{code} Spark result: {code:java} [1,1,1] [1,1,1] [2,1,1]{code} Correct result: {code:java} [1,1,1] [1,1,1] [2,1,1] [2,1,1]{code} The bug is in {{{}DenseRankLimitIterator{}}}, it fails to reset state properly when transitioning from one window partition to the next. {{reset}} only resets {{{}rank = 0{}}}, what it is missing is to reset {{{}currentRankRow = null{}}}. This means that when processing the second and later window partitions, the rank incorrectly gets incremented based on comparing the ordering of the last row of the previous partition to the first row of the new partition. This means that a dense_rank window func that has more than one window partition and more than one row with dense_rank = 1 in the second or later partitions can give wrong results when optimized. ({{{}RankLimitIterator{}}} narrowly avoids this bug by happenstance, the first row in the new partition will try to increment rank, but increment it by the value of count which is 0, so it happens to work by accident). Unfortunately, tests for the optimization only had a single row per rank, so did not catch the bug as the bug requires multiple rows per rank. was: Top-k filters on a dense_rank() window function return wrong results, due to a bug in optimization InferWindowGroupLimit, specifically in the code for DenseRankLimitIterator. 
Repro: {code:java} create or replace temp view t1 (p, o) as values (1, 1), (1, 1), (1, 2), (2, 1), (2, 1), (2, 2); select * from (select *, dense_rank() over (partition by p order by o) as rnk from t1) where rnk = 1;{code} Spark result: {code:java} [1,1,1] [1,1,1] [2,1,1]{code} Correct result: {code:java} [1,1,1] [1,1,1] [2,1,1] [2,1,1]{code} The bug is in {{DenseRankLimitIterator}}: it fails to reset state properly when transitioning from one window partition to the next. {{reset}} only resets {{rank = 0}}; it is also missing a reset of {{currentRankRow = null}}. This means that when processing the second and later window partitions, the rank incorrectly gets incremented based on comparing the ordering of the last row of the previous partition to the first row of the new partition. As a result, a dense_rank window function that has more than one window partition and more than one row with dense_rank = 1 in the second or later partitions can give wrong results when optimized. ({{RankLimitIterator}} narrowly avoids this bug by happenstance: the first row in the new partition will try to increment rank, but increments it by the value of count, which is 0, so it happens to work by accident.) Unfortunately, tests for the optimization only had a single row per rank, so they did not catch the bug, as the bug requires multiple rows per rank. > Wrong results for dense_rank() <= k from InferWindowGroupLimit and > DenseRankLimitIterator > - > > Key: SPARK-44448 > URL: https://issues.apache.org/jira/browse/SPARK-44448 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Jack Chen >Priority: Major > > Top-k filters on a dense_rank() window function return wrong results, due to > a bug in the optimization InferWindowGroupLimit, specifically in the code for > DenseRankLimitIterator, introduced in > https://issues.apache.org/jira/browse/SPARK-37099.
> Repro: > {code:java} > create or replace temp view t1 (p, o) as values (1, 1), (1, 1), (1, 2), (2, > 1), (2, 1), (2, 2); > select * from (select *, dense_rank() over (partition by p order by o) as rnk > from t1) where rnk = 1;{code} > Spark result: > {code:java} > [1,1,1] > [1,1,1] > [2,1,1]{code} > Correct result: > {code:java} > [1,1,1] > [1,1,1] > [2,1,1] > [2,1,1]{code} > > The bug is in {{DenseRankLimitIterator}}: it fails to reset state > properly when transitioning from one window partition to the next. {{reset}} > only resets {{rank = 0}}; it is also missing a reset of > {{currentRankRow = null}}. This means that when processing the second and > later window partitions, the rank incorrectly gets incremented based on > comparing the ordering of the last row of the previous partition to the first > row of the new partition.
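The reset bug described above can be sketched in a small Python model (class and field names follow the ticket; the real Scala implementation differs, and this sketch may pass one extra boundary row per partition, which the downstream Window operator's own rnk filter would remove):

```python
# Simplified model of the group-limit pre-filter; None / key comparison
# stands in for the row ordering comparison in the real iterator.
class DenseRankLimitIterator:
    def __init__(self, limit):
        self.limit = limit
        self.rank = 0
        self.current_rank_row = None

    def reset(self, fixed):
        self.rank = 0
        if fixed:
            self.current_rank_row = None  # the reset the ticket says is missing

    def keep(self, order_key):
        # Pass the row through while its dense rank (rank + 1) is within the limit.
        emit = self.rank < self.limit
        if self.current_rank_row is None:
            self.current_rank_row = order_key
        elif order_key != self.current_rank_row:
            self.rank += 1
            self.current_rank_row = order_key
        return emit


def prefilter(partitions, limit, fixed):
    it = DenseRankLimitIterator(limit)
    kept = []
    for part in partitions:
        it.reset(fixed)  # called when transitioning to a new partition
        for key in part:
            if it.keep(key):
                kept.append(key)
    return kept


# Ordering keys (column o) of the repro's two partitions, p = 1 and p = 2:
parts = [[1, 1, 2], [1, 1, 2]]
# Buggy reset: partition 2's first row is compared against partition 1's
# last row, rank jumps to 1, and a rank-1 row is wrongly dropped:
print(prefilter(parts, limit=1, fixed=False))  # [1, 1, 2, 1]
# Fixed reset: both rank-1 rows of partition 2 survive the pre-filter:
print(prefilter(parts, limit=1, fixed=True))   # [1, 1, 2, 1, 1, 2]
```

With the buggy reset, only one of partition 2's two rank-1 rows reaches the Window operator, matching the missing [2,1,1] row in the repro.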
[jira] [Created] (SPARK-44448) Wrong results for dense_rank() <= k from InferWindowGroupLimit and DenseRankLimitIterator
Jack Chen created SPARK-44448: - Summary: Wrong results for dense_rank() <= k from InferWindowGroupLimit and DenseRankLimitIterator Key: SPARK-44448 URL: https://issues.apache.org/jira/browse/SPARK-44448 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.4.0 Reporter: Jack Chen Top-k filters on a dense_rank() window function return wrong results, due to a bug in the optimization InferWindowGroupLimit, specifically in the code for DenseRankLimitIterator. Repro: {code:java} create or replace temp view t1 (p, o) as values (1, 1), (1, 1), (1, 2), (2, 1), (2, 1), (2, 2); select * from (select *, dense_rank() over (partition by p order by o) as rnk from t1) where rnk = 1;{code} Spark result: {code:java} [1,1,1] [1,1,1] [2,1,1]{code} Correct result: {code:java} [1,1,1] [1,1,1] [2,1,1] [2,1,1]{code} The bug is in {{DenseRankLimitIterator}}: it fails to reset state properly when transitioning from one window partition to the next. {{reset}} only resets {{rank = 0}}; it is also missing a reset of {{currentRankRow = null}}. This means that when processing the second and later window partitions, the rank incorrectly gets incremented based on comparing the ordering of the last row of the previous partition to the first row of the new partition. As a result, a dense_rank window function that has more than one window partition and more than one row with dense_rank = 1 in the second or later partitions can give wrong results when optimized. ({{RankLimitIterator}} narrowly avoids this bug by happenstance: the first row in the new partition will try to increment rank, but increments it by the value of count, which is 0, so it happens to work by accident.) Unfortunately, tests for the optimization only had a single row per rank, so they did not catch the bug, as the bug requires multiple rows per rank.
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
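For comparison, the expected result can be cross-checked against another SQL engine. A quick check with Python's built-in sqlite3 (assuming SQLite >= 3.25, which added window function support) produces the four rows listed as the correct result above:

```python
import sqlite3  # window functions require SQLite >= 3.25

con = sqlite3.connect(":memory:")
con.executescript("""
create table t1(p int, o int);
insert into t1 values (1,1),(1,1),(1,2),(2,1),(2,1),(2,2);
""")
rows = con.execute("""
    select * from (
        select p, o, dense_rank() over (partition by p order by o) as rnk
        from t1
    ) where rnk = 1
""").fetchall()
# sorted() because row order of the outer select is not guaranteed
print(sorted(rows))  # [(1, 1, 1), (1, 1, 1), (2, 1, 1), (2, 1, 1)]
```

Note both rank-1 rows of partition p = 2 are present, unlike the buggy Spark output.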
[jira] [Updated] (SPARK-44431) Wrong semantics for null IN (empty list)
[ https://issues.apache.org/jira/browse/SPARK-44431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jack Chen updated SPARK-44431: -- Description: {{null IN (empty list)}} incorrectly evaluates to null, when it should evaluate to false. (The reason it should be false is because a IN (b1, b2) is defined as a = b1 OR a = b2, and an empty IN list is treated as an empty OR which is false. This is specified by ANSI SQL.) Many places in Spark execution (In, InSet, InSubquery) and optimization (OptimizeIn, NullPropagation) implemented this wrong behavior. Also note that the Spark behavior for the null IN (empty list) is inconsistent in some places - literal IN lists generally return null (incorrect), while IN/NOT IN subqueries mostly return false/true, respectively (correct) in this case. This is a longstanding correctness issue which has existed since null support for IN expressions was first added to Spark. Doc with more details: [https://docs.google.com/document/d/1k8AY8oyT-GI04SnP7eXttPDnDj-Ek-c3luF2zL6DPNU/edit] was: {{null IN (empty list)}} incorrectly evaluates to null, when it should evaluate to false. (The reason it should be false is because a IN (b1, b2) is defined as a = b1 OR a = b2, and an empty IN list is treated as an empty OR which is false. This is specified by ANSI SQL.) Many places in Spark execution (In, InSet, InSubquery) and optimization (OptimizeIn, NullPropagation) implemented this wrong behavior. Also note that the Spark behavior for the null IN (empty list) is inconsistent in some places - literal IN lists generally return null (incorrect), while IN/NOT IN subqueries mostly return false/true, respectively (correct) in this case. This is a longstanding correctness issue which has existed since null support for IN expressions was first added to Spark. 
Doc with more details: https://docs.google.com/document/d/15ttcB3OjGx5_WFKHB2COjQUbFHj5LrfNQv_d26o-wmI/edit > Wrong semantics for null IN (empty list) > > > Key: SPARK-44431 > URL: https://issues.apache.org/jira/browse/SPARK-44431 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Jack Chen >Priority: Major > > {{null IN (empty list)}} incorrectly evaluates to null, when it should > evaluate to false. (The reason it should be false is because a IN (b1, b2) is > defined as a = b1 OR a = b2, and an empty IN list is treated as an empty OR > which is false. This is specified by ANSI SQL.) > Many places in Spark execution (In, InSet, InSubquery) and optimization > (OptimizeIn, NullPropagation) implemented this wrong behavior. Also note that > the Spark behavior for the null IN (empty list) is inconsistent in some > places - literal IN lists generally return null (incorrect), while IN/NOT IN > subqueries mostly return false/true, respectively (correct) in this case. > This is a longstanding correctness issue which has existed since null support > for IN expressions was first added to Spark. > Doc with more details: > [https://docs.google.com/document/d/1k8AY8oyT-GI04SnP7eXttPDnDj-Ek-c3luF2zL6DPNU/edit]
[jira] [Created] (SPARK-44431) Wrong semantics for null IN (empty list)
Jack Chen created SPARK-44431: - Summary: Wrong semantics for null IN (empty list) Key: SPARK-44431 URL: https://issues.apache.org/jira/browse/SPARK-44431 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.4.0 Reporter: Jack Chen {{null IN (empty list)}} incorrectly evaluates to null, when it should evaluate to false. (The reason it should be false is because a IN (b1, b2) is defined as a = b1 OR a = b2, and an empty IN list is treated as an empty OR which is false. This is specified by ANSI SQL.) Many places in Spark execution (In, InSet, InSubquery) and optimization (OptimizeIn, NullPropagation) implemented this wrong behavior. Also note that the Spark behavior for the null IN (empty list) is inconsistent in some places - literal IN lists generally return null (incorrect), while IN/NOT IN subqueries mostly return false/true, respectively (correct) in this case. This is a longstanding correctness issue which has existed since null support for IN expressions was first added to Spark. Doc with more details: https://docs.google.com/document/d/15ttcB3OjGx5_WFKHB2COjQUbFHj5LrfNQv_d26o-wmI/edit
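The ANSI definition cited above (IN as a chain of equalities joined by OR, with the empty OR being false) can be modeled directly in three-valued logic. An illustrative Python sketch, where None stands for SQL NULL:

```python
# SQL three-valued logic: True, False, or None (NULL).
def sql_eq(a, b):
    # Any comparison against NULL is NULL.
    if a is None or b is None:
        return None
    return a == b

def sql_or(x, y):
    # TRUE dominates; otherwise NULL is contagious.
    if x is True or y is True:
        return True
    if x is None or y is None:
        return None
    return False

def sql_in(a, values):
    # a IN (v1, ..., vn) is defined as (a = v1) OR ... OR (a = vn);
    # the empty OR is FALSE, so NULL IN () must be FALSE, not NULL.
    result = False
    for v in values:
        result = sql_or(result, sql_eq(a, v))
    return result

print(sql_in(None, []))      # False (the correct ANSI result for the empty list)
print(sql_in(None, [1, 2]))  # None  (NULL, as expected for a non-empty list)
print(sql_in(1, [1, None]))  # True
```

Under this definition the empty-list case never touches the NULL left-hand side at all, which is exactly why returning NULL for it is wrong.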
[jira] [Updated] (SPARK-43596) Subquery decorrelation rewriteDomainJoins failure from ConstantFolding to isnull
[ https://issues.apache.org/jira/browse/SPARK-43596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jack Chen updated SPARK-43596: -- Description: We can get a decorrelation error because of rewrites that run in between DecorrelateInnerQuery and rewriteDomainJoins, that modify the correlation join conditions. In particular, ConstantFolding can transform `innercol <=> null` to `isnull(innercol)` and then rewriteDomainJoins does not recognize this and throws error Unable to rewrite domain join with conditions: ArrayBuffer(isnull(innercol#280)) because the isnull is not an equality, so it isn't usable for rewriting the domain join. Can fix by recognizing `isnull(innercol)` as `innercol <=> null` in rewriteDomainJoins. This area is also fragile in general and other rewrites that run between the two steps of decorrelation could potentially break their assumptions, so we may want to investigate longer-term follow ups for that. was: We can get a decorrelation error because of rewrites that run in between DecorrelateInnerQuery and rewriteDomainJoins, that modify the correlation join conditions. In particular, ConstantFolding can transform `innercol <=> null` to `isnull(innercol)` and then rewriteDomainJoins does not recognize this and throws error Unable to rewrite domain join with conditions: ArrayBuffer(isnull(innercol#280)) because the isnull is not an equality, so it isn't usable for rewriting the domain join. Can fix by recognizing `isnull(x)` as `x <=> null` in rewriteDomainJoins. This area is also fragile in general and other rewrites that run between the two steps of decorrelation could potentially break their assumptions, so we may want to investigate longer-term follow ups for that. 
> Subquery decorrelation rewriteDomainJoins failure from ConstantFolding to > isnull > > > Key: SPARK-43596 > URL: https://issues.apache.org/jira/browse/SPARK-43596 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Jack Chen >Priority: Major > > We can get a decorrelation error because of rewrites that run in between > DecorrelateInnerQuery and rewriteDomainJoins, that modify the correlation > join conditions. In particular, ConstantFolding can transform `innercol <=> > null` to `isnull(innercol)` and then rewriteDomainJoins does not recognize > this and throws error Unable to rewrite domain join with conditions: > ArrayBuffer(isnull(innercol#280)) because the isnull is not an equality, so > it isn't usable for rewriting the domain join. > Can fix by recognizing `isnull(innercol)` as `innercol <=> null` in > rewriteDomainJoins. > This area is also fragile in general and other rewrites that run between the > two steps of decorrelation could potentially break their assumptions, so we > may want to investigate longer-term follow ups for that.
[jira] [Updated] (SPARK-43596) Subquery decorrelation rewriteDomainJoins failure from ConstantFolding to isnull
[ https://issues.apache.org/jira/browse/SPARK-43596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jack Chen updated SPARK-43596: -- Description: We can get a decorrelation error because of rewrites that run in between DecorrelateInnerQuery and rewriteDomainJoins, that modify the correlation join conditions. In particular, ConstantFolding can transform `innercol <=> null` to `isnull(innercol)` and then rewriteDomainJoins does not recognize this and throws error Unable to rewrite domain join with conditions: ArrayBuffer(isnull(innercol#280)) because the isnull is not an equality, so it isn't usable for rewriting the domain join. Can fix by recognizing `isnull(x)` as `x <=> null` in rewriteDomainJoins. This area is also fragile in general and other rewrites that run between the two steps of decorrelation could potentially break their assumptions, so we may want to investigate longer-term follow ups for that. was: We can get a decorrelation error because of rewrites that run in between DecorrelateInnerQuery and rewriteDomainJoins, that modify the correlation join conditions. In particular, ConstantFolding can transform `innercol <=> null` to `isnull(innercol)` and then rewriteDomainJoins does not recognize this and throws error Unable to rewrite domain join with conditions: ArrayBuffer(isnull(innercol#280)) because the isnull is not an equality, so it isn't usable for rewriting the domain join. Can fix by recognizing `isnull\(x)` as `x <=> null` in rewriteDomainJoins. 
This area is also fragile in general and other rewrites that run between the two steps of decorrelation could potentially break their assumptions, so we may want to investigate longer-term follow ups for that. > Subquery decorrelation rewriteDomainJoins failure from ConstantFolding to > isnull > > > Key: SPARK-43596 > URL: https://issues.apache.org/jira/browse/SPARK-43596 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Jack Chen >Priority: Major > > We can get a decorrelation error because of rewrites that run in between > DecorrelateInnerQuery and rewriteDomainJoins, that modify the correlation > join conditions. In particular, ConstantFolding can transform `innercol <=> > null` to `isnull(innercol)` and then rewriteDomainJoins does not recognize > this and throws error Unable to rewrite domain join with conditions: > ArrayBuffer(isnull(innercol#280)) because the isnull is not an equality, so > it isn't usable for rewriting the domain join. > Can fix by recognizing `isnull(x)` as `x <=> null` in rewriteDomainJoins. > This area is also fragile in general and other rewrites that run between the > two steps of decorrelation could potentially break their assumptions, so we > may want to investigate longer-term follow ups for that.
[jira] [Updated] (SPARK-43596) Subquery decorrelation rewriteDomainJoins failure from ConstantFolding to isnull
[ https://issues.apache.org/jira/browse/SPARK-43596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jack Chen updated SPARK-43596: -- Description: We can get a decorrelation error because of rewrites that run in between DecorrelateInnerQuery and rewriteDomainJoins, that modify the correlation join conditions. In particular, ConstantFolding can transform `innercol <=> null` to `isnull(innercol)` and then rewriteDomainJoins does not recognize this and throws error Unable to rewrite domain join with conditions: ArrayBuffer(isnull(innercol#280)) because the isnull is not an equality, so it isn't usable for rewriting the domain join. Can fix by recognizing `isnull(x)` as `x <=> null` in rewriteDomainJoins. This area is also fragile in general and other rewrites that run between the two steps of decorrelation could potentially break their assumptions, so we may want to investigate longer-term follow ups for that. was: We can get a decorrelation error because of rewrites that run in between DecorrelateInnerQuery and rewriteDomainJoins, that modify the correlation join conditions. In particular, ConstantFolding can transform `innercol <=> null` to `isnull(innercol)` and then rewriteDomainJoins does not recognize this and throws error Unable to rewrite domain join with conditions: ArrayBuffer(isnull(innercol#280)) because the isnull is not an equality, so it isn't usable for rewriting the domain join. Can fix by recognizing isnull(x) as x <=> null in rewriteDomainJoins.
This area is also fragile in general and other rewrites that run between the two steps of decorrelation could potentially break their assumptions, so we may want to investigate longer-term follow ups for that. > Subquery decorrelation rewriteDomainJoins failure from ConstantFolding to > isnull > > > Key: SPARK-43596 > URL: https://issues.apache.org/jira/browse/SPARK-43596 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Jack Chen >Priority: Major > > We can get a decorrelation error because of rewrites that run in between > DecorrelateInnerQuery and rewriteDomainJoins, that modify the correlation > join conditions. In particular, ConstantFolding can transform `innercol <=> > null` to `isnull(innercol)` and then rewriteDomainJoins does not recognize > this and throws error Unable to rewrite domain join with conditions: > ArrayBuffer(isnull(innercol#280)) because the isnull is not an equality, so > it isn't usable for rewriting the domain join. > Can fix by recognizing `isnull(x)` as `x <=> null` in rewriteDomainJoins. > This area is also fragile in general and other rewrites that run between the > two steps of decorrelation could potentially break their assumptions, so we > may want to investigate longer-term follow ups for that.
[jira] [Created] (SPARK-43596) Subquery decorrelation rewriteDomainJoins failure from ConstantFolding to isnull
Jack Chen created SPARK-43596: - Summary: Subquery decorrelation rewriteDomainJoins failure from ConstantFolding to isnull Key: SPARK-43596 URL: https://issues.apache.org/jira/browse/SPARK-43596 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.4.0 Reporter: Jack Chen We can get a decorrelation error because of rewrites that run in between DecorrelateInnerQuery and rewriteDomainJoins, that modify the correlation join conditions. In particular, ConstantFolding can transform `innercol <=> null` to `isnull(innercol)` and then rewriteDomainJoins does not recognize this and throws error Unable to rewrite domain join with conditions: ArrayBuffer(isnull(innercol#280)) because the isnull is not an equality, so it isn't usable for rewriting the domain join. Can fix by recognizing isnull(x) as x <=> null in rewriteDomainJoins. This area is also fragile in general and other rewrites that run between the two steps of decorrelation could potentially break their assumptions, so we may want to investigate longer-term follow ups for that.
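The proposed fix, recognizing isnull(x) as x <=> null, amounts to a small pattern-match rewrite over the condition tree. A sketch over a toy expression tree (hypothetical class names, not Spark's actual Catalyst API):

```python
from dataclasses import dataclass

# Tiny expression tree standing in for Catalyst expressions.
@dataclass
class Attr:
    name: str

@dataclass
class Lit:
    value: object

@dataclass
class IsNull:
    child: object

@dataclass
class NullSafeEq:  # SQL's <=> (null-safe equality)
    left: object
    right: object

def normalize(cond):
    """Recognize isnull(x) as the equivalent x <=> null, so that the
    domain-join rewrite sees an equality it can use."""
    if isinstance(cond, IsNull):
        return NullSafeEq(cond.child, Lit(None))
    return cond  # other conditions pass through unchanged

print(normalize(IsNull(Attr("innercol"))))
# NullSafeEq(left=Attr(name='innercol'), right=Lit(value=None))
```

The equivalence holds because `isnull(x)` is true exactly when `x <=> null` is true, so the rewrite is semantics-preserving.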
[jira] [Updated] (SPARK-43596) Subquery decorrelation rewriteDomainJoins failure from ConstantFolding to isnull
[ https://issues.apache.org/jira/browse/SPARK-43596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jack Chen updated SPARK-43596: -- Description: We can get a decorrelation error because of rewrites that run in between DecorrelateInnerQuery and rewriteDomainJoins, that modify the correlation join conditions. In particular, ConstantFolding can transform `innercol <=> null` to `isnull(innercol)` and then rewriteDomainJoins does not recognize this and throws error Unable to rewrite domain join with conditions: ArrayBuffer(isnull(innercol#280)) because the isnull is not an equality, so it isn't usable for rewriting the domain join. Can fix by recognizing `isnull(x)` as `x <=> null` in rewriteDomainJoins. This area is also fragile in general and other rewrites that run between the two steps of decorrelation could potentially break their assumptions, so we may want to investigate longer-term follow ups for that. was: We can get a decorrelation error because of rewrites that run in between DecorrelateInnerQuery and rewriteDomainJoins, that modify the correlation join conditions. In particular, ConstantFolding can transform `innercol <=> null` to `isnull(innercol)` and then rewriteDomainJoins does not recognize this and throws error Unable to rewrite domain join with conditions: ArrayBuffer(isnull(innercol#280)) because the isnull is not an equality, so it isn't usable for rewriting the domain join. Can fix by recognizing `isnull(x)` as `x <=> null` in rewriteDomainJoins.
This area is also fragile in general and other rewrites that run between the two steps of decorrelation could potentially break their assumptions, so we may want to investigate longer-term follow ups for that. > Subquery decorrelation rewriteDomainJoins failure from ConstantFolding to > isnull > > > Key: SPARK-43596 > URL: https://issues.apache.org/jira/browse/SPARK-43596 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Jack Chen >Priority: Major > > We can get a decorrelation error because of rewrites that run in between > DecorrelateInnerQuery and rewriteDomainJoins, that modify the correlation > join conditions. In particular, ConstantFolding can transform `innercol <=> > null` to `isnull(innercol)` and then rewriteDomainJoins does not recognize > this and throws error Unable to rewrite domain join with conditions: > ArrayBuffer(isnull(innercol#280)) because the isnull is not an equality, so > it isn't usable for rewriting the domain join. > Can fix by recognizing `isnull(x)` as `x <=> null` in rewriteDomainJoins. > This area is also fragile in general and other rewrites that run between the > two steps of decorrelation could potentially break their assumptions, so we > may want to investigate longer-term follow ups for that.
[jira] [Updated] (SPARK-43413) IN subquery ListQuery has wrong nullability
[ https://issues.apache.org/jira/browse/SPARK-43413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jack Chen updated SPARK-43413: -- Description: IN subquery expressions are incorrectly marked as non-nullable, even when they are actually nullable. They correctly check the nullability of the left-hand-side, but the right-hand-side of a IN subquery, the ListQuery, is currently defined with nullability = false always. This is incorrect and can lead to incorrect query transformations. Example: (non_nullable_col IN (select nullable_col)) <=> TRUE . Here the IN expression returns NULL when the nullable_col is null, but our code marks it as non-nullable, and therefore SimplifyBinaryComparison transforms away the <=> TRUE, transforming the expression to non_nullable_col IN (select nullable_col) , which is an incorrect transformation because NULL values of nullable_col now cause the expression to yield NULL instead of FALSE. This bug can potentially lead to wrong results, but in most cases this doesn't directly cause wrong results end-to-end, because IN subqueries are almost always transformed to semi/anti/existence joins in RewritePredicateSubquery, and this rewrite can also incorrectly discard NULLs, which is another bug. But we can observe it causing wrong behavior in unit tests, and it could easily lead to incorrect query results if there are changes to the surrounding context, so it should be fixed regardless. This is a long-standing bug that has existed at least since 2016, as long as the ListQuery class has existed. was: IN subquery expressions are incorrectly always marked as non-nullable, even when they are actually nullable. They correctly check the nullability of the left-hand-side, but the right-hand-side of a IN subquery, the ListQuery, is currently defined with nullability = false always. This is incorrect and can lead to incorrect query transformations. Example: (non_nullable_col IN (select nullable_col)) <=> TRUE . 
Here the IN expression returns NULL when the nullable_col is null, but our code marks it as non-nullable, and therefore SimplifyBinaryComparison transforms away the <=> TRUE, transforming the expression to non_nullable_col IN (select nullable_col) , which is an incorrect transformation because NULL values of nullable_col now cause the expression to yield NULL instead of FALSE. This bug can potentially lead to wrong results, but in most cases this doesn't directly cause wrong results end-to-end, because IN subqueries are almost always transformed to semi/anti/existence joins in RewritePredicateSubquery, and this rewrite can also incorrectly discard NULLs, which is another bug. But we can observe it causing wrong behavior in unit tests, and it could easily lead to incorrect query results if there are changes to the surrounding context, so it should be fixed regardless. This is a long-standing bug that has existed at least since 2016, as long as the ListQuery class has existed. > IN subquery ListQuery has wrong nullability > --- > > Key: SPARK-43413 > URL: https://issues.apache.org/jira/browse/SPARK-43413 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Jack Chen >Priority: Major > > IN subquery expressions are incorrectly marked as non-nullable, even when > they are actually nullable. They correctly check the nullability of the > left-hand-side, but the right-hand-side of a IN subquery, the ListQuery, is > currently defined with nullability = false always. This is incorrect and can > lead to incorrect query transformations. > Example: (non_nullable_col IN (select nullable_col)) <=> TRUE . 
Here the IN > expression returns NULL when the nullable_col is null, but our code marks it > as non-nullable, and therefore SimplifyBinaryComparison transforms away the > <=> TRUE, transforming the expression to non_nullable_col IN (select > nullable_col) , which is an incorrect transformation because NULL values of > nullable_col now cause the expression to yield NULL instead of FALSE. > This bug can potentially lead to wrong results, but in most cases this > doesn't directly cause wrong results end-to-end, because IN subqueries are > almost always transformed to semi/anti/existence joins in > RewritePredicateSubquery, and this rewrite can also incorrectly discard > NULLs, which is another bug. But we can observe it causing wrong behavior in > unit tests, and it could easily lead to incorrect query results if there are > changes to the surrounding context, so it should be fixed regardless. > This is a long-standing bug that has existed at least since 2016, as long as > the ListQuery class has existed.
[jira] [Updated] (SPARK-43413) IN subquery ListQuery has wrong nullability
[ https://issues.apache.org/jira/browse/SPARK-43413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jack Chen updated SPARK-43413: -- Description: IN subquery expressions are incorrectly always marked as non-nullable, even when they are actually nullable. They correctly check the nullability of the left-hand-side, but the right-hand-side of a IN subquery, the ListQuery, is currently defined with nullability = false always. This is incorrect and can lead to incorrect query transformations. Example: (non_nullable_col IN (select nullable_col)) <=> TRUE . Here the IN expression returns NULL when the nullable_col is null, but our code marks it as non-nullable, and therefore SimplifyBinaryComparison transforms away the <=> TRUE, transforming the expression to non_nullable_col IN (select nullable_col) , which is an incorrect transformation because NULL values of nullable_col now cause the expression to yield NULL instead of FALSE. This bug can potentially lead to wrong results, but in most cases this doesn't directly cause wrong results end-to-end, because IN subqueries are almost always transformed to semi/anti/existence joins in RewritePredicateSubquery, and this rewrite can also incorrectly discard NULLs, which is another bug. But we can observe it causing wrong behavior in unit tests, and it could easily lead to incorrect query results if there are changes to the surrounding context, so it should be fixed regardless. This is a long-standing bug that has existed at least since 2016, as long as the ListQuery class has existed. was: IN subquery expressions currently are marked as nullable if and only if the left-hand-side is nullable - because the right-hand-side of a IN subquery, the ListQuery, is currently defined with nullability = false always. This is incorrect and can lead to incorrect query transformations. Example: (non_nullable_col IN (select nullable_col)) <=> TRUE . 
Here the IN expression returns NULL when the nullable_col is null, but our code marks it as non-nullable, and therefore SimplifyBinaryComparison transforms away the <=> TRUE, transforming the expression to non_nullable_col IN (select nullable_col) , which is an incorrect transformation because NULL values of nullable_col now cause the expression to yield NULL instead of FALSE. This bug can potentially lead to wrong results, but in most cases this doesn't directly cause wrong results end-to-end, because IN subqueries are almost always transformed to semi/anti/existence joins in RewritePredicateSubquery, and this rewrite can also incorrectly discard NULLs, which is another bug. But we can observe it causing wrong behavior in unit tests, and it could easily lead to incorrect query results if there are changes to the surrounding context, so it should be fixed regardless. This is a long-standing bug that has existed at least since 2016, as long as the ListQuery class has existed. > IN subquery ListQuery has wrong nullability > --- > > Key: SPARK-43413 > URL: https://issues.apache.org/jira/browse/SPARK-43413 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Jack Chen >Priority: Major > > IN subquery expressions are incorrectly always marked as non-nullable, even > when they are actually nullable. They correctly check the nullability of the > left-hand-side, but the right-hand-side of a IN subquery, the ListQuery, is > currently defined with nullability = false always. This is incorrect and can > lead to incorrect query transformations. > Example: (non_nullable_col IN (select nullable_col)) <=> TRUE . 
Here the IN > expression returns NULL when the nullable_col is null, but our code marks it > as non-nullable, and therefore SimplifyBinaryComparison transforms away the > <=> TRUE, transforming the expression to non_nullable_col IN (select > nullable_col) , which is an incorrect transformation because NULL values of > nullable_col now cause the expression to yield NULL instead of FALSE. > This bug can potentially lead to wrong results, but in most cases this > doesn't directly cause wrong results end-to-end, because IN subqueries are > almost always transformed to semi/anti/existence joins in > RewritePredicateSubquery, and this rewrite can also incorrectly discard > NULLs, which is another bug. But we can observe it causing wrong behavior in > unit tests, and it could easily lead to incorrect query results if there are > changes to the surrounding context, so it should be fixed regardless. > This is a long-standing bug that has existed at least since 2016, as long as > the ListQuery class has existed.
[jira] [Updated] (SPARK-43413) IN subquery ListQuery has wrong nullability
[ https://issues.apache.org/jira/browse/SPARK-43413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jack Chen updated SPARK-43413:
--
Description:
IN subquery expressions are currently marked as nullable if and only if the left-hand side is nullable, because the right-hand side of an IN subquery, the ListQuery, is always defined with nullability = false. This is incorrect and can lead to incorrect query transformations.
Example: (non_nullable_col IN (select nullable_col)) <=> TRUE. Here the IN expression returns NULL when nullable_col is null, but our code marks it as non-nullable, and therefore SimplifyBinaryComparison transforms away the <=> TRUE, rewriting the expression to non_nullable_col IN (select nullable_col), which incorrectly causes NULL values of nullable_col to yield NULL instead of FALSE.
This bug can potentially lead to wrong results, but in most cases it doesn't directly cause wrong results end-to-end, because IN subqueries are almost always transformed to semi/anti/existence joins in RewritePredicateSubquery, and that rewrite can also incorrectly discard NULLs, which is a separate bug. We can, however, observe it causing wrong behavior in unit tests, and it could easily lead to incorrect query results if the surrounding context changes, so it should be fixed regardless.
This is a long-standing bug that has existed at least since 2016, for as long as the ListQuery class has existed.

was:
IN subquery expressions are currently marked as nullable if and only if the left-hand side is nullable, because the right-hand side of an IN subquery, the ListQuery, is always defined with nullability = false. This is incorrect and can lead to incorrect query transformations.
Example: (non_nullable_col IN (select nullable_col)) <=> TRUE. Here the IN expression returns NULL when nullable_col is null, but our code marks it as non-nullable, and therefore SimplifyBinaryComparison transforms away the <=> TRUE, rewriting the expression to non_nullable_col IN (select nullable_col), which incorrectly causes NULL values of nullable_col to yield NULL instead of FALSE.
This is a long-standing bug that has existed at least since 2016, for as long as the ListQuery class has existed.

> IN subquery ListQuery has wrong nullability
> ---
>
> Key: SPARK-43413
> URL: https://issues.apache.org/jira/browse/SPARK-43413
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Jack Chen
> Priority: Major
[jira] [Created] (SPARK-43413) IN subquery ListQuery has wrong nullability
Jack Chen created SPARK-43413:
-
Summary: IN subquery ListQuery has wrong nullability
Key: SPARK-43413
URL: https://issues.apache.org/jira/browse/SPARK-43413
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.4.0
Reporter: Jack Chen

IN subquery expressions are currently marked as nullable if and only if the left-hand side is nullable, because the right-hand side of an IN subquery, the ListQuery, is always defined with nullability = false. This is incorrect and can lead to incorrect query transformations.
Example: (non_nullable_col IN (select nullable_col)) <=> TRUE. Here the IN expression returns NULL when nullable_col is null, but our code marks it as non-nullable, and therefore SimplifyBinaryComparison transforms away the <=> TRUE, rewriting the expression to non_nullable_col IN (select nullable_col), which incorrectly causes NULL values of nullable_col to yield NULL instead of FALSE.
This is a long-standing bug that has existed at least since 2016, for as long as the ListQuery class has existed.
[jira] [Commented] (SPARK-43156) Correctness COUNT bug in correlated scalar subselect with `COUNT(*) is null`
[ https://issues.apache.org/jira/browse/SPARK-43156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17716952#comment-17716952 ] Jack Chen commented on SPARK-43156:
---
My above pull request only added a related test. Hisoka-X's PR [https://github.com/apache/spark/pull/40865] is in progress with the fix.

> Correctness COUNT bug in correlated scalar subselect with `COUNT(*) is null`
> ---
>
> Key: SPARK-43156
> URL: https://issues.apache.org/jira/browse/SPARK-43156
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Jack Chen
> Assignee: Jack Chen
> Priority: Major
> Fix For: 3.5.0
>
> Example query:
> {code:java}
> spark.sql("select *, (select (count(1)) is null from t1 where t0.a = t1.c) from t0").collect()
> res6: Array[org.apache.spark.sql.Row] = Array([1,1.0,null], [2,2.0,false])
> {code}
> In this subquery, count(1) always evaluates to a non-null integer value, so count(1) is null is always false. The correct evaluation of the subquery is always false.
> We incorrectly evaluate it to null for empty groups. The reason is that NullPropagation rewrites Aggregate [c] [isnull(count(1))] to Aggregate [c] [false]. This rewrite would be correct normally, but in the context of a scalar subquery it breaks our count-bug handling in RewriteCorrelatedScalarSubquery.constructLeftJoins. By the time we get there, the query appears not to have the count bug: it looks the same as if the original query had a subquery with select any_value(false) from r..., and that case is _not_ subject to the count bug.
>
> Postgres comparison shows the correct always-false result: [http://sqlfiddle.com/#!17/67822/5]
> DDL for the example:
> {code:java}
> create or replace temp view t0 (a, b) as values (1, 1.0), (2, 2.0);
> create or replace temp view t1 (c, d) as values (2, 3.0);
> {code}
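The always-false expectation is easy to cross-check against another engine; a minimal reproduction of the Jira example using Python's bundled sqlite3, which evaluates the subquery correctly (count(1) over an empty input is 0, never NULL):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
    create table t0(a integer, b real);
    insert into t0 values (1, 1.0), (2, 2.0);
    create table t1(c integer, d real);
    insert into t1 values (2, 3.0);
""")

# For t0.a = 1 no t1 row matches, but count(1) over the empty input is
# still 0, so `(count(1)) is null` is false (0) for every outer row --
# the result Spark should produce instead of null for empty groups.
rows = cur.execute(
    "select a, b, (select (count(1)) is null from t1 where t0.a = t1.c)"
    " from t0 order by a").fetchall()
print(rows)  # [(1, 1.0, 0), (2, 2.0, 0)]
```

This matches the Postgres behavior cited in the issue: the third column is false for every row.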
[jira] [Reopened] (SPARK-43156) Correctness COUNT bug in correlated scalar subselect with `COUNT(*) is null`
[ https://issues.apache.org/jira/browse/SPARK-43156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jack Chen reopened SPARK-43156:
---

> Correctness COUNT bug in correlated scalar subselect with `COUNT(*) is null`
> ---
>
> Key: SPARK-43156
> URL: https://issues.apache.org/jira/browse/SPARK-43156
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Jack Chen
> Assignee: Jack Chen
> Priority: Major
> Fix For: 3.5.0
>
> Example query:
> {code:java}
> spark.sql("select *, (select (count(1)) is null from t1 where t0.a = t1.c) from t0").collect()
> res6: Array[org.apache.spark.sql.Row] = Array([1,1.0,null], [2,2.0,false])
> {code}
> In this subquery, count(1) always evaluates to a non-null integer value, so count(1) is null is always false. The correct evaluation of the subquery is always false.
> We incorrectly evaluate it to null for empty groups. The reason is that NullPropagation rewrites Aggregate [c] [isnull(count(1))] to Aggregate [c] [false]. This rewrite would be correct normally, but in the context of a scalar subquery it breaks our count-bug handling in RewriteCorrelatedScalarSubquery.constructLeftJoins. By the time we get there, the query appears not to have the count bug: it looks the same as if the original query had a subquery with select any_value(false) from r..., and that case is _not_ subject to the count bug.
>
> Postgres comparison shows the correct always-false result: [http://sqlfiddle.com/#!17/67822/5]
> DDL for the example:
> {code:java}
> create or replace temp view t0 (a, b) as values (1, 1.0), (2, 2.0);
> create or replace temp view t1 (c, d) as values (2, 3.0);
> {code}
[jira] [Updated] (SPARK-43156) Correctness COUNT bug in correlated scalar subselect with `COUNT(*) is null`
[ https://issues.apache.org/jira/browse/SPARK-43156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jack Chen updated SPARK-43156:
--
Description:
Example query:
{code:java}
spark.sql("select *, (select (count(1)) is null from t1 where t0.a = t1.c) from t0").collect()
res6: Array[org.apache.spark.sql.Row] = Array([1,1.0,null], [2,2.0,false])
{code}
In this subquery, count(1) always evaluates to a non-null integer value, so count(1) is null is always false. The correct evaluation of the subquery is always false.
We incorrectly evaluate it to null for empty groups. The reason is that NullPropagation rewrites Aggregate [c] [isnull(count(1))] to Aggregate [c] [false]. This rewrite would be correct normally, but in the context of a scalar subquery it breaks our count-bug handling in RewriteCorrelatedScalarSubquery.constructLeftJoins. By the time we get there, the query appears not to have the count bug: it looks the same as if the original query had a subquery with select any_value(false) from r..., and that case is _not_ subject to the count bug.
Postgres comparison shows the correct always-false result: [http://sqlfiddle.com/#!17/67822/5]
DDL for the example:
{code:java}
create or replace temp view t0 (a, b) as values (1, 1.0), (2, 2.0);
create or replace temp view t1 (c, d) as values (2, 3.0);
{code}

was:
Example query:
{code:java}
spark.sql("select *, (select (count(1)) is null from t1 where t0.a = t1.c) from t0").collect()
res6: Array[org.apache.spark.sql.Row] = Array([1,1.0,null], [2,2.0,false])
{code}
In this subquery, count(1) always evaluates to a non-null integer value, so count(1) is null is always false. The correct evaluation of the subquery is always false.
We incorrectly evaluate it to null for empty groups. The reason is that NullPropagation rewrites Aggregate [c] [isnull(count(1))] to Aggregate [c] [false]. This rewrite would be correct normally, but in the context of a scalar subquery it breaks our count-bug handling in RewriteCorrelatedScalarSubquery.constructLeftJoins. By the time we get there, the query appears not to have the count bug: it looks the same as if the original query had a subquery with select any_value(false) from r..., and that case is _not_ subject to the count bug.
Postgres comparison shows the correct always-false result: [http://sqlfiddle.com/#!17/67822/5]
DDL:
{code:java}
create or replace temp view t0 (a, b) as values (1, 1.0), (2, 2.0);
create or replace temp view t1 (c, d) as values (2, 3.0);
{code}

> Correctness COUNT bug in correlated scalar subselect with `COUNT(*) is null`
> ---
>
> Key: SPARK-43156
> URL: https://issues.apache.org/jira/browse/SPARK-43156
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Jack Chen
> Priority: Major
[jira] [Updated] (SPARK-43156) Correctness COUNT bug in correlated scalar subselect with `COUNT(*) is null`
[ https://issues.apache.org/jira/browse/SPARK-43156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jack Chen updated SPARK-43156:
--
Description:
Example query:
{code:java}
spark.sql("select *, (select (count(1)) is null from t1 where t0.a = t1.c) from t0").collect()
res6: Array[org.apache.spark.sql.Row] = Array([1,1.0,null], [2,2.0,false])
{code}
In this subquery, count(1) always evaluates to a non-null integer value, so count(1) is null is always false. The correct evaluation of the subquery is always false.
We incorrectly evaluate it to null for empty groups. The reason is that NullPropagation rewrites Aggregate [c] [isnull(count(1))] to Aggregate [c] [false]. This rewrite would be correct normally, but in the context of a scalar subquery it breaks our count-bug handling in RewriteCorrelatedScalarSubquery.constructLeftJoins. By the time we get there, the query appears not to have the count bug: it looks the same as if the original query had a subquery with select any_value(false) from r..., and that case is _not_ subject to the count bug.
Postgres comparison shows the correct always-false result: [http://sqlfiddle.com/#!17/67822/5]
DDL:
{code:java}
create or replace temp view t0 (a, b) as values (1, 1.0), (2, 2.0);
create or replace temp view t1 (c, d) as values (2, 3.0);
{code}

was:
Example query:
{code:java}
spark.sql("select *, (select (count(*)) is null from t1 where t0.a = t1.c) from t0").collect()
res6: Array[org.apache.spark.sql.Row] = Array([1,1.0,null], [2,2.0,false])
{code}
In this subquery, count(*) always evaluates to a non-null integer value, so count(*) is null is always false. The correct evaluation of the subquery is always false.
We incorrectly evaluate it to null for empty groups. The reason is that NullPropagation rewrites Aggregate [c] [isnull(count(1))] to Aggregate [c] [false]. This rewrite would be correct normally, but in the context of a scalar subquery it breaks our count-bug handling in RewriteCorrelatedScalarSubquery.constructLeftJoins. By the time we get there, the query appears not to have the count bug: it looks the same as if the original query had a subquery with select any_value(false) from r..., and that case is _not_ subject to the count bug.
Postgres comparison shows the correct always-false result: [http://sqlfiddle.com/#!17/67822/5]
DDL:
{code:java}
create or replace temp view t0 (a, b) as values (1, 1.0), (2, 2.0);
create or replace temp view t1 (c, d) as values (2, 3.0);
{code}

> Correctness COUNT bug in correlated scalar subselect with `COUNT(*) is null`
> ---
>
> Key: SPARK-43156
> URL: https://issues.apache.org/jira/browse/SPARK-43156
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Jack Chen
> Priority: Major
[jira] [Created] (SPARK-43156) Correctness COUNT bug in correlated scalar subselect with `COUNT(*) is null`
Jack Chen created SPARK-43156:
-
Summary: Correctness COUNT bug in correlated scalar subselect with `COUNT(*) is null`
Key: SPARK-43156
URL: https://issues.apache.org/jira/browse/SPARK-43156
Project: Spark
Issue Type: Sub-task
Components: SQL
Affects Versions: 3.4.0
Reporter: Jack Chen

Example query:
{code:java}
spark.sql("select *, (select (count(*)) is null from t1 where t0.a = t1.c) from t0").collect()
res6: Array[org.apache.spark.sql.Row] = Array([1,1.0,null], [2,2.0,false])
{code}
In this subquery, count(*) always evaluates to a non-null integer value, so count(*) is null is always false. The correct evaluation of the subquery is always false.
We incorrectly evaluate it to null for empty groups. The reason is that NullPropagation rewrites Aggregate [c] [isnull(count(1))] to Aggregate [c] [false]. This rewrite would be correct normally, but in the context of a scalar subquery it breaks our count-bug handling in RewriteCorrelatedScalarSubquery.constructLeftJoins. By the time we get there, the query appears not to have the count bug: it looks the same as if the original query had a subquery with select any_value(false) from r..., and that case is _not_ subject to the count bug.
Postgres comparison shows the correct always-false result: [http://sqlfiddle.com/#!17/67822/5]
DDL:
{code:java}
create or replace temp view t0 (a, b) as values (1, 1.0), (2, 2.0);
create or replace temp view t1 (c, d) as values (2, 3.0);
{code}
[jira] [Created] (SPARK-43098) Should not handle the COUNT bug when the GROUP BY clause of a correlated scalar subquery is non-empty
Jack Chen created SPARK-43098:
-
Summary: Should not handle the COUNT bug when the GROUP BY clause of a correlated scalar subquery is non-empty
Key: SPARK-43098
URL: https://issues.apache.org/jira/browse/SPARK-43098
Project: Spark
Issue Type: Sub-task
Components: SQL
Affects Versions: 3.2.0
Reporter: Jack Chen

From [~allisonwang-db]: There is no COUNT bug when the correlated equality predicates are also in the GROUP BY clause. However, the current logic to handle the COUNT bug still adds a default aggregate function value and returns incorrect results.
{code:java}
create view t1(c1, c2) as values (0, 1), (1, 2);
create view t2(c1, c2) as values (0, 2), (0, 3);
select c1, c2, (select count(*) from t2 where t1.c1 = t2.c1 group by c1) from t1;
-- Correct answer: [(0, 1, 2), (1, 2, null)]
+---+---+------------------+
|c1 |c2 |scalarsubquery(c1)|
+---+---+------------------+
|0  |1  |2                 |
|1  |2  |0                 |
+---+---+------------------+
{code}
This bug affects scalar subqueries in RewriteCorrelatedScalarSubquery, but lateral subqueries handle it correctly in DecorrelateInnerQuery.
Related: https://issues.apache.org/jira/browse/SPARK-36113
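The expected answer can be cross-checked outside Spark; a small sketch with Python's bundled sqlite3, which handles this pattern correctly (an empty correlated input produces no groups, so the scalar subquery yields NULL rather than a defaulted 0). The `group by t2.c1` qualification is added here only to avoid name ambiguity:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
    create table t1(c1 integer, c2 integer);
    insert into t1 values (0, 1), (1, 2);
    create table t2(c1 integer, c2 integer);
    insert into t2 values (0, 2), (0, 3);
""")

# With a non-empty GROUP BY, no matching rows means no group and hence
# no subquery row: t1.c1 = 1 correctly gets NULL, not a defaulted 0.
rows = cur.execute(
    "select c1, c2,"
    " (select count(*) from t2 where t1.c1 = t2.c1 group by t2.c1)"
    " from t1 order by c1").fetchall()
print(rows)  # [(0, 1, 2), (1, 2, None)]
```

This is the `[(0, 1, 2), (1, 2, null)]` result the issue identifies as correct.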
[jira] [Created] (SPARK-2901) it's a test
Jack chen created SPARK-2901:
Summary: it's a test
Key: SPARK-2901
URL: https://issues.apache.org/jira/browse/SPARK-2901
Project: Spark
Issue Type: Bug
Reporter: Jack chen
Priority: Trivial
-- This message was sent by Atlassian JIRA (v6.2#6252)