[jira] [Created] (SPARK-48557) Support scalar subquery with group-by on column equal to constant

2024-06-06 Thread Jack Chen (Jira)
Jack Chen created SPARK-48557:
-

 Summary: Support scalar subquery with group-by on column equal to 
constant
 Key: SPARK-48557
 URL: https://issues.apache.org/jira/browse/SPARK-48557
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Jack Chen


For example, this query is currently unsupported but could be allowed, because 
the filter y2 = 1 guarantees that the group-by produces at most one group, and 
hence at most one row.
{code:java}
select *, (select count(*) from y where x1 = y1 and y2 = 1 group by y2) from x; 
{code}
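For a runnable repro, a minimal setup could look like this (a sketch; the 
{{x}}/{{y}} schemas are assumed to match the ones used in SPARK-48503):
{code:sql}
-- Hypothetical schemas, copied from the SPARK-48503 repro.
create table x(x1 int, x2 int);
insert into x values (1, 1);
create table y(y1 int, y2 int);
insert into y values (1, 1), (2, 2);

-- The group-by key y2 is pinned to the constant 1, so the subquery can
-- produce at most one group, and therefore at most one row per outer row.
select *, (select count(*) from y where x1 = y1 and y2 = 1 group by y2) from x;
{code}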
 






[jira] [Updated] (SPARK-48503) Scalar subquery with group-by and non-equality predicate incorrectly allowed, wrong results

2024-06-02 Thread Jack Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jack Chen updated SPARK-48503:
--
Parent: SPARK-35553
Issue Type: Sub-task  (was: Bug)

> Scalar subquery with group-by and non-equality predicate incorrectly allowed, 
> wrong results
> ---
>
> Key: SPARK-48503
> URL: https://issues.apache.org/jira/browse/SPARK-48503
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jack Chen
>Priority: Major
>
> This query is not legal and should give an error, but instead we incorrectly 
> allow it and it returns wrong results.
> {code:java}
> create table x(x1 int, x2 int);
> insert into x values (1, 1);
> create table y(y1 int, y2 int);
> insert into y values (2, 2), (3, 3);
> select *, (select count(*) from y where y1 > x1 group by y1) from x; {code}
> It returns two rows, even though there's only one row in x.
> The correct result is an error: more than one row returned by a subquery used 
> as an expression (as seen in Postgres, for example).
>  
> This is a longstanding bug. The bug is in CheckAnalysis, in 
> {{checkAggregateInScalarSubquery}}. It allows grouping columns that are 
> present in correlation predicates, but doesn’t check whether those predicates 
> are equalities, because when that code was written, non-equality 
> correlation wasn’t allowed. So this bug appears to have existed since 
> non-equality correlation was added (~2 years ago).
>  
> Various other expressions that are not equi-joins between the inner and outer 
> fields hit this too, e.g. `where y1 + y2 = x1 group by y1`.
> Another affected case is when the correlation condition is an equality but it 
> sits under another operator such as an OUTER JOIN or UNION.






[jira] [Updated] (SPARK-48501) Loosen `correlated scalar subqueries must be aggregated` error by doing runtime check for scalar subqueries output rowcount

2024-06-02 Thread Jack Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jack Chen updated SPARK-48501:
--
Description: 
Currently, if a scalar subquery’s result isn’t aggregated or limited to one 
row, we throw the error {{Correlated scalar subqueries must be aggregated}}.

This check is often too restrictive: there are many cases where the query 
could actually run safely even though we can’t prove it statically, e.g. 
unique keys or functional dependencies might ensure that there's only one row.

To handle these cases, it’s better to do the check at runtime instead of 
statically. This could be implemented as a special aggregate operator that 
throws an exception when its input has two or more rows, a “single join” 
operator that throws an exception when two or more rows match, or something 
similar.

There are also cases where we previously allowed invalid queries that 
returned wrong results and should have been rejected (e.g. SPARK-48503, 
SPARK-18504). Doing the check at runtime would help avoid those bugs.

Current workarounds: users can add an aggregate like {{any_value()}} or 
{{first()}} to the output of the subquery, or add {{limit 1}}.
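For illustration, a sketch of both workarounds on a generic correlated scalar 
subquery (the {{t1}}/{{t2}} tables and columns here are hypothetical):
{code:sql}
-- Without a workaround this errors with
-- "Correlated scalar subqueries must be aggregated":
--   select *, (select t2.v from t2 where t2.k = t1.k) from t1;

-- Workaround 1: wrap the output in an aggregate that picks one value.
select *, (select any_value(t2.v) from t2 where t2.k = t1.k) from t1;

-- Workaround 2: limit the subquery to a single row.
select *, (select t2.v from t2 where t2.k = t1.k limit 1) from t1;
{code}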

  was:
Currently, if a scalar subquery’s result isn’t aggregated or limited to one 
row, we throw the error {{Correlated scalar subqueries must be aggregated}}.

This check is often too restrictive: there are many cases where the query 
could actually run safely even though we can’t prove it statically, e.g. 
unique keys or functional dependencies might ensure that there's only one row.

To handle these cases, it’s better to do the check at runtime instead of 
statically. This could be implemented as a special aggregate operator that 
throws an exception when its input has two or more rows, a “single join” 
operator that throws an exception when two or more rows match, or something 
similar.

There are also cases where we previously allowed invalid queries that 
returned wrong results and should have been rejected. Doing the check at 
runtime would help avoid those bugs.

Current workarounds: users can add an aggregate like {{any_value()}} or 
{{first()}} to the output of the subquery, or add {{limit 1}}.


> Loosen `correlated scalar subqueries must be aggregated` error by doing 
> runtime check for scalar subqueries output rowcount
> ---
>
> Key: SPARK-48501
> URL: https://issues.apache.org/jira/browse/SPARK-48501
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jack Chen
>Priority: Major
>
> Currently, if a scalar subquery’s result isn’t aggregated or limited to one 
> row, we throw the error {{Correlated scalar subqueries must be aggregated}}.
> This check is often too restrictive: there are many cases where the query 
> could actually run safely even though we can’t prove it statically, e.g. 
> unique keys or functional dependencies might ensure that there's only one 
> row.
> To handle these cases, it’s better to do the check at runtime instead of 
> statically. This could be implemented as a special aggregate operator that 
> throws an exception when its input has two or more rows, a “single join” 
> operator that throws an exception when two or more rows match, or something 
> similar.
> There are also cases where we previously allowed invalid queries that 
> returned wrong results and should have been rejected (e.g. SPARK-48503, 
> SPARK-18504). Doing the check at runtime would help avoid those bugs.
> Current workarounds: users can add an aggregate like {{any_value()}} or 
> {{first()}} to the output of the subquery, or add {{limit 1}}.






[jira] [Created] (SPARK-48503) Scalar subquery with group-by and non-equality predicate incorrectly allowed, wrong results

2024-06-02 Thread Jack Chen (Jira)
Jack Chen created SPARK-48503:
-

 Summary: Scalar subquery with group-by and non-equality predicate 
incorrectly allowed, wrong results
 Key: SPARK-48503
 URL: https://issues.apache.org/jira/browse/SPARK-48503
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 4.0.0
Reporter: Jack Chen


This query is not legal and should give an error, but instead we incorrectly 
allow it and it returns wrong results.
{code:java}
create table x(x1 int, x2 int);
insert into x values (1, 1);
create table y(y1 int, y2 int);
insert into y values (2, 2), (3, 3);

select *, (select count(*) from y where y1 > x1 group by y1) from x; {code}
It returns two rows, even though there's only one row in x.

The correct result is an error: more than one row returned by a subquery used 
as an expression (as seen in Postgres, for example).

This is a longstanding bug. The bug is in CheckAnalysis, in 
{{checkAggregateInScalarSubquery}}. It allows grouping columns that are 
present in correlation predicates, but doesn’t check whether those predicates 
are equalities, because when that code was written, non-equality correlation 
wasn’t allowed. So this bug appears to have existed since non-equality 
correlation was added (~2 years ago).

Various other expressions that are not equi-joins between the inner and outer 
fields hit this too, e.g. `where y1 + y2 = x1 group by y1`.

Another affected case is when the correlation condition is an equality but it 
sits under another operator such as an OUTER JOIN or UNION; both shapes are 
sketched below.
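For illustration, hedged sketches of those two shapes (reusing the x/y tables 
above; these exact queries are illustrative, not from the ticket):
{code:sql}
-- Non-equality correlation: y1 + y2 = x1 does not pin down the group-by key,
-- so the subquery can still return multiple rows per outer row.
select *, (select count(*) from y where y1 + y2 = x1 group by y1) from x;

-- Equality correlation hidden under a UNION: the filter y1 = x1 only
-- constrains one branch, so grouping by y1 is again not single-valued.
select *, (select count(*)
           from (select y1 from y where y1 = x1
                 union all
                 select y1 from y) u
           group by u.y1) from x;
{code}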






[jira] [Created] (SPARK-48501) Loosen `correlated scalar subqueries must be aggregated` error by doing runtime check for scalar subqueries output rowcount

2024-06-02 Thread Jack Chen (Jira)
Jack Chen created SPARK-48501:
-

 Summary: Loosen `correlated scalar subqueries must be aggregated` 
error by doing runtime check for scalar subqueries output rowcount
 Key: SPARK-48501
 URL: https://issues.apache.org/jira/browse/SPARK-48501
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Jack Chen


Currently, if a scalar subquery’s result isn’t aggregated or limited to one 
row, we throw the error {{Correlated scalar subqueries must be aggregated}}.

This check is often too restrictive: there are many cases where the query 
could actually run safely even though we can’t prove it statically, e.g. 
unique keys or functional dependencies might ensure that there's only one row.

To handle these cases, it’s better to do the check at runtime instead of 
statically. This could be implemented as a special aggregate operator that 
throws an exception when its input has two or more rows, a “single join” 
operator that throws an exception when two or more rows match, or something 
similar.

There are also cases where we previously allowed invalid queries that 
returned wrong results and should have been rejected. Doing the check at 
runtime would help avoid those bugs.

Current workarounds: users can add an aggregate like {{any_value()}} or 
{{first()}} to the output of the subquery, or add {{limit 1}}.






[jira] [Resolved] (SPARK-36115) Handle the COUNT bug for correlated IN/EXISTS subquery

2024-06-02 Thread Jack Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jack Chen resolved SPARK-36115.
---
Resolution: Fixed

This was fixed by https://github.com/apache/spark/pull/43111

> Handle the COUNT bug for correlated IN/EXISTS subquery
> --
>
> Key: SPARK-36115
> URL: https://issues.apache.org/jira/browse/SPARK-36115
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Allison Wang
>Priority: Major
>
> Correlated IN/EXISTS subqueries are also subject to the COUNT bug, which is 
> not handled.
> {code:sql}
> create view t1(c1, c2) as values (0, 1), (1, 2)
> create view t2(c1, c2) as values (0, 2), (0, 3)
> -- Example 1: IN subquery
> select * from t1 where c1 in (select count(*) + 1 from t2 where t1.c1 = t2.c1)
> -- Correct answer: (1, 2)
> +---+---+
> |c1 |c2 |
> +---+---+
> +---+---+
> -- Example 2: EXISTS subquery
> select * from t1 where exists (select count(*) from t2 where t1.c1 = t2.c1)
> -- Correct answer: [(0, 1), (1, 2)]
> +---+---+
> |c1 |c2 |
> +---+---+
> |0  |1  |
> +---+---+
> {code}






[jira] [Updated] (SPARK-47716) SQLQueryTestSuite flaky case due to view name conflict

2024-04-03 Thread Jack Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jack Chen updated SPARK-47716:
--
Description: 
In SQLQueryTestSuite, the test case "Test logic for determining whether a query 
is semantically sorted" can sometimes fail with an error

{{Cannot create table or view `main`.`default`.`t1` because it already exists.}}

if run concurrently with other SQL test cases that also create tables with 
the same name.

  was:
In SQLQueryTestSuite, the test case "Test logic for determining whether a query 
is semantically sorted" sometimes has a flaky failure with an error

{{Cannot create table or view `main`.`default`.`t1` because it already exists.}}

This seems to be caused by other SQL test cases that also create tables with 
the same name.


> SQLQueryTestSuite flaky case due to view name conflict
> --
>
> Key: SPARK-47716
> URL: https://issues.apache.org/jira/browse/SPARK-47716
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jack Chen
>Priority: Major
>  Labels: pull-request-available
>
> In SQLQueryTestSuite, the test case "Test logic for determining whether a 
> query is semantically sorted" can sometimes fail with an error
> {{Cannot create table or view `main`.`default`.`t1` because it already 
> exists.}}
> if run concurrently with other SQL test cases that also create tables with 
> the same name.






[jira] [Created] (SPARK-47716) SQLQueryTestSuite flaky case due to view name conflict

2024-04-03 Thread Jack Chen (Jira)
Jack Chen created SPARK-47716:
-

 Summary: SQLQueryTestSuite flaky case due to view name conflict
 Key: SPARK-47716
 URL: https://issues.apache.org/jira/browse/SPARK-47716
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Jack Chen


In SQLQueryTestSuite, the test case "Test logic for determining whether a query 
is semantically sorted" sometimes has a flaky failure with an error

{{Cannot create table or view `main`.`default`.`t1` because it already exists.}}

This seems to be caused by other SQL test cases that also create tables with 
the same name.






[jira] [Created] (SPARK-47700) Fix formatting of error messages with treeNode

2024-04-02 Thread Jack Chen (Jira)
Jack Chen created SPARK-47700:
-

 Summary: Fix formatting of error messages with treeNode
 Key: SPARK-47700
 URL: https://issues.apache.org/jira/browse/SPARK-47700
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Jack Chen


Example: we concatenate the plan onto the message without any separation, 
producing `in this locationFilter (dept_id#652`. We should add a colon and a 
space or newline in between:
{code:java}
[UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY.ACCESSING_OUTER_QUERY_COLUMN_IS_NOT_ALLOWED] Unsupported subquery expression: Accessing outer query column is not allowed in this locationFilter (dept_id#652 = outer(dept_id#646))
+- SubqueryAlias dept
   +- View (`DEPT`, [dept_id#652, dept_name#653, state#654])
      +- Project [cast(dept_id#655 as int) AS dept_id#652, cast(dept_name#656 as string) AS dept_name#653, cast(state#657 as string) AS state#654]
         +- Project [dept_id#655, dept_name#656, state#657]
            +- SubqueryAlias DEPT
               +- LocalRelation [dept_id#655, dept_name#656, state#657]
. SQLSTATE: 0A000 {code}
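For reference, the intended shape after the fix would be roughly the following 
(a sketch; the exact separator is not final):
{code:java}
[UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY.ACCESSING_OUTER_QUERY_COLUMN_IS_NOT_ALLOWED] Unsupported subquery expression: Accessing outer query column is not allowed in this location:
Filter (dept_id#652 = outer(dept_id#646))
+- SubqueryAlias dept
   +- View (`DEPT`, [dept_id#652, dept_name#653, state#654])
   ...
SQLSTATE: 0A000 {code}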






[jira] [Created] (SPARK-47525) Support subquery correlation joining on map attributes

2024-03-22 Thread Jack Chen (Jira)
Jack Chen created SPARK-47525:
-

 Summary: Support subquery correlation joining on map attributes
 Key: SPARK-47525
 URL: https://issues.apache.org/jira/browse/SPARK-47525
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.5.0
Reporter: Jack Chen


Currently, when a subquery is correlated on a condition like `outer_map[1] = 
inner_map[1]`, DecorrelateInnerQuery generates a join on the map itself, 
which is unsupported, so the query cannot run. For example:
 
{code:java}
scala> Seq(Map(0 -> 0)).toDF.createOrReplaceTempView("v")

scala> sql("select v1.value[0] from v v1 where v1.value[0] > (select avg(v2.value[0]) from v v2 where v1.value[1] = v2.value[1])").explain
org.apache.spark.sql.AnalysisException: [UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY.UNSUPPORTED_CORRELATED_REFERENCE_DATA_TYPE] Unsupported subquery expression: Correlated column reference 'v1.value' cannot be map type. SQLSTATE: 0A000; line 1 pos 49
  at org.apache.spark.sql.errors.QueryCompilationErrors$.unsupportedCorrelatedReferenceDataTypeError(QueryCompilationErrors.scala:2463)
  ...
{code}
However, if we rewrite the query to pull out the map access `outer_map[1]` into 
the outer plan, it succeeds:
 
{code:java}
scala> sql("""with tmp as (
  select value[0] as value0, value[1] as value1 from v
)
select v1.value0 from tmp v1 where v1.value0 > (select avg(v2.value0) from tmp v2 where v1.value1 = v2.value1)""").explain
{code}
 

Another possible improvement: even when the data type does support joins, we 
don’t need to join on the full attribute; we can get a better plan by applying 
the same rewrite to pull out the extract expression.






[jira] [Updated] (SPARK-47525) Support subquery correlation joining on map attributes

2024-03-22 Thread Jack Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jack Chen updated SPARK-47525:
--
Description: 
Currently, when a subquery is correlated on a condition like `outer_map[1] = 
inner_map[1]`, DecorrelateInnerQuery generates a join on the map itself, 
which is unsupported, so the query cannot run. For example:
 
{code:java}
scala> Seq(Map(0 -> 0)).toDF.createOrReplaceTempView("v")

scala> sql("select v1.value[0] from v v1 where v1.value[0] > (select avg(v2.value[0]) from v v2 where v1.value[1] = v2.value[1])").explain
org.apache.spark.sql.AnalysisException: 
[UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY.UNSUPPORTED_CORRELATED_REFERENCE_DATA_TYPE]
 Unsupported subquery expression: Correlated column reference 'v1.value' cannot 
be map type. SQLSTATE: 0A000; line 1 pos 49
at 
org.apache.spark.sql.errors.QueryCompilationErrors$.unsupportedCorrelatedReferenceDataTypeError(QueryCompilationErrors.scala:2463)
... {code}

However, if we rewrite the query to pull out the map access `outer_map[1]` into 
the outer plan, it succeeds:
 
{code:java}
scala> sql("""with tmp as (
select value[0] as value0, value[1] as value1 from v
)
select v1.value0 from tmp v1 where v1.value0 > (select avg(v2.value0) from tmp 
v2 where v1.value1 = v2.value1)""").explain{code}
Another possible improvement: even when the data type does support joins, we 
don’t need to join on the full attribute; we can get a better plan by applying 
the same rewrite to pull out the extract expression.

  was:
Currently, when a subquery is correlated on a condition like `outer_map[1] = 
inner_map[1]`, DecorrelateInnerQuery generates a join on the map itself, 
which is unsupported, so the query cannot run. For example:
 
 {{scala> Seq(Map(0 -> 0)).toDF.createOrReplaceTempView("v")
scala> sql("select v1.value[0] from v v1 where v1.value[0] > (select avg(v2.value[0]) from v v2 where v1.value[1] = v2.value[1])").explain
org.apache.spark.sql.AnalysisException: 
[UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY.UNSUPPORTED_CORRELATED_REFERENCE_DATA_TYPE]
 Unsupported subquery expression: Correlated column reference 'v1.value' cannot 
be map type. SQLSTATE: 0A000; line 1 pos 49
  at 
org.apache.spark.sql.errors.QueryCompilationErrors$.unsupportedCorrelatedReferenceDataTypeError(QueryCompilationErrors.scala:2463)
  ...}}
However, if we rewrite the query to pull out the map access `outer_map[1]` into 
the outer plan, it succeeds:
 
 {{scala> sql("""with tmp as (
  select value[0] as value0, value[1] as value1 from v
)
select v1.value0 from tmp v1 where v1.value0 > (select avg(v2.value0) from tmp 
v2 where v1.value1 = v2.value1)""").explain}}
 

Another possible improvement: even when the data type does support joins, we 
don’t need to join on the full attribute; we can get a better plan by applying 
the same rewrite to pull out the extract expression.


> Support subquery correlation joining on map attributes
> --
>
> Key: SPARK-47525
> URL: https://issues.apache.org/jira/browse/SPARK-47525
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Jack Chen
>Priority: Major
>
> Currently, when a subquery is correlated on a condition like `outer_map[1] = 
> inner_map[1]`, DecorrelateInnerQuery generates a join on the map itself, 
> which is unsupported, so the query cannot run. For example:
>  
> {code:java}
> scala> Seq(Map(0 -> 0)).toDF.createOrReplaceTempView("v")
>
> scala> sql("select v1.value[0] from v v1 where v1.value[0] > (select avg(v2.value[0]) from v v2 where v1.value[1] = v2.value[1])").explain
> org.apache.spark.sql.AnalysisException: 
> [UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY.UNSUPPORTED_CORRELATED_REFERENCE_DATA_TYPE]
>  Unsupported subquery expression: Correlated column reference 'v1.value' 
> cannot be map type. SQLSTATE: 0A000; line 1 pos 49
> at 
> org.apache.spark.sql.errors.QueryCompilationErrors$.unsupportedCorrelatedReferenceDataTypeError(QueryCompilationErrors.scala:2463)
> ... {code}
> However, if we rewrite the query to pull out the map access `outer_map[1]` 
> into the outer plan, it succeeds:
>  
> {code:java}
> scala> sql("""with tmp as (
> select value[0] as value0, value[1] as value1 from v
> )
> select v1.value0 from tmp v1 where v1.value0 > (select avg(v2.value0) from 
> tmp v2 where v1.value1 = v2.value1)""").explain{code}
> Another possible improvement: even when the data type does support joins, we 
> don’t need to join on the full attribute; we can get a better plan by 
> applying the same rewrite to pull out the extract expression.




[jira] [Comment Edited] (SPARK-36113) Unify the logic to handle COUNT bug for scalar and lateral subqueries

2024-03-07 Thread Jack Chen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17824521#comment-17824521
 ] 

Jack Chen edited comment on SPARK-36113 at 3/7/24 8:52 PM:
---

The tricky aspect of this is that if we do the count bug handling in 
DecorrelateInnerQuery, we end up adding an extra outer join for scalar 
subqueries with count. This is because scalar subqueries always require a left 
join anyway (even without count): if no data comes out of the subquery, we 
still need to keep the outer row, and that join is added in 
constructLeftJoins. And if we handle the count bug in DecorrelateInnerQuery 
using the existing logic, it would create another left join, which may end up 
in the middle of the subquery plan.

I think to solve this we want to check whether there is a count aggregate at 
the top of the scalar subquery, and handle that case specially to generate a 
single outer join. Perhaps constructLeftJoins can check whether there's a left 
DomainJoin at the top of the subquery plan and flip it to inner, but then we 
also need to make sure the count-bug handling expression 
{{if(isnull(...), 0, cnt)}} ends up in the right place, in the newly built 
outer join.

In general, I think we want to move the logic for determining whether there's 
a count bug earlier, into DecorrelateInnerQuery. That unifies the logic 
between scalar and non-scalar subqueries, and lets us avoid a whole class of 
bugs that come from doing the scalar-subquery count-bug handling later - see 
e.g. [https://github.com/apache/spark/pull/40811] and 
[https://github.com/apache/spark/pull/45125]. And see 
[https://docs.google.com/document/d/1YCce0DtJ6NkMP1QM1nnfYvkiH1CkHoGyMjKQ8MX5bUM/edit]
 for more on why this decorrelation happens in two steps.
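
For reference, the count-bug pattern under discussion, as a minimal sketch 
(reusing the t1/t2 views from SPARK-36115):
{code:sql}
-- t1(c1, c2) = (0, 1), (1, 2);  t2(c1, c2) = (0, 2), (0, 3)
-- For outer rows with no match in t2, the subquery must evaluate to 0, not
-- NULL, which is why the rewrite needs a left outer join plus the
-- if(isnull(cnt), 0, cnt) fixup discussed above.
select c1, (select count(*) from t2 where t2.c1 = t1.c1) as cnt from t1;
-- expected: (0, 2), (1, 0)
{code}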


was (Author: JIRAUSER299035):
The tricky aspect of this is that if we do the count bug handling in 
DecorrelateInnerQuery, we end up adding an extra outer join for scalar 
subqueries with count. This is because scalar subqueries always require a left 
join anyway (even without count) because if there's no data coming out of the 
subquery we still need to keep the row - that's added in constructLeftJoins. 
And if we handle count bug in DecorrelateInnerQuery using the existing logic, 
it would create another left join, which may be in the middle of the subquery 
plan.

I think to solve this we want to check whether this is a count agg at the top 
of the scalar subquery, and specially handle that case to generate a single 
outer join. Perhaps constructLeftJoins can check if there's a left DomainJoin 
at the top of the subquery plan and flip it to inner. This way all the logic 
for determining whether there's a count bug happens earlier in 
DecorrelateInnerQuery, which unifies the logic and lets us avoid a whole 
class of bugs that come from doing the scalar-subquery count-bug handling 
later - see e.g. [https://github.com/apache/spark/pull/40811] and 
[https://github.com/apache/spark/pull/45125]. And see 
[https://docs.google.com/document/d/1YCce0DtJ6NkMP1QM1nnfYvkiH1CkHoGyMjKQ8MX5bUM/edit]
 for more on why this decorrelation happens in two steps.

> Unify the logic to handle COUNT bug for scalar and lateral subqueries
> -
>
> Key: SPARK-36113
> URL: https://issues.apache.org/jira/browse/SPARK-36113
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Allison Wang
>Priority: Major
>
> Currently, for scalar subqueries, the logic to handle the count bug is in 
> `RewriteCorrelatedScalarSubquery` (constructLeftJoins) while lateral 
> subqueries rely on `DecorrelateInnerQuery` to handle the COUNT bug. 
> We should unify them without introducing additional left outer joins for 
> scalar subqueries.
>  






[jira] [Comment Edited] (SPARK-36113) Unify the logic to handle COUNT bug for scalar and lateral subqueries

2024-03-07 Thread Jack Chen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17824521#comment-17824521
 ] 

Jack Chen edited comment on SPARK-36113 at 3/7/24 8:47 PM:
---

The tricky aspect of this is that if we do the count bug handling in 
DecorrelateInnerQuery, we end up adding an extra outer join for scalar 
subqueries with count. This is because scalar subqueries always require a left 
join anyway (even without count) because if there's no data coming out of the 
subquery we still need to keep the row - that's added in constructLeftJoins. 
And if we handle count bug in DecorrelateInnerQuery using the existing logic, 
it would create another left join, which may be in the middle of the subquery 
plan.

I think to solve this we want to check whether this is a count agg at the top 
of the scalar subquery, and specially handle that case to generate a single 
outer join. Perhaps constructLeftJoins can check if there's a left DomainJoin 
at the top of the subquery plan and flip it to inner. This way all the logic 
for determining whether there's a count bug happens earlier in 
DecorrelateInnerQuery, which unifies the logic and lets us avoid a whole 
class of bugs that come from doing the scalar-subquery count-bug handling 
later - see e.g. [https://github.com/apache/spark/pull/40811] and 
[https://github.com/apache/spark/pull/45125]. And see 
[https://docs.google.com/document/d/1YCce0DtJ6NkMP1QM1nnfYvkiH1CkHoGyMjKQ8MX5bUM/edit]
 for more on why this decorrelation happens in two steps.


was (Author: JIRAUSER299035):
The tricky aspect of this is that if we do the count bug handling in 
DecorrelateInnerQuery, we end up adding an extra outer join for scalar 
subqueries with count. This is because scalar subqueries always require a left 
join anyway (even without count) because if there's no data coming out of the 
subquery we still need to keep the row - that's added in constructLeftJoins. 
And if we handle count bug in DecorrelateInnerQuery using the existing logic, 
it would create another left join, which may be in the middle of the subquery 
plan.

I think to solve this we want to check whether this is a count agg at the top 
of the scalar subquery, and specially handle that case to generate a single 
outer join. Adding that outer join still needs to happen in the same place 
(rewriteDomainJoins), but perhaps we can move all the logic for checking 
whether the subquery is affected by the count bug to DecorrelateInnerQuery, 
leaving just a flag for rewriteDomainJoins/constructLeftJoins later on.

> Unify the logic to handle COUNT bug for scalar and lateral subqueries
> -
>
> Key: SPARK-36113
> URL: https://issues.apache.org/jira/browse/SPARK-36113
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Allison Wang
>Priority: Major
>
> Currently, for scalar subqueries, the logic to handle the count bug is in 
> `RewriteCorrelatedScalarSubquery` (constructLeftJoins) while lateral 
> subqueries rely on `DecorrelateInnerQuery` to handle the COUNT bug. 
> We should unify them without introducing additional left outer joins for 
> scalar subqueries.
>  






[jira] [Comment Edited] (SPARK-36113) Unify the logic to handle COUNT bug for scalar and lateral subqueries

2024-03-07 Thread Jack Chen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17824521#comment-17824521
 ] 

Jack Chen edited comment on SPARK-36113 at 3/7/24 8:26 PM:
---

The tricky aspect of this is that if we do the count bug handling in 
DecorrelateInnerQuery, we end up adding an extra outer join for scalar 
subqueries with count. This is because scalar subqueries always require a left 
join anyway (even without count) because if there's no data coming out of the 
subquery we still need to keep the row - that's added in constructLeftJoins. 
And if we handle count bug in DecorrelateInnerQuery using the existing logic, 
it would create another left join, which may be in the middle of the subquery 
plan.

I think to solve this we want to check whether this is a count agg at the top 
of the scalar subquery, and specially handle that case to generate a single 
outer join. Adding that outer join still needs to happen in the same place 
(rewriteDomainJoins), but perhaps we can move all the logic for checking 
whether the subquery is affected by the count bug to DecorrelateInnerQuery, 
leaving just a flag for rewriteDomainJoins/constructLeftJoins later on.


was (Author: JIRAUSER299035):
The tricky aspect of this is that if we do the count bug handling in 
DecorrelateInnerQuery, we end up adding an extra outer join for scalar 
subqueries with count. This is because scalar subqueries always require a left 
join anyway (even without count) because if there's no data coming out of the 
subquery we still need to keep the row. And if we handle count bug in 
DecorrelateInnerQuery using the existing logic, it would create another left 
join, which may be in the middle of the subquery plan.

I think to solve this we want to check whether this is a count agg at the top 
of the scalar subquery, and specially handle that case to generate a single 
outer join. Adding that outer join still needs to happen in the same place 
(rewriteDomainJoins), but perhaps we can move all the logic for checking 
whether the subquery is affected by the count bug to DecorrelateInnerQuery, 
leaving just a flag for rewriteDomainJoins later on.

> Unify the logic to handle COUNT bug for scalar and lateral subqueries
> -
>
> Key: SPARK-36113
> URL: https://issues.apache.org/jira/browse/SPARK-36113
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Allison Wang
>Priority: Major
>
> Currently, for scalar subqueries, the logic to handle the count bug is in 
> `RewriteCorrelatedScalarSubquery` (constructLeftJoins) while lateral 
> subqueries rely on `DecorrelateInnerQuery` to handle the COUNT bug. 
> We should unify them without introducing additional left outer joins for 
> scalar subqueries.
>  






[jira] [Comment Edited] (SPARK-36113) Unify the logic to handle COUNT bug for scalar and lateral subqueries

2024-03-07 Thread Jack Chen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17824521#comment-17824521
 ] 

Jack Chen edited comment on SPARK-36113 at 3/7/24 8:09 PM:
---

The tricky aspect of this is that if we do the count bug handling in 
DecorrelateInnerQuery, we end up adding an extra outer join for scalar 
subqueries with count. This is because scalar subqueries always require a left 
join anyway (even without count) because if there's no data coming out of the 
subquery we still need to keep the row. And if we handle count bug in 
DecorrelateInnerQuery using the existing logic, it would create another left 
join, which may be in the middle of the subquery plan.

I think to solve this we want to check whether this is a count agg at the top 
of the scalar subquery, and specially handle that case to generate a single 
outer join. Adding that outer join still needs to happen in the same place 
(rewriteDomainJoins), but perhaps we can move all the logic for checking 
whether the subquery is affected by the count bug to DecorrelateInnerQuery, 
leaving just a flag for rewriteDomainJoins later on.


was (Author: JIRAUSER299035):
The tricky aspect of this is that if we do the count bug handling in 
DecorrelateInnerQuery, we end up adding an extra outer join for scalar 
subqueries with count. This is because scalar subqueries always require a left 
join anyway (even without count) because if there's no data coming out of the 
subquery we still need to keep the row. And if we handle count bug in 
DecorrelateInnerQuery using the existing logic, it would create another left 
join, which may be in the middle of the subquery plan.

I think to solve this we want to check whether this is a count agg at the top 
of the scalar subquery, and specially handle that case to generate a single 
outer join.

> Unify the logic to handle COUNT bug for scalar and lateral subqueries
> -
>
> Key: SPARK-36113
> URL: https://issues.apache.org/jira/browse/SPARK-36113
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Allison Wang
>Priority: Major
>
> Currently, for scalar subqueries, the logic to handle the count bug is in 
> `RewriteCorrelatedScalarSubquery` (constructLeftJoins) while lateral 
> subqueries rely on `DecorrelateInnerQuery` to handle the COUNT bug. 
> We should unify them without introducing additional left outer joins for 
> scalar subqueries.
>  






[jira] [Commented] (SPARK-36113) Unify the logic to handle COUNT bug for scalar and lateral subqueries

2024-03-07 Thread Jack Chen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17824521#comment-17824521
 ] 

Jack Chen commented on SPARK-36113:
---

The tricky aspect of this is that if we do the count bug handling in 
DecorrelateInnerQuery, we end up adding an extra outer join for scalar 
subqueries with count. This is because scalar subqueries always require a left 
join anyway (even without count) because if there's no data coming out of the 
subquery we still need to keep the row. And if we handle count bug in 
DecorrelateInnerQuery using the existing logic, it would create another left 
join, which may be in the middle of the subquery plan.

I think to solve this we want to check whether this is a count agg at the top 
of the scalar subquery, and specially handle that case to generate a single 
outer join.

> Unify the logic to handle COUNT bug for scalar and lateral subqueries
> -
>
> Key: SPARK-36113
> URL: https://issues.apache.org/jira/browse/SPARK-36113
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Allison Wang
>Priority: Major
>
> Currently, for scalar subqueries, the logic to handle the count bug is in 
> `RewriteCorrelatedScalarSubquery` (constructLeftJoins) while lateral 
> subqueries rely on `DecorrelateInnerQuery` to handle the COUNT bug. 
> We should unify them without introducing additional left outer joins for 
> scalar subqueries.
>  






[jira] [Updated] (SPARK-46446) Correctness bug in correlated subquery with OFFSET

2023-12-18 Thread Jack Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jack Chen updated SPARK-46446:
--
Description: 
Subqueries with correlation under LIMIT with OFFSET have a correctness bug, 
introduced recently when support for correlation under OFFSET was enabled but 
the OFFSET was not handled correctly. (So we went from unsupported, where the 
query throws an error, to wrong results.)

It’s a bug in all types of correlated subqueries: scalar, lateral, IN, and 
EXISTS.

 

It's easy to repro with a query like
{code:java}
create table x(x1 int, x2 int);
insert into x values (1, 1), (2, 2);
create table y(y1 int, y2 int);
insert into y values (1, 1), (1, 2), (2, 4);


select * from x where exists (select * from y where x1 = y1 limit 1 offset 
2){code}
Correct result: empty set; see Postgres: 
[https://www.db-fiddle.com/f/dtXNn7hwDnemiCTUhvwgYM/0]

Spark result: Array([2,2])

 

The 
[PR|https://github.com/apache/spark/pull/43111/files/324a106611e6d62c31535cfc43863fdaa16e5dda#diff-583171e935b2dc349378063a5841c5b98b30a2d57ac3743a9eccfe7bffcb8f2aR1403]
 where it was introduced added a test for it, but the golden file results for 
the test actually were incorrect and we didn't notice. (The bug was initially 
found by https://github.com/apache/spark/pull/44084)

I'll work on both:
 * Adding support for offset in DecorrelateInnerQuery (the transformation is 
into a filter on a row_number window function, similar to limit; see the 
sketch below).

 * Adding a feature flag to enable/disable offset-in-subquery support.
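
For illustration, a sketch of the intended rewrite in plain SQL (illustrative 
only: the real transformation operates on the logical plan, and the ordering 
within each partition is arbitrary since the subquery has no ORDER BY):
{code:sql}
-- Rewrite of: exists (select * from y where x1 = y1 limit 1 offset 2)
-- Number the matching inner rows per correlation key, then keep rows with
-- offset < rn <= offset + limit.
select * from x where exists (
  select 1
  from (select y.*, row_number() over (partition by y1 order by y1) as rn
        from y) w
  where x1 = w.y1 and w.rn > 2 and w.rn <= 2 + 1
);
-- With the data above this returns the empty set, matching Postgres.
{code}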

  was:
Subqueries with correlation under LIMIT with OFFSET have a correctness bug, 
introduced recently when support for correlation under OFFSET was enabled but 
the OFFSET was not handled correctly. (So we went from unsupported, where the 
query throws an error, to wrong results.)

It’s a bug in all types of correlated subqueries: scalar, lateral, IN, and 
EXISTS.

 

It's easy to repro with a query like
{code:java}
create table x(x1 int, x2 int);
insert into x values (1, 1), (2, 2);
create table y(y1 int, y2 int);
insert into y values (1, 1), (1, 2), (2, 4);


select * from x where exists (select * from y where x1 = y1 limit 1 offset 
2){code}
Correct result: empty set; see Postgres: 
[https://www.db-fiddle.com/f/dtXNn7hwDnemiCTUhvwgYM/0]

Spark result: Array([2,2])

 

The 
[PR|https://github.com/apache/spark/pull/43111/files/324a106611e6d62c31535cfc43863fdaa16e5dda#diff-583171e935b2dc349378063a5841c5b98b30a2d57ac3743a9eccfe7bffcb8f2aR1403]
 where it was introduced added a test for it, but the golden file results for 
the test actually were incorrect and we didn't notice.

I'll work on both:
 * Adding support for offset in DecorrelateInnerQuery (the transformation is 
into a filter on row_number window function, similar to limit).

 * Adding a feature flag to enable/disable offset in subquery support


> Correctness bug in correlated subquery with OFFSET
> --
>
> Key: SPARK-46446
> URL: https://issues.apache.org/jira/browse/SPARK-46446
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jack Chen
>Priority: Major
>
> Subqueries with correlation under LIMIT with OFFSET have a correctness bug, 
> introduced recently when support for correlation under OFFSET was enabled but 
> the OFFSET was not handled correctly. (So we went from unsupported, where the 
> query throws an error, to wrong results.)
> It’s a bug in all types of correlated subqueries: scalar, lateral, IN, and 
> EXISTS.
>  
> It's easy to repro with a query like
> {code:java}
> create table x(x1 int, x2 int);
> insert into x values (1, 1), (2, 2);
> create table y(y1 int, y2 int);
> insert into y values (1, 1), (1, 2), (2, 4);
> select * from x where exists (select * from y where x1 = y1 limit 1 offset 
> 2){code}
> Correct result: empty set; see Postgres: 
> [https://www.db-fiddle.com/f/dtXNn7hwDnemiCTUhvwgYM/0]
> Spark result: Array([2,2])
>  
> The 
> [PR|https://github.com/apache/spark/pull/43111/files/324a106611e6d62c31535cfc43863fdaa16e5dda#diff-583171e935b2dc349378063a5841c5b98b30a2d57ac3743a9eccfe7bffcb8f2aR1403]
>  where it was introduced added a test for it, but the golden file results for 
> the test actually were incorrect and we didn't notice. (The bug was initially 
> found by https://github.com/apache/spark/pull/44084)
> I'll work on both:
>  * Adding support for offset in DecorrelateInnerQuery (the transformation is 
> into a filter on row_number window function, similar to limit).
>  * Adding a feature flag to enable/disable offset in subquery support






[jira] [Updated] (SPARK-46446) Correctness bug in correlated subquery with OFFSET

2023-12-18 Thread Jack Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jack Chen updated SPARK-46446:
--
Description: 
Subqueries with correlation under LIMIT with OFFSET have a correctness bug, 
introduced recently when support for correlation under OFFSET was enabled but 
the OFFSET was not handled correctly. (So we went from unsupported, where the 
query throws an error, to wrong results.)

It’s a bug in all types of correlated subqueries: scalar, lateral, IN, and 
EXISTS.

 

It's easy to repro with a query like
{code:java}
create table x(x1 int, x2 int);
insert into x values (1, 1), (2, 2);
create table y(y1 int, y2 int);
insert into y values (1, 1), (1, 2), (2, 4);


select * from x where exists (select * from y where x1 = y1 limit 1 offset 
2){code}
Correct result: empty set; see Postgres: 
[https://www.db-fiddle.com/f/dtXNn7hwDnemiCTUhvwgYM/0]

Spark result: Array([2,2])

 

The 
[PR|https://github.com/apache/spark/pull/43111/files/324a106611e6d62c31535cfc43863fdaa16e5dda#diff-583171e935b2dc349378063a5841c5b98b30a2d57ac3743a9eccfe7bffcb8f2aR1403]
 where it was introduced added a test for it, but the golden file results for 
the test actually were incorrect and we didn't notice.

I'll work on both:
 * Adding support for offset in DecorrelateInnerQuery (the transformation is 
into a filter on row_number window function, similar to limit).

 * Adding a feature flag to enable/disable offset in subquery support

  was:
Subqueries with correlation under LIMIT with OFFSET have a correctness bug, 
introduced recently when support for correlation under OFFSET was enabled but 
the OFFSET was not handled correctly. (So we went from unsupported, where the 
query throws an error, to wrong results.)

It’s a bug in all types of correlated subqueries: scalar, lateral, IN, and 
EXISTS.

It's easy to repro with a query like
{code:java}
SELECT *
FROM   emp
JOIN LATERAL (SELECT dept.dept_name
              FROM   dept
              WHERE  emp.dept_id = dept.dept_id
              LIMIT 5 OFFSET 3); {code}
The 
[PR|https://github.com/apache/spark/pull/43111/files/324a106611e6d62c31535cfc43863fdaa16e5dda#diff-583171e935b2dc349378063a5841c5b98b30a2d57ac3743a9eccfe7bffcb8f2aR1403]
 where it was introduced added a test for it, but the golden file results for 
the test actually were incorrect and we didn't notice.

I'll work on both:
 * Adding support for offset in DecorrelateInnerQuery (the transformation is 
into a filter on row_number window function, similar to limit).

 * Adding a feature flag to enable/disable offset in subquery support


> Correctness bug in correlated subquery with OFFSET
> --
>
> Key: SPARK-46446
> URL: https://issues.apache.org/jira/browse/SPARK-46446
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jack Chen
>Priority: Major
>
> Subqueries with correlation under LIMIT with OFFSET have a correctness bug, 
> introduced recently when support for correlation under OFFSET was enabled but 
> the OFFSET was not handled correctly. (So we went from unsupported, where the 
> query throws an error, to wrong results.)
> It’s a bug in all types of correlated subqueries: scalar, lateral, IN, and 
> EXISTS.
>  
> It's easy to repro with a query like
> {code:java}
> create table x(x1 int, x2 int);
> insert into x values (1, 1), (2, 2);
> create table y(y1 int, y2 int);
> insert into y values (1, 1), (1, 2), (2, 4);
> select * from x where exists (select * from y where x1 = y1 limit 1 offset 
> 2){code}
> Correct result: empty set; see Postgres: 
> [https://www.db-fiddle.com/f/dtXNn7hwDnemiCTUhvwgYM/0]
> Spark result: Array([2,2])
>  
> The 
> [PR|https://github.com/apache/spark/pull/43111/files/324a106611e6d62c31535cfc43863fdaa16e5dda#diff-583171e935b2dc349378063a5841c5b98b30a2d57ac3743a9eccfe7bffcb8f2aR1403]
>  where it was introduced added a test for it, but the golden file results for 
> the test actually were incorrect and we didn't notice.
> I'll work on both:
>  * Adding support for offset in DecorrelateInnerQuery (the transformation is 
> into a filter on row_number window function, similar to limit).
>  * Adding a feature flag to enable/disable offset in subquery support






[jira] [Created] (SPARK-46446) Correctness bug in correlated subquery with OFFSET

2023-12-18 Thread Jack Chen (Jira)
Jack Chen created SPARK-46446:
-

 Summary: Correctness bug in correlated subquery with OFFSET
 Key: SPARK-46446
 URL: https://issues.apache.org/jira/browse/SPARK-46446
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 4.0.0
Reporter: Jack Chen


Subqueries with correlation under LIMIT with OFFSET have a correctness bug, 
introduced recently when support for correlation under OFFSET was enabled but 
the OFFSET was not handled correctly. (So we went from unsupported, where the 
query throws an error, to wrong results.)

It’s a bug in all types of correlated subqueries: scalar, lateral, IN, and 
EXISTS.

It's easy to repro with a query like
{code:java}
SELECT *
FROM   emp
JOIN LATERAL (SELECT dept.dept_name
              FROM   dept
              WHERE  emp.dept_id = dept.dept_id
              LIMIT 5 OFFSET 3); {code}
The 
[PR|https://github.com/apache/spark/pull/43111/files/324a106611e6d62c31535cfc43863fdaa16e5dda#diff-583171e935b2dc349378063a5841c5b98b30a2d57ac3743a9eccfe7bffcb8f2aR1403]
 where it was introduced added a test for it, but the golden file results for 
the test actually were incorrect and we didn't notice.

I'll work on both:
 * Adding support for offset in DecorrelateInnerQuery (the transformation is 
into a filter on row_number window function, similar to limit).

 * Adding a feature flag to enable/disable offset in subquery support






[jira] [Commented] (SPARK-45009) Correlated EXISTS subqueries in join ON condition unsupported and fail with internal error

2023-08-31 Thread Jack Chen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17761003#comment-17761003
 ] 

Jack Chen commented on SPARK-45009:
---

https://github.com/apache/spark/pull/42725

> Correlated EXISTS subqueries in join ON condition unsupported and fail with 
> internal error
> --
>
> Key: SPARK-45009
> URL: https://issues.apache.org/jira/browse/SPARK-45009
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Jack Chen
>Priority: Major
>
> They are not handled in decorrelation, and we also don’t have any checks to 
> block them, so these queries have outer references in the query plan leading 
> to internal errors:
> {code:java}
> CREATE TEMP VIEW x(x1, x2) AS VALUES (0, 1), (1, 2);
> CREATE TEMP VIEW y(y1, y2) AS VALUES (0, 2), (0, 3);
> CREATE TEMP VIEW z(z1, z2) AS VALUES (0, 2), (0, 3);
> select * from x left join y on x1 = y1 and exists (select * from z where z1 = 
> x1)
> Error occurred during query planning: 
> org.apache.spark.sql.catalyst.plans.logical.Filter cannot be cast to 
> org.apache.spark.sql.execution.SparkPlan {code}
> PullupCorrelatedPredicates#apply and RewritePredicateSubquery only handle 
> subqueries in UnaryNode; they seem to assume that subqueries cannot occur 
> elsewhere, such as in a join ON condition.
> We will need to decide whether to block them properly in analysis (i.e. give 
> a proper error for them), or see if we can add support for them.
> Also note, scalar subqueries in the ON condition are unsupported too but 
> return a proper error.






[jira] [Created] (SPARK-45009) Correlated EXISTS subqueries in join ON condition unsupported and fail with internal error

2023-08-29 Thread Jack Chen (Jira)
Jack Chen created SPARK-45009:
-

 Summary: Correlated EXISTS subqueries in join ON condition 
unsupported and fail with internal error
 Key: SPARK-45009
 URL: https://issues.apache.org/jira/browse/SPARK-45009
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.0
Reporter: Jack Chen


They are not handled in decorrelation, and we also don’t have any checks to 
block them, so these queries have outer references in the query plan leading to 
internal errors:
{code:java}
CREATE TEMP VIEW x(x1, x2) AS VALUES (0, 1), (1, 2);
CREATE TEMP VIEW y(y1, y2) AS VALUES (0, 2), (0, 3);
CREATE TEMP VIEW z(z1, z2) AS VALUES (0, 2), (0, 3);
select * from x left join y on x1 = y1 and exists (select * from z where z1 = 
x1)

Error occurred during query planning: 
org.apache.spark.sql.catalyst.plans.logical.Filter cannot be cast to 
org.apache.spark.sql.execution.SparkPlan {code}
PullupCorrelatedPredicates#apply and RewritePredicateSubquery only handle 
subqueries in UnaryNode; they seem to assume that subqueries cannot occur 
elsewhere, such as in a join ON condition.

We will need to decide whether to block them properly in analysis (i.e. give a 
proper error for them), or see if we can add support for them.

Also note, scalar subqueries in the ON condition are unsupported too but return 
a proper error.






[jira] [Updated] (SPARK-44550) Wrong semantics for null IN (empty list)

2023-07-25 Thread Jack Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jack Chen updated SPARK-44550:
--
Fix Version/s: (was: 3.5.0)

> Wrong semantics for null IN (empty list)
> 
>
> Key: SPARK-44550
> URL: https://issues.apache.org/jira/browse/SPARK-44550
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Jack Chen
>Assignee: Jack Chen
>Priority: Major
>
> {{null IN (empty list)}} incorrectly evaluates to null, when it should 
> evaluate to false. (The reason it should be false is that {{a IN (b1, b2)}} 
> is defined as {{a = b1 OR a = b2}}, and an empty IN list is treated as an 
> empty OR, which is false. This is specified by ANSI SQL.)
> Many places in Spark execution (In, InSet, InSubquery) and optimization 
> (OptimizeIn, NullPropagation) implement this wrong behavior. Also note that 
> Spark's behavior for null IN (empty list) is inconsistent in some 
> places: literal IN lists generally return null (incorrect), while IN/NOT IN 
> subqueries mostly return false/true, respectively (correct) in this case.
> This is a longstanding correctness issue which has existed since null support 
> for IN expressions was first added to Spark.
> Doc with more details: 
> [https://docs.google.com/document/d/1k8AY8oyT-GI04SnP7eXttPDnDj-Ek-c3luF2zL6DPNU/edit]
>  






[jira] [Created] (SPARK-44551) Wrong semantics for null IN (empty list) - IN expression execution

2023-07-25 Thread Jack Chen (Jira)
Jack Chen created SPARK-44551:
-

 Summary: Wrong semantics for null IN (empty list) - IN expression 
execution
 Key: SPARK-44551
 URL: https://issues.apache.org/jira/browse/SPARK-44551
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.0
Reporter: Jack Chen


{{null IN (empty list)}} incorrectly evaluates to null, when it should 
evaluate to false. (The reason it should be false is that {{a IN (b1, b2)}} is 
defined as {{a = b1 OR a = b2}}, and an empty IN list is treated as an empty 
OR, which is false. This is specified by ANSI SQL.)
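
A minimal illustration of the expected semantics (a sketch; empty IN lists 
cannot be written literally, but an IN subquery over an empty result exercises 
the same rule):
{code:sql}
-- a IN (b1, ..., bn) is defined as (a = b1 OR ... OR a = bn);
-- with an empty list the OR is empty, which is FALSE regardless of a.
select null in (select 1 from range(1) where false);
-- expected: false, not NULL
{code}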

Many places in Spark execution (In, InSet, InSubquery) and optimization 
(OptimizeIn, NullPropagation) implement this wrong behavior. Also note that 
Spark's behavior for null IN (empty list) is inconsistent in some places: 
literal IN lists generally return null (incorrect), while IN/NOT IN 
subqueries mostly return false/true, respectively (correct) in this case.

This is a longstanding correctness issue which has existed since null support 
for IN expressions was first added to Spark.

Doc with more details: 
[https://docs.google.com/document/d/1k8AY8oyT-GI04SnP7eXttPDnDj-Ek-c3luF2zL6DPNU/edit]
 






[jira] [Updated] (SPARK-44431) Wrong semantics for null IN (empty list) - optimization rules

2023-07-25 Thread Jack Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jack Chen updated SPARK-44431:
--
Parent: SPARK-44550
Issue Type: Sub-task  (was: Bug)

> Wrong semantics for null IN (empty list) - optimization rules
> -
>
> Key: SPARK-44431
> URL: https://issues.apache.org/jira/browse/SPARK-44431
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Jack Chen
>Assignee: Jack Chen
>Priority: Major
> Fix For: 3.5.0
>
>
> {{null IN (empty list)}} incorrectly evaluates to null, when it should 
> evaluate to false. (The reason it should be false is that {{a IN (b1, b2)}} 
> is defined as {{a = b1 OR a = b2}}, and an empty IN list is treated as an 
> empty OR, which is false. This is specified by ANSI SQL.)
> Many places in Spark execution (In, InSet, InSubquery) and optimization 
> (OptimizeIn, NullPropagation) implement this wrong behavior. Also note that 
> Spark's behavior for null IN (empty list) is inconsistent in some 
> places: literal IN lists generally return null (incorrect), while IN/NOT IN 
> subqueries mostly return false/true, respectively (correct) in this case.
> This is a longstanding correctness issue which has existed since null support 
> for IN expressions was first added to Spark.
> Doc with more details: 
> [https://docs.google.com/document/d/1k8AY8oyT-GI04SnP7eXttPDnDj-Ek-c3luF2zL6DPNU/edit]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44431) Wrong semantics for null IN (empty list) - optimization rules

2023-07-25 Thread Jack Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jack Chen updated SPARK-44431:
--
Summary: Wrong semantics for null IN (empty list) - optimization rules  
(was: Wrong semantics for null IN (empty list))

> Wrong semantics for null IN (empty list) - optimization rules
> -
>
> Key: SPARK-44431
> URL: https://issues.apache.org/jira/browse/SPARK-44431
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Jack Chen
>Assignee: Jack Chen
>Priority: Major
> Fix For: 3.5.0
>
>
> {{null IN (empty list)}} incorrectly evaluates to null, when it should 
> evaluate to false. (The reason it should be false is because a IN (b1, b2) is 
> defined as a = b1 OR a = b2, and an empty IN list is treated as an empty OR 
> which is false. This is specified by ANSI SQL.)
> Many places in Spark execution (In, InSet, InSubquery) and optimization 
> (OptimizeIn, NullPropagation) implemented this wrong behavior. Also note that 
> the Spark behavior for the null IN (empty list) is inconsistent in some 
> places - literal IN lists generally return null (incorrect), while IN/NOT IN 
> subqueries mostly return false/true, respectively (correct) in this case.
> This is a longstanding correctness issue which has existed since null support 
> for IN expressions was first added to Spark.
> Doc with more details: 
> [https://docs.google.com/document/d/1k8AY8oyT-GI04SnP7eXttPDnDj-Ek-c3luF2zL6DPNU/edit]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44550) Wrong semantics for null IN (empty list)

2023-07-25 Thread Jack Chen (Jira)
Jack Chen created SPARK-44550:
-

 Summary: Wrong semantics for null IN (empty list)
 Key: SPARK-44550
 URL: https://issues.apache.org/jira/browse/SPARK-44550
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.4.0
Reporter: Jack Chen
Assignee: Jack Chen
 Fix For: 3.5.0


{{null IN (empty list)}} incorrectly evaluates to null, when it should evaluate 
to false. (The reason it should be false is because a IN (b1, b2) is defined as 
a = b1 OR a = b2, and an empty IN list is treated as an empty OR which is 
false. This is specified by ANSI SQL.)

Many places in Spark execution (In, InSet, InSubquery) and optimization 
(OptimizeIn, NullPropagation) implemented this wrong behavior. Also note that 
the Spark behavior for the null IN (empty list) is inconsistent in some places 
- literal IN lists generally return null (incorrect), while IN/NOT IN 
subqueries mostly return false/true, respectively (correct) in this case.

This is a longstanding correctness issue which has existed since null support 
for IN expressions was first added to Spark.

Doc with more details: 
[https://docs.google.com/document/d/1k8AY8oyT-GI04SnP7eXttPDnDj-Ek-c3luF2zL6DPNU/edit]
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44448) Wrong results for dense_rank() <= k from InferWindowGroupLimit and DenseRankLimitIterator

2023-07-18 Thread Jack Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jack Chen updated SPARK-44448:
--
Affects Version/s: 3.5.0
   (was: 3.4.0)

> Wrong results for dense_rank() <= k from InferWindowGroupLimit and 
> DenseRankLimitIterator
> -
>
> Key: SPARK-44448
> URL: https://issues.apache.org/jira/browse/SPARK-44448
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Jack Chen
>Priority: Major
>
> Top-k filters on a dense_rank() window function return wrong results, due to 
> a bug in optimization InferWindowGroupLimit, specifically in the code for 
> DenseRankLimitIterator, introduced in 
> https://issues.apache.org/jira/browse/SPARK-37099.
> Repro:
> {code:java}
> create or replace temp view t1 (p, o) as values (1, 1), (1, 1), (1, 2), (2, 
> 1), (2, 1), (2, 2);
> select * from (select *, dense_rank() over (partition by p order by o) as rnk 
> from t1) where rnk = 1;{code}
> Spark result:
> {code:java}
> [1,1,1]
> [1,1,1]
> [2,1,1]{code}
> Correct result:
> {code:java}
> [1,1,1]
> [1,1,1]
> [2,1,1]
> [2,1,1]{code}
>  
> The bug is in {{{}DenseRankLimitIterator{}}}: it fails to reset state 
> properly when transitioning from one window partition to the next. {{reset}} 
> only resets {{{}rank = 0{}}}; what it is missing is resetting 
> {{{}currentRankRow = null{}}}. This means that when processing the second and 
> later window partitions, the rank incorrectly gets incremented based on 
> comparing the ordering of the last row of the previous partition to the first 
> row of the new partition.
> This means that a dense_rank window func that has more than one window 
> partition and more than one row with dense_rank = 1 in the second or later 
> partitions can give wrong results when optimized.
> ({{{}RankLimitIterator{}}} narrowly avoids this bug by happenstance: the 
> first row in the new partition will try to increment rank, but increments it 
> by the value of count, which is 0, so it happens to work by accident.)
> Unfortunately, tests for the optimization only had a single row per rank, so 
> they did not catch the bug, as it requires multiple rows per rank.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44448) Wrong results for dense_rank() <= k from InferWindowGroupLimit and DenseRankLimitIterator

2023-07-16 Thread Jack Chen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17743594#comment-17743594
 ] 

Jack Chen commented on SPARK-44448:
---

Fix PR: https://github.com/apache/spark/pull/42026

> Wrong results for dense_rank() <= k from InferWindowGroupLimit and 
> DenseRankLimitIterator
> -
>
> Key: SPARK-44448
> URL: https://issues.apache.org/jira/browse/SPARK-44448
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Jack Chen
>Priority: Major
>
> Top-k filters on a dense_rank() window function return wrong results, due to 
> a bug in optimization InferWindowGroupLimit, specifically in the code for 
> DenseRankLimitIterator, introduced in 
> https://issues.apache.org/jira/browse/SPARK-37099.
> Repro:
> {code:java}
> create or replace temp view t1 (p, o) as values (1, 1), (1, 1), (1, 2), (2, 
> 1), (2, 1), (2, 2);
> select * from (select *, dense_rank() over (partition by p order by o) as rnk 
> from t1) where rnk = 1;{code}
> Spark result:
> {code:java}
> [1,1,1]
> [1,1,1]
> [2,1,1]{code}
> Correct result:
> {code:java}
> [1,1,1]
> [1,1,1]
> [2,1,1]
> [2,1,1]{code}
>  
> The bug is in {{{}DenseRankLimitIterator{}}}: it fails to reset state 
> properly when transitioning from one window partition to the next. {{reset}} 
> only resets {{{}rank = 0{}}}; what it is missing is resetting 
> {{{}currentRankRow = null{}}}. This means that when processing the second and 
> later window partitions, the rank incorrectly gets incremented based on 
> comparing the ordering of the last row of the previous partition to the first 
> row of the new partition.
> This means that a dense_rank window func that has more than one window 
> partition and more than one row with dense_rank = 1 in the second or later 
> partitions can give wrong results when optimized.
> ({{{}RankLimitIterator{}}} narrowly avoids this bug by happenstance: the 
> first row in the new partition will try to increment rank, but increments it 
> by the value of count, which is 0, so it happens to work by accident.)
> Unfortunately, tests for the optimization only had a single row per rank, so 
> they did not catch the bug, as it requires multiple rows per rank.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44448) Wrong results for dense_rank() <= k from InferWindowGroupLimit and DenseRankLimitIterator

2023-07-16 Thread Jack Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jack Chen updated SPARK-44448:
--
Description: 
Top-k filters on a dense_rank() window function return wrong results, due to a 
bug in optimization InferWindowGroupLimit, specifically in the code for 
DenseRankLimitIterator, introduced in 
https://issues.apache.org/jira/browse/SPARK-37099.

Repro:
{code:java}
create or replace temp view t1 (p, o) as values (1, 1), (1, 1), (1, 2), (2, 1), 
(2, 1), (2, 2);

select * from (select *, dense_rank() over (partition by p order by o) as rnk 
from t1) where rnk = 1;{code}
Spark result:
{code:java}
[1,1,1]
[1,1,1]
[2,1,1]{code}
Correct result:
{code:java}
[1,1,1]
[1,1,1]
[2,1,1]
[2,1,1]{code}
 

The bug is in {{{}DenseRankLimitIterator{}}}: it fails to reset state properly 
when transitioning from one window partition to the next. {{reset}} only resets 
{{{}rank = 0{}}}; what it is missing is resetting {{{}currentRankRow = null{}}}. 
This means that when processing the second and later window partitions, the 
rank incorrectly gets incremented based on comparing the ordering of the last 
row of the previous partition to the first row of the new partition.

This means that a dense_rank window func that has more than one window 
partition and more than one row with dense_rank = 1 in the second or later 
partitions can give wrong results when optimized.

({{{}RankLimitIterator{}}} narrowly avoids this bug by happenstance: the first 
row in the new partition will try to increment rank, but increments it by the 
value of count, which is 0, so it happens to work by accident.)

Unfortunately, tests for the optimization only had a single row per rank, so 
they did not catch the bug, as it requires multiple rows per rank.

  was:
Top-k filters on a dense_rank() window function return wrong results, due to a 
bug in optimization InferWindowGroupLimit, specifically in the code for 
DenseRankLimitIterator.

Repro:
{code:java}
create or replace temp view t1 (p, o) as values (1, 1), (1, 1), (1, 2), (2, 1), 
(2, 1), (2, 2);

select * from (select *, dense_rank() over (partition by p order by o) as rnk 
from t1) where rnk = 1;{code}
Spark result:
{code:java}
[1,1,1]
[1,1,1]
[2,1,1]{code}
Correct result:
{code:java}
[1,1,1]
[1,1,1]
[2,1,1]
[2,1,1]{code}
 

The bug is in {{{}DenseRankLimitIterator{}}}: it fails to reset state properly 
when transitioning from one window partition to the next. {{reset}} only resets 
{{{}rank = 0{}}}; what it is missing is resetting {{{}currentRankRow = null{}}}. 
This means that when processing the second and later window partitions, the 
rank incorrectly gets incremented based on comparing the ordering of the last 
row of the previous partition to the first row of the new partition.

This means that a dense_rank window func that has more than one window 
partition and more than one row with dense_rank = 1 in the second or later 
partitions can give wrong results when optimized.

({{{}RankLimitIterator{}}} narrowly avoids this bug by happenstance: the first 
row in the new partition will try to increment rank, but increments it by the 
value of count, which is 0, so it happens to work by accident.)

Unfortunately, tests for the optimization only had a single row per rank, so 
they did not catch the bug, as it requires multiple rows per rank.


> Wrong results for dense_rank() <= k from InferWindowGroupLimit and 
> DenseRankLimitIterator
> -
>
> Key: SPARK-44448
> URL: https://issues.apache.org/jira/browse/SPARK-44448
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Jack Chen
>Priority: Major
>
> Top-k filters on a dense_rank() window function return wrong results, due to 
> a bug in optimization InferWindowGroupLimit, specifically in the code for 
> DenseRankLimitIterator, introduced in 
> https://issues.apache.org/jira/browse/SPARK-37099.
> Repro:
> {code:java}
> create or replace temp view t1 (p, o) as values (1, 1), (1, 1), (1, 2), (2, 
> 1), (2, 1), (2, 2);
> select * from (select *, dense_rank() over (partition by p order by o) as rnk 
> from t1) where rnk = 1;{code}
> Spark result:
> {code:java}
> [1,1,1]
> [1,1,1]
> [2,1,1]{code}
> Correct result:
> {code:java}
> [1,1,1]
> [1,1,1]
> [2,1,1]
> [2,1,1]{code}
>  
> The bug is in {{{}DenseRankLimitIterator{}}}: it fails to reset state 
> properly when transitioning from one window partition to the next. {{reset}} 
> only resets {{{}rank = 0{}}}; what it is missing is resetting 
> {{{}currentRankRow = null{}}}. This means that when processing the second and 
> later window partitions, the rank incorrectly gets incremented based on 
> comparing the ordering of the last row of the previous partition to the first 
> row of the new partition.

[jira] [Created] (SPARK-44448) Wrong results for dense_rank() <= k from InferWindowGroupLimit and DenseRankLimitIterator

2023-07-16 Thread Jack Chen (Jira)
Jack Chen created SPARK-44448:
-

 Summary: Wrong results for dense_rank() <= k from 
InferWindowGroupLimit and DenseRankLimitIterator
 Key: SPARK-44448
 URL: https://issues.apache.org/jira/browse/SPARK-44448
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.4.0
Reporter: Jack Chen


Top-k filters on a dense_rank() window function return wrong results, due to a 
bug in optimization InferWindowGroupLimit, specifically in the code for 
DenseRankLimitIterator.

Repro:
{code:java}
create or replace temp view t1 (p, o) as values (1, 1), (1, 1), (1, 2), (2, 1), 
(2, 1), (2, 2);

select * from (select *, dense_rank() over (partition by p order by o) as rnk 
from t1) where rnk = 1;{code}
Spark result:
{code:java}
[1,1,1]
[1,1,1]
[2,1,1]{code}
Correct result:
{code:java}
[1,1,1]
[1,1,1]
[2,1,1]
[2,1,1]{code}
 

The bug is in {{{}DenseRankLimitIterator{}}}: it fails to reset state properly 
when transitioning from one window partition to the next. {{reset}} only resets 
{{{}rank = 0{}}}; what it is missing is resetting {{{}currentRankRow = null{}}}. 
This means that when processing the second and later window partitions, the 
rank incorrectly gets incremented based on comparing the ordering of the last 
row of the previous partition to the first row of the new partition.

This means that a dense_rank window func that has more than one window 
partition and more than one row with dense_rank = 1 in the second or later 
partitions can give wrong results when optimized.

({{{}RankLimitIterator{}}} narrowly avoids this bug by happenstance: the first 
row in the new partition will try to increment rank, but increments it by the 
value of count, which is 0, so it happens to work by accident.)

Unfortunately, tests for the optimization only had a single row per rank, so 
they did not catch the bug, as it requires multiple rows per rank.
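A sketch of the implied one-line fix (simplified shape; the names follow this 
description, not necessarily the verbatim Spark source):
{code:java}
import org.apache.spark.sql.catalyst.InternalRow

// Sketch only: a per-partition iterator must clear all of its state in
// reset(), including the last row seen in the previous partition.
abstract class DenseRankLimitIteratorSketch {
  var rank: Int = 0
  var currentRankRow: InternalRow = _

  def reset(): Unit = {
    rank = 0
    // The line missing pre-fix: without it, the new partition's first row
    // is compared against the previous partition's last row when deciding
    // whether to increment rank.
    currentRankRow = null
  }
}
{code}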



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44431) Wrong semantics for null IN (empty list)

2023-07-14 Thread Jack Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jack Chen updated SPARK-44431:
--
Description: 
{{null IN (empty list)}} incorrectly evaluates to null, when it should evaluate 
to false. (The reason it should be false is because a IN (b1, b2) is defined as 
a = b1 OR a = b2, and an empty IN list is treated as an empty OR which is 
false. This is specified by ANSI SQL.)

Many places in Spark execution (In, InSet, InSubquery) and optimization 
(OptimizeIn, NullPropagation) implemented this wrong behavior. Also note that 
the Spark behavior for the null IN (empty list) is inconsistent in some places 
- literal IN lists generally return null (incorrect), while IN/NOT IN 
subqueries mostly return false/true, respectively (correct) in this case.

This is a longstanding correctness issue which has existed since null support 
for IN expressions was first added to Spark.

Doc with more details: 
[https://docs.google.com/document/d/1k8AY8oyT-GI04SnP7eXttPDnDj-Ek-c3luF2zL6DPNU/edit]
 

  was:
{{null IN (empty list)}} incorrectly evaluates to null, when it should evaluate 
to false. (The reason it should be false is because a IN (b1, b2) is defined as 
a = b1 OR a = b2, and an empty IN list is treated as an empty OR which is 
false. This is specified by ANSI SQL.)

Many places in Spark execution (In, InSet, InSubquery) and optimization 
(OptimizeIn, NullPropagation) implemented this wrong behavior. Also note that 
the Spark behavior for the null IN (empty list) is inconsistent in some places 
- literal IN lists generally return null (incorrect), while IN/NOT IN 
subqueries mostly return false/true, respectively (correct) in this case.

This is a longstanding correctness issue which has existed since null support 
for IN expressions was first added to Spark.

Doc with more details: 
https://docs.google.com/document/d/15ttcB3OjGx5_WFKHB2COjQUbFHj5LrfNQv_d26o-wmI/edit


> Wrong semantics for null IN (empty list)
> 
>
> Key: SPARK-44431
> URL: https://issues.apache.org/jira/browse/SPARK-44431
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Jack Chen
>Priority: Major
>
> {{null IN (empty list)}} incorrectly evaluates to null, when it should 
> evaluate to false. (The reason it should be false is because a IN (b1, b2) is 
> defined as a = b1 OR a = b2, and an empty IN list is treated as an empty OR 
> which is false. This is specified by ANSI SQL.)
> Many places in Spark execution (In, InSet, InSubquery) and optimization 
> (OptimizeIn, NullPropagation) implemented this wrong behavior. Also note that 
> the Spark behavior for the null IN (empty list) is inconsistent in some 
> places - literal IN lists generally return null (incorrect), while IN/NOT IN 
> subqueries mostly return false/true, respectively (correct) in this case.
> This is a longstanding correctness issue which has existed since null support 
> for IN expressions was first added to Spark.
> Doc with more details: 
> [https://docs.google.com/document/d/1k8AY8oyT-GI04SnP7eXttPDnDj-Ek-c3luF2zL6DPNU/edit]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44431) Wrong semantics for null IN (empty list)

2023-07-14 Thread Jack Chen (Jira)
Jack Chen created SPARK-44431:
-

 Summary: Wrong semantics for null IN (empty list)
 Key: SPARK-44431
 URL: https://issues.apache.org/jira/browse/SPARK-44431
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.4.0
Reporter: Jack Chen


{{null IN (empty list)}} incorrectly evaluates to null, when it should evaluate 
to false. (The reason it should be false is because a IN (b1, b2) is defined as 
a = b1 OR a = b2, and an empty IN list is treated as an empty OR which is 
false. This is specified by ANSI SQL.)

Many places in Spark execution (In, InSet, InSubquery) and optimization 
(OptimizeIn, NullPropagation) implemented this wrong behavior. Also note that 
the Spark behavior for the null IN (empty list) is inconsistent in some places 
- literal IN lists generally return null (incorrect), while IN/NOT IN 
subqueries mostly return false/true, respectively (correct) in this case.

This is a longstanding correctness issue which has existed since null support 
for IN expressions was first added to Spark.

Doc with more details: 
https://docs.google.com/document/d/15ttcB3OjGx5_WFKHB2COjQUbFHj5LrfNQv_d26o-wmI/edit



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43596) Subquery decorrelation rewriteDomainJoins failure from ConstantFolding to isnull

2023-05-19 Thread Jack Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jack Chen updated SPARK-43596:
--
Description: 
We can get a decorrelation error because of rewrites that run between 
DecorrelateInnerQuery and rewriteDomainJoins and modify the correlation join 
conditions. In particular, ConstantFolding can transform `innercol <=> null` to 
`isnull(innercol)`, and then rewriteDomainJoins does not recognize this and 
throws the error `Unable to rewrite domain join with conditions: 
ArrayBuffer(isnull(innercol#280))`, because the isnull is not an equality, so it 
isn't usable for rewriting the domain join.

Can fix by recognizing `isnull(innercol)` as `innercol <=> null` in 
rewriteDomainJoins.

This area is also fragile in general and other rewrites that run between the 
two steps of decorrelation could potentially break their assumptions, so we may 
want to investigate longer-term follow ups for that.

  was:
We can get a decorrelation error because of rewrites that run between 
DecorrelateInnerQuery and rewriteDomainJoins and modify the correlation join 
conditions. In particular, ConstantFolding can transform `innercol <=> null` to 
`isnull(innercol)`, and then rewriteDomainJoins does not recognize this and 
throws the error `Unable to rewrite domain join with conditions: 
ArrayBuffer(isnull(innercol#280))`, because the isnull is not an equality, so it 
isn't usable for rewriting the domain join.

Can fix by recognizing `isnull(x)` as `x <=> null` in rewriteDomainJoins.

This area is also fragile in general and other rewrites that run between the 
two steps of decorrelation could potentially break their assumptions, so we may 
want to investigate longer-term follow ups for that.


> Subquery decorrelation rewriteDomainJoins failure from ConstantFolding to 
> isnull
> 
>
> Key: SPARK-43596
> URL: https://issues.apache.org/jira/browse/SPARK-43596
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Jack Chen
>Priority: Major
>
> We can get a decorrelation error because of rewrites that run between 
> DecorrelateInnerQuery and rewriteDomainJoins and modify the correlation 
> join conditions. In particular, ConstantFolding can transform `innercol <=> 
> null` to `isnull(innercol)`, and then rewriteDomainJoins does not recognize 
> this and throws the error `Unable to rewrite domain join with conditions: 
> ArrayBuffer(isnull(innercol#280))`, because the isnull is not an equality, so 
> it isn't usable for rewriting the domain join.
> Can fix by recognizing `isnull(innercol)` as `innercol <=> null` in 
> rewriteDomainJoins.
> This area is also fragile in general and other rewrites that run between the 
> two steps of decorrelation could potentially break their assumptions, so we 
> may want to investigate longer-term follow ups for that.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43596) Subquery decorrelation rewriteDomainJoins failure from ConstantFolding to isnull

2023-05-19 Thread Jack Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jack Chen updated SPARK-43596:
--
Description: 
We can get a decorrelation error because of rewrites that run between 
DecorrelateInnerQuery and rewriteDomainJoins and modify the correlation join 
conditions. In particular, ConstantFolding can transform `innercol <=> null` to 
`isnull(innercol)`, and then rewriteDomainJoins does not recognize this and 
throws the error `Unable to rewrite domain join with conditions: 
ArrayBuffer(isnull(innercol#280))`, because the isnull is not an equality, so it 
isn't usable for rewriting the domain join.

Can fix by recognizing `isnull(x)` as `x <=> null` in rewriteDomainJoins.

This area is also fragile in general and other rewrites that run between the 
two steps of decorrelation could potentially break their assumptions, so we may 
want to investigate longer-term follow ups for that.

  was:
We can get a decorrelation error because of rewrites that run between 
DecorrelateInnerQuery and rewriteDomainJoins and modify the correlation join 
conditions. In particular, ConstantFolding can transform `innercol <=> null` to 
`isnull(innercol)`, and then rewriteDomainJoins does not recognize this and 
throws the error `Unable to rewrite domain join with conditions: 
ArrayBuffer(isnull(innercol#280))`, because the isnull is not an equality, so it 
isn't usable for rewriting the domain join.

Can fix by recognizing `isnull\(x)` as `x <=> null` in rewriteDomainJoins.

This area is also fragile in general and other rewrites that run between the 
two steps of decorrelation could potentially break their assumptions, so we may 
want to investigate longer-term follow ups for that.


> Subquery decorrelation rewriteDomainJoins failure from ConstantFolding to 
> isnull
> 
>
> Key: SPARK-43596
> URL: https://issues.apache.org/jira/browse/SPARK-43596
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Jack Chen
>Priority: Major
>
> We can get a decorrelation error because of rewrites that run between 
> DecorrelateInnerQuery and rewriteDomainJoins and modify the correlation 
> join conditions. In particular, ConstantFolding can transform `innercol <=> 
> null` to `isnull(innercol)`, and then rewriteDomainJoins does not recognize 
> this and throws the error `Unable to rewrite domain join with conditions: 
> ArrayBuffer(isnull(innercol#280))`, because the isnull is not an equality, so 
> it isn't usable for rewriting the domain join.
> Can fix by recognizing `isnull(x)` as `x <=> null` in rewriteDomainJoins.
> This area is also fragile in general and other rewrites that run between the 
> two steps of decorrelation could potentially break their assumptions, so we 
> may want to investigate longer-term follow ups for that.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43596) Subquery decorrelation rewriteDomainJoins failure from ConstantFolding to isnull

2023-05-19 Thread Jack Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jack Chen updated SPARK-43596:
--
Description: 
We can get a decorrelation error because of rewrites that run between 
DecorrelateInnerQuery and rewriteDomainJoins and modify the correlation join 
conditions. In particular, ConstantFolding can transform `innercol <=> null` to 
`isnull(innercol)`, and then rewriteDomainJoins does not recognize this and 
throws the error `Unable to rewrite domain join with conditions: 
ArrayBuffer(isnull(innercol#280))`, because the isnull is not an equality, so it 
isn't usable for rewriting the domain join.

Can fix by recognizing `isnull(x)` as `x <=> null` in rewriteDomainJoins.

This area is also fragile in general and other rewrites that run between the 
two steps of decorrelation could potentially break their assumptions, so we may 
want to investigate longer-term follow ups for that.

  was:
We can get a decorrelation error because of rewrites that run between 
DecorrelateInnerQuery and rewriteDomainJoins and modify the correlation join 
conditions. In particular, ConstantFolding can transform `innercol <=> null` to 
`isnull(innercol)`, and then rewriteDomainJoins does not recognize this and 
throws the error `Unable to rewrite domain join with conditions: 
ArrayBuffer(isnull(innercol#280))`, because the isnull is not an equality, so it 
isn't usable for rewriting the domain join.

Can fix by recognizing isnull(x) as x <=> null in rewriteDomainJoins.

This area is also fragile in general and other rewrites that run between the 
two steps of decorrelation could potentially break their assumptions, so we may 
want to investigate longer-term follow ups for that.


> Subquery decorrelation rewriteDomainJoins failure from ConstantFolding to 
> isnull
> 
>
> Key: SPARK-43596
> URL: https://issues.apache.org/jira/browse/SPARK-43596
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Jack Chen
>Priority: Major
>
> We can get a decorrelation error because of rewrites that run between 
> DecorrelateInnerQuery and rewriteDomainJoins and modify the correlation 
> join conditions. In particular, ConstantFolding can transform `innercol <=> 
> null` to `isnull(innercol)`, and then rewriteDomainJoins does not recognize 
> this and throws the error `Unable to rewrite domain join with conditions: 
> ArrayBuffer(isnull(innercol#280))`, because the isnull is not an equality, so 
> it isn't usable for rewriting the domain join.
> Can fix by recognizing `isnull(x)` as `x <=> null` in rewriteDomainJoins.
> This area is also fragile in general and other rewrites that run between the 
> two steps of decorrelation could potentially break their assumptions, so we 
> may want to investigate longer-term follow ups for that.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43596) Subquery decorrelation rewriteDomainJoins failure from ConstantFolding to isnull

2023-05-19 Thread Jack Chen (Jira)
Jack Chen created SPARK-43596:
-

 Summary: Subquery decorrelation rewriteDomainJoins failure from 
ConstantFolding to isnull
 Key: SPARK-43596
 URL: https://issues.apache.org/jira/browse/SPARK-43596
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.4.0
Reporter: Jack Chen


We can get a decorrelation error because of rewrites that run between 
DecorrelateInnerQuery and rewriteDomainJoins and modify the correlation join 
conditions. In particular, ConstantFolding can transform `innercol <=> null` to 
`isnull(innercol)`, and then rewriteDomainJoins does not recognize this and 
throws the error `Unable to rewrite domain join with conditions: 
ArrayBuffer(isnull(innercol#280))`, because the isnull is not an equality, so it 
isn't usable for rewriting the domain join.

Can fix by recognizing isnull(x) as x <=> null in rewriteDomainJoins.

This area is also fragile in general and other rewrites that run between the 
two steps of decorrelation could potentially break their assumptions, so we may 
want to investigate longer-term follow ups for that.
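One way to express the recognition step (a hedged sketch; the helper below is 
invented for illustration and is not the actual patch):
{code:java}
import org.apache.spark.sql.catalyst.expressions.{EqualNullSafe, Expression, IsNull, Literal}

// Sketch only: undo ConstantFolding's rewrite by treating isnull(innercol)
// as the equivalent innercol <=> null, which the domain join rewrite can use.
def normalizeDomainJoinCond(cond: Expression): Expression = cond match {
  case IsNull(child) => EqualNullSafe(child, Literal(null, child.dataType))
  case other => other
}
{code}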



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43596) Subquery decorrelation rewriteDomainJoins failure from ConstantFolding to isnull

2023-05-19 Thread Jack Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jack Chen updated SPARK-43596:
--
Description: 
We can get a decorrelation error because of rewrites that run between 
DecorrelateInnerQuery and rewriteDomainJoins and modify the correlation join 
conditions. In particular, ConstantFolding can transform `innercol <=> null` to 
`isnull(innercol)`, and then rewriteDomainJoins does not recognize this and 
throws the error `Unable to rewrite domain join with conditions: 
ArrayBuffer(isnull(innercol#280))`, because the isnull is not an equality, so it 
isn't usable for rewriting the domain join.

Can fix by recognizing `isnull\(x)` as `x <=> null` in rewriteDomainJoins.

This area is also fragile in general and other rewrites that run between the 
two steps of decorrelation could potentially break their assumptions, so we may 
want to investigate longer-term follow ups for that.

  was:
We can get a decorrelation error because of rewrites that run between 
DecorrelateInnerQuery and rewriteDomainJoins and modify the correlation join 
conditions. In particular, ConstantFolding can transform `innercol <=> null` to 
`isnull(innercol)`, and then rewriteDomainJoins does not recognize this and 
throws the error `Unable to rewrite domain join with conditions: 
ArrayBuffer(isnull(innercol#280))`, because the isnull is not an equality, so it 
isn't usable for rewriting the domain join.

Can fix by recognizing `isnull(x)` as `x <=> null` in rewriteDomainJoins.

This area is also fragile in general and other rewrites that run between the 
two steps of decorrelation could potentially break their assumptions, so we may 
want to investigate longer-term follow ups for that.


> Subquery decorrelation rewriteDomainJoins failure from ConstantFolding to 
> isnull
> 
>
> Key: SPARK-43596
> URL: https://issues.apache.org/jira/browse/SPARK-43596
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Jack Chen
>Priority: Major
>
> We can get a decorrelation error because of rewrites that run between 
> DecorrelateInnerQuery and rewriteDomainJoins and modify the correlation 
> join conditions. In particular, ConstantFolding can transform `innercol <=> 
> null` to `isnull(innercol)`, and then rewriteDomainJoins does not recognize 
> this and throws the error `Unable to rewrite domain join with conditions: 
> ArrayBuffer(isnull(innercol#280))`, because the isnull is not an equality, so 
> it isn't usable for rewriting the domain join.
> Can fix by recognizing `isnull\(x)` as `x <=> null` in rewriteDomainJoins.
> This area is also fragile in general and other rewrites that run between the 
> two steps of decorrelation could potentially break their assumptions, so we 
> may want to investigate longer-term follow ups for that.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43413) IN subquery ListQuery has wrong nullability

2023-05-08 Thread Jack Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jack Chen updated SPARK-43413:
--
Description: 
IN subquery expressions are incorrectly marked as non-nullable, even when they 
are actually nullable. They correctly check the nullability of the 
left-hand side, but the right-hand side of an IN subquery, the ListQuery, is 
currently always defined with nullability = false. This is incorrect and can 
lead to incorrect query transformations.

Example: (non_nullable_col IN (select nullable_col)) <=> TRUE. Here the IN 
expression returns NULL when nullable_col is null, but our code marks it as 
non-nullable, and therefore SimplifyBinaryComparison transforms away the <=> 
TRUE, rewriting the expression to non_nullable_col IN (select nullable_col), 
which is an incorrect transformation because NULL values of nullable_col now 
cause the expression to yield NULL instead of FALSE.

This bug can potentially lead to wrong results, but in most cases this doesn't 
directly cause wrong results end-to-end, because IN subqueries are almost 
always transformed to semi/anti/existence joins in RewritePredicateSubquery, 
and this rewrite can also incorrectly discard NULLs, which is another bug. But 
we can observe it causing wrong behavior in unit tests, and it could easily 
lead to incorrect query results if there are changes to the surrounding 
context, so it should be fixed regardless.

This is a long-standing bug that has existed at least since 2016, as long as 
the ListQuery class has existed.

  was:
IN subquery expressions are incorrectly always marked as non-nullable, even 
when they are actually nullable. They correctly check the nullability of the 
left-hand side, but the right-hand side of an IN subquery, the ListQuery, is 
currently always defined with nullability = false. This is incorrect and can 
lead to incorrect query transformations.

Example: (non_nullable_col IN (select nullable_col)) <=> TRUE. Here the IN 
expression returns NULL when nullable_col is null, but our code marks it as 
non-nullable, and therefore SimplifyBinaryComparison transforms away the <=> 
TRUE, rewriting the expression to non_nullable_col IN (select nullable_col), 
which is an incorrect transformation because NULL values of nullable_col now 
cause the expression to yield NULL instead of FALSE.

This bug can potentially lead to wrong results, but in most cases this doesn't 
directly cause wrong results end-to-end, because IN subqueries are almost 
always transformed to semi/anti/existence joins in RewritePredicateSubquery, 
and this rewrite can also incorrectly discard NULLs, which is another bug. But 
we can observe it causing wrong behavior in unit tests, and it could easily 
lead to incorrect query results if there are changes to the surrounding 
context, so it should be fixed regardless.

This is a long-standing bug that has existed at least since 2016, as long as 
the ListQuery class has existed.


> IN subquery ListQuery has wrong nullability
> ---
>
> Key: SPARK-43413
> URL: https://issues.apache.org/jira/browse/SPARK-43413
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Jack Chen
>Priority: Major
>
> IN subquery expressions are incorrectly marked as non-nullable, even when 
> they are actually nullable. They correctly check the nullability of the 
> left-hand side, but the right-hand side of an IN subquery, the ListQuery, is 
> currently always defined with nullability = false. This is incorrect and can 
> lead to incorrect query transformations.
> Example: (non_nullable_col IN (select nullable_col)) <=> TRUE. Here the IN 
> expression returns NULL when nullable_col is null, but our code marks it 
> as non-nullable, and therefore SimplifyBinaryComparison transforms away the 
> <=> TRUE, rewriting the expression to non_nullable_col IN (select 
> nullable_col), which is an incorrect transformation because NULL values of 
> nullable_col now cause the expression to yield NULL instead of FALSE.
> This bug can potentially lead to wrong results, but in most cases this 
> doesn't directly cause wrong results end-to-end, because IN subqueries are 
> almost always transformed to semi/anti/existence joins in 
> RewritePredicateSubquery, and this rewrite can also incorrectly discard 
> NULLs, which is another bug. But we can observe it causing wrong behavior in 
> unit tests, and it could easily lead to incorrect query results if there are 
> changes to the surrounding context, so it should be fixed regardless.
> This is a long-standing bug that has existed at least since 2016, as long as 
> the ListQuery class has existed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (SPARK-43413) IN subquery ListQuery has wrong nullability

2023-05-08 Thread Jack Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jack Chen updated SPARK-43413:
--
Description: 
IN subquery expressions are incorrectly always marked as non-nullable, even 
when they are actually nullable. They correctly check the nullability of the 
left-hand side, but the right-hand side of an IN subquery, the ListQuery, is 
currently always defined with nullability = false. This is incorrect and can 
lead to incorrect query transformations.

Example: (non_nullable_col IN (select nullable_col)) <=> TRUE. Here the IN 
expression returns NULL when nullable_col is null, but our code marks it as 
non-nullable, and therefore SimplifyBinaryComparison transforms away the <=> 
TRUE, rewriting the expression to non_nullable_col IN (select nullable_col), 
which is an incorrect transformation because NULL values of nullable_col now 
cause the expression to yield NULL instead of FALSE.

This bug can potentially lead to wrong results, but in most cases this doesn't 
directly cause wrong results end-to-end, because IN subqueries are almost 
always transformed to semi/anti/existence joins in RewritePredicateSubquery, 
and this rewrite can also incorrectly discard NULLs, which is another bug. But 
we can observe it causing wrong behavior in unit tests, and it could easily 
lead to incorrect query results if there are changes to the surrounding 
context, so it should be fixed regardless.

This is a long-standing bug that has existed at least since 2016, as long as 
the ListQuery class has existed.

  was:
IN subquery expressions currently are marked as nullable if and only if the 
left-hand side is nullable, because the right-hand side of an IN subquery, the 
ListQuery, is currently always defined with nullability = false. This is 
incorrect and can lead to incorrect query transformations.

Example: (non_nullable_col IN (select nullable_col)) <=> TRUE. Here the IN 
expression returns NULL when nullable_col is null, but our code marks it as 
non-nullable, and therefore SimplifyBinaryComparison transforms away the <=> 
TRUE, rewriting the expression to non_nullable_col IN (select nullable_col), 
which is an incorrect transformation because NULL values of nullable_col now 
cause the expression to yield NULL instead of FALSE.

This bug can potentially lead to wrong results, but in most cases this doesn't 
directly cause wrong results end-to-end, because IN subqueries are almost 
always transformed to semi/anti/existence joins in RewritePredicateSubquery, 
and this rewrite can also incorrectly discard NULLs, which is another bug. But 
we can observe it causing wrong behavior in unit tests, and it could easily 
lead to incorrect query results if there are changes to the surrounding 
context, so it should be fixed regardless.

This is a long-standing bug that has existed at least since 2016, as long as 
the ListQuery class has existed.


> IN subquery ListQuery has wrong nullability
> ---
>
> Key: SPARK-43413
> URL: https://issues.apache.org/jira/browse/SPARK-43413
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Jack Chen
>Priority: Major
>
> IN subquery expressions are incorrectly always marked as non-nullable, even 
> when they are actually nullable. They correctly check the nullability of the 
> left-hand side, but the right-hand side of an IN subquery, the ListQuery, is 
> currently always defined with nullability = false. This is incorrect and can 
> lead to incorrect query transformations.
> Example: (non_nullable_col IN (select nullable_col)) <=> TRUE. Here the IN 
> expression returns NULL when nullable_col is null, but our code marks it 
> as non-nullable, and therefore SimplifyBinaryComparison transforms away the 
> <=> TRUE, rewriting the expression to non_nullable_col IN (select 
> nullable_col), which is an incorrect transformation because NULL values of 
> nullable_col now cause the expression to yield NULL instead of FALSE.
> This bug can potentially lead to wrong results, but in most cases this 
> doesn't directly cause wrong results end-to-end, because IN subqueries are 
> almost always transformed to semi/anti/existence joins in 
> RewritePredicateSubquery, and this rewrite can also incorrectly discard 
> NULLs, which is another bug. But we can observe it causing wrong behavior in 
> unit tests, and it could easily lead to incorrect query results if there are 
> changes to the surrounding context, so it should be fixed regardless.
> This is a long-standing bug that has existed at least since 2016, as long as 
> the ListQuery class has existed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-43413) IN subquery ListQuery has wrong nullability

2023-05-08 Thread Jack Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jack Chen updated SPARK-43413:
--
Description: 
IN subquery expressions currently are marked as nullable if and only if the 
left-hand side is nullable, because the right-hand side of an IN subquery, the 
ListQuery, is currently always defined with nullability = false. This is 
incorrect and can lead to incorrect query transformations.

Example: (non_nullable_col IN (select nullable_col)) <=> TRUE. Here the IN 
expression returns NULL when nullable_col is null, but our code marks it as 
non-nullable, and therefore SimplifyBinaryComparison transforms away the <=> 
TRUE, rewriting the expression to non_nullable_col IN (select nullable_col), 
which is an incorrect transformation because NULL values of nullable_col now 
cause the expression to yield NULL instead of FALSE.

This bug can potentially lead to wrong results, but in most cases this doesn't 
directly cause wrong results end-to-end, because IN subqueries are almost 
always transformed to semi/anti/existence joins in RewritePredicateSubquery, 
and this rewrite can also incorrectly discard NULLs, which is another bug. But 
we can observe it causing wrong behavior in unit tests, and it could easily 
lead to incorrect query results if there are changes to the surrounding 
context, so it should be fixed regardless.

This is a long-standing bug that has existed at least since 2016, as long as 
the ListQuery class has existed.

  was:
IN subquery expressions currently are marked as nullable if and only if the 
left-hand side is nullable, because the right-hand side of an IN subquery, the 
ListQuery, is currently always defined with nullability = false. This is 
incorrect and can lead to incorrect query transformations.

Example: (non_nullable_col IN (select nullable_col)) <=> TRUE. Here the IN 
expression returns NULL when nullable_col is null, but our code marks it as 
non-nullable, and therefore SimplifyBinaryComparison transforms away the <=> 
TRUE, rewriting the expression to non_nullable_col IN (select nullable_col), 
which is an incorrect transformation because NULL values of nullable_col now 
cause the expression to yield NULL instead of FALSE.

This is a long-standing bug that has existed at least since 2016, as long as 
the ListQuery class has existed.


> IN subquery ListQuery has wrong nullability
> ---
>
> Key: SPARK-43413
> URL: https://issues.apache.org/jira/browse/SPARK-43413
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Jack Chen
>Priority: Major
>
> IN subquery expressions currently are marked as nullable if and only if the 
> left-hand side is nullable, because the right-hand side of an IN subquery, 
> the ListQuery, is currently always defined with nullability = false. This is 
> incorrect and can lead to incorrect query transformations.
> Example: (non_nullable_col IN (select nullable_col)) <=> TRUE. Here the IN 
> expression returns NULL when nullable_col is null, but our code marks it 
> as non-nullable, and therefore SimplifyBinaryComparison transforms away the 
> <=> TRUE, rewriting the expression to non_nullable_col IN (select 
> nullable_col), which is an incorrect transformation because NULL values of 
> nullable_col now cause the expression to yield NULL instead of FALSE.
> This bug can potentially lead to wrong results, but in most cases this 
> doesn't directly cause wrong results end-to-end, because IN subqueries are 
> almost always transformed to semi/anti/existence joins in 
> RewritePredicateSubquery, and this rewrite can also incorrectly discard 
> NULLs, which is another bug. But we can observe it causing wrong behavior in 
> unit tests, and it could easily lead to incorrect query results if there are 
> changes to the surrounding context, so it should be fixed regardless.
> This is a long-standing bug that has existed at least since 2016, as long as 
> the ListQuery class has existed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43413) IN subquery ListQuery has wrong nullability

2023-05-08 Thread Jack Chen (Jira)
Jack Chen created SPARK-43413:
-

 Summary: IN subquery ListQuery has wrong nullability
 Key: SPARK-43413
 URL: https://issues.apache.org/jira/browse/SPARK-43413
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.4.0
Reporter: Jack Chen


IN subquery expressions currently are marked as nullable if and only if the 
left-hand side is nullable, because the right-hand side of an IN subquery, the 
ListQuery, is currently always defined with nullability = false. This is 
incorrect and can lead to incorrect query transformations.

Example: (non_nullable_col IN (select nullable_col)) <=> TRUE. Here the IN 
expression returns NULL when nullable_col is null, but our code marks it as 
non-nullable, and therefore SimplifyBinaryComparison transforms away the <=> 
TRUE, rewriting the expression to non_nullable_col IN (select nullable_col), 
which is an incorrect transformation because NULL values of nullable_col now 
cause the expression to yield NULL instead of FALSE.

This is a long-standing bug that has existed at least since 2016, as long as 
the ListQuery class has existed.
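The nullability at issue can be checked with a plain literal list, which Spark 
evaluates correctly (a quick sketch; the bug is only about the declared 
nullability of the subquery form):
{code:java}
// Even with a non-nullable left-hand side, an IN over values containing
// NULL is itself nullable:
spark.sql("select 2 in (1, null)").show()             // NULL (unknown)
spark.sql("select (2 in (1, null)) <=> true").show()  // false
{code}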



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43156) Correctness COUNT bug in correlated scalar subselect with `COUNT(*) is null`

2023-04-26 Thread Jack Chen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17716952#comment-17716952
 ] 

Jack Chen commented on SPARK-43156:
---

My above pull request only added a related test. Hisoka-X's PR 
[https://github.com/apache/spark/pull/40865] is in progress with the fix.

> Correctness COUNT bug in correlated scalar subselect with `COUNT(*) is null`
> 
>
> Key: SPARK-43156
> URL: https://issues.apache.org/jira/browse/SPARK-43156
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Jack Chen
>Assignee: Jack Chen
>Priority: Major
> Fix For: 3.5.0
>
>
> Example query:
> {code:java}
> spark.sql("select *, (select (count(1)) is null from t1 where t0.a = t1.c) 
> from t0").collect()
> res6: Array[org.apache.spark.sql.Row] = Array([1,1.0,null], [2,2.0,false])  
> {code}
> In this subquery, count(1) always evaluates to a non-null integer value, so 
> count(1) is null is always false. The correct evaluation of the subquery is 
> always false.
> We incorrectly evaluate it to null for empty groups. The reason is that 
> NullPropagation rewrites Aggregate [c] [isnull(count(1))] to Aggregate [c] 
> [false] - this rewrite would be correct normally, but in the context of a 
> scalar subquery it breaks our count bug handling in 
> RewriteCorrelatedScalarSubquery.constructLeftJoins. By the time we get 
> there, the query appears to not have the count bug - it looks the same as if 
> the original query had a subquery with select any_value(false) from r..., and 
> that case is _not_ subject to the count bug.
>  
> Postgres comparison shows the correct always-false result: 
> [http://sqlfiddle.com/#!17/67822/5]
> DDL for the example:
> {code:java}
> create or replace temp view t0 (a, b)
> as values
>     (1, 1.0),
>     (2, 2.0);
> create or replace temp view t1 (c, d)
> as values
>     (2, 3.0); {code}
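The invariant that makes the fold tempting can be checked directly (a sketch 
using Spark's built-in range table function):
{code:java}
// A global count over empty input is 0, never NULL, so isnull(count(1)) is
// false -- which is what NullPropagation exploits. Inside a correlated scalar
// subquery, though, the empty-group case surfaces as a NULL produced by
// constructLeftJoins' left outer join, and the folded plan no longer patches
// it back to false.
spark.sql("select isnull(count(1)) from range(10) where false").show()  // false
{code}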



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-43156) Correctness COUNT bug in correlated scalar subselect with `COUNT(*) is null`

2023-04-26 Thread Jack Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jack Chen reopened SPARK-43156:
---

> Correctness COUNT bug in correlated scalar subselect with `COUNT(*) is null`
> 
>
> Key: SPARK-43156
> URL: https://issues.apache.org/jira/browse/SPARK-43156
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Jack Chen
>Assignee: Jack Chen
>Priority: Major
> Fix For: 3.5.0
>
>
> Example query:
> {code:java}
> spark.sql("select *, (select (count(1)) is null from t1 where t0.a = t1.c) 
> from t0").collect()
> res6: Array[org.apache.spark.sql.Row] = Array([1,1.0,null], [2,2.0,false])  
> {code}
> In this subquery, count(1) always evaluates to a non-null integer value, so 
> count(1) is null is always false. The correct evaluation of the subquery is 
> always false.
> We incorrectly evaluate it to null for empty groups. The reason is that 
> NullPropagation rewrites Aggregate [c] [isnull(count(1))] to Aggregate [c] 
> [false] - this rewrite would be correct normally, but in the context of a 
> scalar subquery it breaks our count bug handling in 
> RewriteCorrelatedScalarSubquery.constructLeftJoins. By the time we get 
> there, the query appears to not have the count bug - it looks the same as if 
> the original query had a subquery with select any_value(false) from r..., and 
> that case is _not_ subject to the count bug.
>  
> Postgres comparison shows the correct always-false result: 
> [http://sqlfiddle.com/#!17/67822/5]
> DDL for the example:
> {code:java}
> create or replace temp view t0 (a, b)
> as values
>     (1, 1.0),
>     (2, 2.0);
> create or replace temp view t1 (c, d)
> as values
>     (2, 3.0); {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43156) Correctness COUNT bug in correlated scalar subselect with `COUNT(*) is null`

2023-04-16 Thread Jack Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jack Chen updated SPARK-43156:
--
Description: 
Example query:
{code:java}
spark.sql("select *, (select (count(1)) is null from t1 where t0.a = t1.c) from 
t0").collect()
res6: Array[org.apache.spark.sql.Row] = Array([1,1.0,null], [2,2.0,false])  
{code}
In this subquery, count(1) always evaluates to a non-null integer value, so 
count(1) is null is always false. The correct evaluation of the subquery is 
always false.

We incorrectly evaluate it to null for empty groups. The reason is that 
NullPropagation rewrites Aggregate [c] [isnull(count(1))] to Aggregate [c] 
[false] - this rewrite would be correct normally, but in the context of a 
scalar subquery it breaks our count bug handling in 
RewriteCorrelatedScalarSubquery.constructLeftJoins. By the time we get there, 
the query appears to not have the count bug - it looks the same as if the 
original query had a subquery with select any_value(false) from r..., and that 
case is _not_ subject to the count bug.

 

Postgres comparison shows the correct always-false result: 
[http://sqlfiddle.com/#!17/67822/5]

DDL for the example:
{code:java}
create or replace temp view t0 (a, b)
as values
    (1, 1.0),
    (2, 2.0);
create or replace temp view t1 (c, d)
as values
    (2, 3.0); {code}

  was:
Example query:
{code:java}
spark.sql("select *, (select (count(1)) is null from t1 where t0.a = t1.c) from 
t0").collect()
res6: Array[org.apache.spark.sql.Row] = Array([1,1.0,null], [2,2.0,false])  
{code}
In this subquery, count(1) always evaluates to a non-null integer value, so 
count(1) is null is always false. The correct evaluation of the subquery is 
always false.

We incorrectly evaluate it to null for empty groups. The reason is that 
NullPropagation rewrites Aggregate [c] [isnull(count(1))] to Aggregate [c] 
[false] - this rewrite would normally be correct, but in the context of a 
scalar subquery it breaks our count bug handling in 
RewriteCorrelatedScalarSubquery.constructLeftJoins. By the time we get there, 
the query appears not to have the count bug - it looks the same as if the 
original query had a subquery with select any_value(false) from r..., and that 
case is _not_ subject to the count bug.

 

Postgres comparison shows the correct always-false result: 
[http://sqlfiddle.com/#!17/67822/5]

DDL:
{code:java}
create or replace temp view t0 (a, b)
as values
    (1, 1.0),
    (2, 2.0);
create or replace temp view t1 (c, d)
as values
    (2, 3.0); {code}


> Correctness COUNT bug in correlated scalar subselect with `COUNT(*) is null`
> 
>
> Key: SPARK-43156
> URL: https://issues.apache.org/jira/browse/SPARK-43156
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Jack Chen
>Priority: Major
>
> Example query:
> {code:java}
> spark.sql("select *, (select (count(1)) is null from t1 where t0.a = t1.c) 
> from t0").collect()
> res6: Array[org.apache.spark.sql.Row] = Array([1,1.0,null], [2,2.0,false])  
> {code}
> In this subquery, count(1) always evaluates to a non-null integer value, so 
> count(1) is null is always false. The correct evaluation of the subquery is 
> always false.
> We incorrectly evaluate it to null for empty groups. The reason is that 
> NullPropagation rewrites Aggregate [c] [isnull(count(1))] to Aggregate [c] 
> [false] - this rewrite would normally be correct, but in the context of a 
> scalar subquery it breaks our count bug handling in 
> RewriteCorrelatedScalarSubquery.constructLeftJoins. By the time we get 
> there, the query appears not to have the count bug - it looks the same as if 
> the original query had a subquery with select any_value(false) from r..., and 
> that case is _not_ subject to the count bug.
>  
> Postgres comparison shows the correct always-false result: 
> [http://sqlfiddle.com/#!17/67822/5]
> DDL for the example:
> {code:java}
> create or replace temp view t0 (a, b)
> as values
>     (1, 1.0),
>     (2, 2.0);
> create or replace temp view t1 (c, d)
> as values
>     (2, 3.0); {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43156) Correctness COUNT bug in correlated scalar subselect with `COUNT(*) is null`

2023-04-16 Thread Jack Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jack Chen updated SPARK-43156:
--
Description: 
Example query:
{code:java}
spark.sql("select *, (select (count(1)) is null from t1 where t0.a = t1.c) from 
t0").collect()
res6: Array[org.apache.spark.sql.Row] = Array([1,1.0,null], [2,2.0,false])  
{code}
In this subquery, count(1) always evaluates to a non-null integer value, so 
count(1) is null is always false. The correct evaluation of the subquery is 
always false.

We incorrectly evaluate it to null for empty groups. The reason is that 
NullPropagation rewrites Aggregate [c] [isnull(count(1))] to Aggregate [c] 
[false] - this rewrite would normally be correct, but in the context of a 
scalar subquery it breaks our count bug handling in 
RewriteCorrelatedScalarSubquery.constructLeftJoins. By the time we get there, 
the query appears not to have the count bug - it looks the same as if the 
original query had a subquery with select any_value(false) from r..., and that 
case is _not_ subject to the count bug.

 

Postgres comparison shows the correct always-false result: 
[http://sqlfiddle.com/#!17/67822/5]

DDL:
{code:java}
create or replace temp view t0 (a, b)
as values
    (1, 1.0),
    (2, 2.0);
create or replace temp view t1 (c, d)
as values
    (2, 3.0); {code}

  was:
Example query:
{code:java}
spark.sql("select *, (select (count(*)) is null from t1 where t0.a = t1.c) from 
t0").collect()
res6: Array[org.apache.spark.sql.Row] = Array([1,1.0,null], [2,2.0,false])  
{code}
In this subquery, count(*) always evaluates to a non-null integer value, so 
count(*) is null is always false. The correct evaluation of the subquery is 
always false.

We incorrectly evaluate it to null for empty groups. The reason is that 
NullPropagation rewrites Aggregate [c] [isnull(count(1))] to Aggregate [c] 
[false] - this rewrite would normally be correct, but in the context of a 
scalar subquery it breaks our count bug handling in 
RewriteCorrelatedScalarSubquery.constructLeftJoins. By the time we get there, 
the query appears not to have the count bug - it looks the same as if the 
original query had a subquery with select any_value(false) from r..., and that 
case is _not_ subject to the count bug.

 

Postgres comparison shows the correct always-false result: 
[http://sqlfiddle.com/#!17/67822/5]

DDL:
{code:java}
create or replace temp view t0 (a, b)
as values
    (1, 1.0),
    (2, 2.0);
create or replace temp view t1 (c, d)
as values
    (2, 3.0); {code}


> Correctness COUNT bug in correlated scalar subselect with `COUNT(*) is null`
> 
>
> Key: SPARK-43156
> URL: https://issues.apache.org/jira/browse/SPARK-43156
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Jack Chen
>Priority: Major
>
> Example query:
> {code:java}
> spark.sql("select *, (select (count(1)) is null from t1 where t0.a = t1.c) 
> from t0").collect()
> res6: Array[org.apache.spark.sql.Row] = Array([1,1.0,null], [2,2.0,false])  
> {code}
> In this subquery, count(1) always evaluates to a non-null integer value, so 
> count(1) is null is always false. The correct evaluation of the subquery is 
> always false.
> We incorrectly evaluate it to null for empty groups. The reason is that 
> NullPropagation rewrites Aggregate [c] [isnull(count(1))] to Aggregate [c] 
> [false] - this rewrite would normally be correct, but in the context of a 
> scalar subquery it breaks our count bug handling in 
> RewriteCorrelatedScalarSubquery.constructLeftJoins. By the time we get 
> there, the query appears not to have the count bug - it looks the same as if 
> the original query had a subquery with select any_value(false) from r..., and 
> that case is _not_ subject to the count bug.
>  
> Postgres comparison shows the correct always-false result: 
> [http://sqlfiddle.com/#!17/67822/5]
> DDL:
> {code:java}
> create or replace temp view t0 (a, b)
> as values
>     (1, 1.0),
>     (2, 2.0);
> create or replace temp view t1 (c, d)
> as values
>     (2, 3.0); {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43156) Correctness COUNT bug in correlated scalar subselect with `COUNT(*) is null`

2023-04-16 Thread Jack Chen (Jira)
Jack Chen created SPARK-43156:
-

 Summary: Correctness COUNT bug in correlated scalar subselect with 
`COUNT(*) is null`
 Key: SPARK-43156
 URL: https://issues.apache.org/jira/browse/SPARK-43156
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.0
Reporter: Jack Chen


Example query:
{code:java}
spark.sql("select *, (select (count(*)) is null from t1 where t0.a = t1.c) from 
t0").collect()
res6: Array[org.apache.spark.sql.Row] = Array([1,1.0,null], [2,2.0,false])  
{code}
In this subquery, count(*) always evaluates to a non-null integer value, so 
count(*) is null is always false. The correct evaluation of the subquery is 
always false.

We incorrectly evaluate it to null for empty groups. The reason is that 
NullPropagation rewrites Aggregate [c] [isnull(count(1))] to Aggregate [c] 
[false] - this rewrite would normally be correct, but in the context of a 
scalar subquery it breaks our count bug handling in 
RewriteCorrelatedScalarSubquery.constructLeftJoins. By the time we get there, 
the query appears not to have the count bug - it looks the same as if the 
original query had a subquery with select any_value(false) from r..., and that 
case is _not_ subject to the count bug.

 

Postgres comparison shows the correct always-false result: 
[http://sqlfiddle.com/#!17/67822/5]

DDL:
{code:java}
create or replace temp view t0 (a, b)
as values
    (1, 1.0),
    (2, 2.0);
create or replace temp view t1 (c, d)
as values
    (2, 3.0); {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43098) Should not handle the COUNT bug when the GROUP BY clause of a correlated scalar subquery is non-empty

2023-04-11 Thread Jack Chen (Jira)
Jack Chen created SPARK-43098:
-

 Summary: Should not handle the COUNT bug when the GROUP BY clause 
of a correlated scalar subquery is non-empty
 Key: SPARK-43098
 URL: https://issues.apache.org/jira/browse/SPARK-43098
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: Jack Chen


From [~allisonwang-db]:

There is no COUNT bug when the correlated equality predicates are also in the 
group-by clause. However, the current logic to handle the COUNT bug still adds 
a default aggregate function value and returns incorrect results.

 
{code:java}
create view t1(c1, c2) as values (0, 1), (1, 2);
create view t2(c1, c2) as values (0, 2), (0, 3);

select c1, c2, (select count(*) from t2 where t1.c1 = t2.c1 group by c1) from 
t1;

-- Correct answer: [(0, 1, 2), (1, 2, null)]
+---+---+------------------+
|c1 |c2 |scalarsubquery(c1)|
+---+---+------------------+
|0  |1  |2                 |
|1  |2  |0                 |
+---+---+------------------+
 {code}
 

This bug affects scalar subqueries in RewriteCorrelatedScalarSubquery, but 
lateral subqueries handle it correctly in DecorrelateInnerQuery. Related: 
https://issues.apache.org/jira/browse/SPARK-36113 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-2901) it's a test

2014-08-07 Thread Jack chen (JIRA)
Jack chen created SPARK-2901:


 Summary: it's a test
 Key: SPARK-2901
 URL: https://issues.apache.org/jira/browse/SPARK-2901
 Project: Spark
  Issue Type: Bug
Reporter: Jack chen
Priority: Trivial






--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org