[ 
https://issues.apache.org/jira/browse/SPARK-48501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jack Chen updated SPARK-48501:
------------------------------
    Description: 
Currently, if a scalar subquery's result is not aggregated or limited to one row, we throw 
the error {{{}Correlated scalar subqueries must be aggregated{}}}.

This check is often too restrictive: there are many cases where the query is 
actually safe to run even though we cannot prove it statically, e.g. unique keys 
or functional dependencies may guarantee that the subquery returns at most one row.

To handle these cases, it is better to perform the check at runtime instead of 
statically. This could be implemented as a special aggregate operator that 
throws an exception on two or more input rows, a “single join” operator that 
throws an exception when two or more rows match, or something similar.
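
The proposed runtime behavior can be sketched in plain Python (a hypothetical illustration, not Spark's actual operator or API):

```python
# Sketch of a "single join": for each outer row, find the matching inner
# rows; error out at runtime only if two or more rows actually match,
# instead of rejecting the query statically.

def single_join(outer_rows, inner_rows, predicate):
    """Return (outer, inner-or-None) pairs.

    Raises ValueError when >= 2 inner rows match an outer row, which is
    the condition the proposed runtime check would reject.
    """
    results = []
    for o in outer_rows:
        matches = [i for i in inner_rows if predicate(o, i)]
        if len(matches) >= 2:
            raise ValueError("scalar subquery returned more than one row")
        # Zero matches is fine: the scalar subquery yields NULL.
        results.append((o, matches[0] if matches else None))
    return results
```

A subquery guaranteed unique by a key passes silently; only inputs that genuinely produce duplicate matches fail, and they fail at runtime rather than at analysis time.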

There are also cases where we previously allowed queries that returned wrong 
results and should have been rejected as invalid (e.g. SPARK-48503, 
SPARK-18504). Doing the check at runtime would also help avoid such bugs.

Current workarounds: users can add an aggregate such as {{any_value()}} or 
{{first()}} to the output of the subquery, or add {{limit 1}}.
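
The semantics of these workarounds can be illustrated with plain-Python stand-ins (hypothetical helpers, not Spark functions):

```python
# Stand-ins for the workaround semantics: each collapses a possibly
# multi-row subquery result to at most one row, so the static
# "must be aggregated" check is satisfied.

def any_value(rows):
    # Like SQL any_value()/first(): pick one arbitrary row (here the
    # first), or None when the input is empty.
    return rows[0] if rows else None

def limit_one(rows):
    # Like LIMIT 1: keep at most one row.
    return rows[:1]
```

Both silence the error by construction, but note they also silently discard rows when duplicates exist, which is exactly the case a runtime check would instead surface as an error.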

  was:
Currently, if a scalar subquery's result is not aggregated or limited to one row, we throw 
the error {{{}Correlated scalar subqueries must be aggregated{}}}.

This check is often too restrictive: there are many cases where the query is 
actually safe to run even though we cannot prove it statically, e.g. unique keys 
or functional dependencies may guarantee that the subquery returns at most one row.

To handle these cases, it is better to perform the check at runtime instead of 
statically. This could be implemented as a special aggregate operator that 
throws an exception on two or more input rows, a “single join” operator that 
throws an exception when two or more rows match, or something similar.

There are also cases where we previously allowed queries that returned wrong 
results and should have been rejected as invalid. Doing the check at runtime 
would also help avoid such bugs.

Current workarounds: users can add an aggregate such as {{any_value()}} or 
{{first()}} to the output of the subquery, or add {{limit 1}}.


> Loosen `correlated scalar subqueries must be aggregated` error by doing 
> runtime check for scalar subqueries output rowcount
> ---------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-48501
>                 URL: https://issues.apache.org/jira/browse/SPARK-48501
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 4.0.0
>            Reporter: Jack Chen
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
