Jack Chen created SPARK-48501:
---------------------------------

             Summary: Loosen `correlated scalar subqueries must be aggregated` 
error by doing runtime check for scalar subqueries output rowcount
                 Key: SPARK-48501
                 URL: https://issues.apache.org/jira/browse/SPARK-48501
             Project: Spark
          Issue Type: Sub-task
          Components: SQL
    Affects Versions: 4.0.0
            Reporter: Jack Chen


Currently, if a correlated scalar subquery's output is not aggregated or limited to a single row, we throw the error {{Correlated scalar subqueries must be aggregated}}.

This check is often too restrictive: there are many cases where the query could actually run safely even though we cannot prove it statically. For example, unique keys or functional dependencies might guarantee that the subquery returns at most one row.
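For example, a query like the following currently fails with this error even when a unique key on {{t2.key}} guarantees at most one match ({{t1}} and {{t2}} are hypothetical tables used only for illustration):

{code:sql}
-- Fails today with "Correlated scalar subqueries must be aggregated",
-- even if t2.key is unique and the subquery can never return two rows.
SELECT t1.a,
       (SELECT t2.b FROM t2 WHERE t2.key = t1.key) AS b
FROM t1
{code}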

To handle these cases, it would be better to perform the check at runtime instead of statically. This could be implemented as a special aggregate operator that throws an exception when its input contains two or more rows, a “single join” operator that throws an exception when two or more rows match, or something similar.

There are also cases where we previously allowed invalid queries that returned wrong results when they should have been rejected. Performing the check at runtime would catch those bugs as well.

Current workarounds: users can add an aggregate such as {{any_value()}} or {{first()}} to the output of the subquery, or add {{limit 1}} to the subquery.
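As a sketch, either workaround rewrites a failing subquery so that it provably returns at most one row (table and column names are hypothetical):

{code:sql}
-- Workaround 1: wrap the subquery output in an aggregate.
SELECT t1.a,
       (SELECT any_value(t2.b) FROM t2 WHERE t2.key = t1.key) AS b
FROM t1;

-- Workaround 2: limit the subquery to a single row.
SELECT t1.a,
       (SELECT t2.b FROM t2 WHERE t2.key = t1.key LIMIT 1) AS b
FROM t1;
{code}

Note that both workarounds silently pick one row when multiple rows match rather than raising an error, which is part of why a runtime cardinality check is preferable.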



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
