[jira] [Commented] (CALCITE-6324) Type inferred for result of STDDEV, VAR_SAMP, etc. is incorrect

Mihai Budiu (Jira) Wed, 13 Mar 2024 13:04:05 -0700


    [ 
https://issues.apache.org/jira/browse/CALCITE-6324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17826852#comment-17826852
 ]


Mihai Budiu commented on CALCITE-6324:
--------------------------------------

Indeed, it is not clear what the precision and scale of the result should be 
for some of these functions.
The precision required to fit the result depends not only on the value type, 
but also on the table size.

If we leave the current type inference rules for these functions a program like 
the the following:

{code:sql}
select var_samp(sal) from emp;
{code}

which appears in redshift.iq, will fail with a runtime error because the result 
~1398313.87 cannot be represented in the type DECIMAL(7, 2), the type of 
EMP.SAL.

SQL server uses a higher precision even for short integer types for AVG: 
https://learn.microsoft.com/en-us/sql/t-sql/functions/avg-transact-sql?view=sql-server-ver16#return-types
SQL server also returns FLOAT for the statistical aggregates.

Postgres seems to use an "arbitrary precision" decimal for these values 
(Postgres DECIMALS can have up to billions of digits), based on testing.

Oracle uses for the result the same type as the source data type: 
https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/STDDEV.html

In my PR I have used DECIMAL(2*precision, 2*scale) for the result for this 
class of functions, where DECIMAL(precision, scale) is the precision of the 
source data.
I think that a larger result type is more user-friendly, since fewer programs 
will surprisingly fail at runtime if people forget to cast the data.

If we keep the current type for the results we will have to adjust the 
accumulator types and change quite a few test programs to insert explicit casts 
before the aggregation. (I will have to check, we may need to change the 
accumulator types anyway, didn't look to see exactly how the evaluation is 
done.)

Happy for a suggestion to solve this.


> Type inferred for result of STDDEV, VAR_SAMP, etc. is incorrect
> ---------------------------------------------------------------
>
>                 Key: CALCITE-6324
>                 URL: https://issues.apache.org/jira/browse/CALCITE-6324
>             Project: Calcite
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 1.36.0
>            Reporter: Mihai Budiu
>            Assignee: Mihai Budiu
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 1.37.0
>
>
> These functions are all use the same type inference algorithm, essentially 
> the algorithm used by AVG.
> But if the values processed are decimal, STDDEV (and others) need much higher 
> precision to represent the result. (I am not sure that the inference is right 
> for integer types either, btw.)
> This surfaced during the implementation of a fix for [CALCITE-6322]: if we 
> use the type inferred for these functions, the result overflows and causes a 
> runtime exception.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (CALCITE-6324) Type inferred for result of STDDEV, VAR_SAMP, etc. is incorrect

Reply via email to