[ 
https://issues.apache.org/jira/browse/SPARK-28619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17934433#comment-17934433
 ] 

Surbhi Vijayvargeeya commented on SPARK-28619:
----------------------------------------------

[~yumwang] was there any investigation on why the results are different?

I am also seeing different results in "Kurtosis" & "Variance" when there are 
"nulls" present in the data.
{code:java}
-- With NULL
CREATE OR REPLACE TEMPORARY VIEW testData AS SELECT * FROM VALUES
(1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (3, 2), (null, 1), (3, null), (null, 
null)
AS testData(a, b); 
SELECT KURTOSIS(a), mean(a), avg(a), variance(a), skewness(a) from testData 
FROM testData;

Result:
[-1.5069204152249134,2.142857142857143,2.142857142857143,0.8095238095238094,-0.2723801058145729]


-- Without NULL
CREATE OR REPLACE TEMPORARY VIEW testData AS SELECT * FROM VALUES (1, 1), (1, 
2), (2, 1), (2, 2), (3, 1), (3, 2), (3, null) AS testData(a, b); 
SELECT KURTOSIS(a), mean(a), avg(a), variance(a), skewness(a) from testData 
FROM testData;

[-1.5069204152249138,2.142857142857143,2.142857142857143,0.8095238095238096,-0.2723801058145729]{code}
Based on the code and public docs, NULLs should not affect the result in 
aggregate functions.

> List all cases where the golden result file is different from spark-sql
> -----------------------------------------------------------------------
>
>                 Key: SPARK-28619
>                 URL: https://issues.apache.org/jira/browse/SPARK-28619
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Yuming Wang
>            Priority: Major
>
> List all cases where the golden result file is different from {{spark-sql}}.
> Case 1 from {{pgSQL/aggregates_part1.sql}}:
> {code:sql}
> SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE))
> FROM (VALUES (7000000000005), (7000000000007)) v(x);
> {code}
> {noformat}
> -- spark-sql
> spark-sql> SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE))
>          > FROM (VALUES (100000003), (100000004), (100000006), (100000007)) 
> v(x);
> 1.00000005E8  2.5000000049670534
> -- Our golden result file
> SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE))
> FROM (VALUES (100000003), (100000004), (100000006), (100000007)) v(x)
> -- !query 33 schema
> struct<avg(CAST(x AS DOUBLE)):double,var_pop(CAST(x AS DOUBLE)):double>
> -- !query 33 output
> 1.00000005E8  2.5
> {noformat}
> Case 2 from {{group-by.sql}}:
> {code:sql}
> CREATE OR REPLACE TEMPORARY VIEW testData AS SELECT * FROM VALUES
> (1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (3, 2), (null, 1), (3, null), (null, 
> null)
> AS testData(a, b);
> SELECT SKEWNESS(a), KURTOSIS(a), MIN(a), MAX(a), AVG(a), VARIANCE(a), 
> STDDEV(a), SUM(a), COUNT(a)
> FROM testData;
> {code}
> {noformat}
> -- spark-sql
> spark-sql> SELECT SKEWNESS(a), KURTOSIS(a), MIN(a), MAX(a), AVG(a), 
> VARIANCE(a), STDDEV(a), SUM(a), COUNT(a)
>          > FROM testData;
> -0.2723801058145728   -1.5069204152249136     1       3       
> 2.142857142857143       0.8095238095238094      0.8997354108424372      15    
>   7
> -- Our golden result file
> SELECT SKEWNESS(a), KURTOSIS(a), MIN(a), MAX(a), AVG(a), VARIANCE(a), 
> STDDEV(a), SUM(a), COUNT(a)
> FROM testData
> -- !query 13 schema
> struct<skewness(CAST(a AS DOUBLE)):double,kurtosis(CAST(a AS 
> DOUBLE)):double,min(a):int,max(a):int,avg(a):double,var_samp(CAST(a AS 
> DOUBLE)):double,stddev_samp(CAST(a AS 
> DOUBLE)):double,sum(a):bigint,count(a):bigint>
> -- !query 13 output
> -0.2723801058145729   -1.5069204152249134     1       3       
> 2.142857142857143       0.8095238095238094      0.8997354108424372      15    
>   7
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to