[ https://issues.apache.org/jira/browse/SPARK-28619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17934433#comment-17934433 ]
Surbhi Vijayvargeeya commented on SPARK-28619: ---------------------------------------------- [~yumwang] was there any investigation on why the results are different? I am also seeing different results in "Kurtosis" & "Variance" when there are "nulls" present in the data. {code:java} -- With NULL CREATE OR REPLACE TEMPORARY VIEW testData AS SELECT * FROM VALUES (1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (3, 2), (null, 1), (3, null), (null, null) AS testData(a, b); SELECT KURTOSIS(a), mean(a), avg(a), variance(a), skewness(a) from testData FROM testData; Result: [-1.5069204152249134,2.142857142857143,2.142857142857143,0.8095238095238094,-0.2723801058145729] -- Without NULL CREATE OR REPLACE TEMPORARY VIEW testData AS SELECT * FROM VALUES (1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (3, 2), (3, null) AS testData(a, b); SELECT KURTOSIS(a), mean(a), avg(a), variance(a), skewness(a) from testData FROM testData; [-1.5069204152249138,2.142857142857143,2.142857142857143,0.8095238095238096,-0.2723801058145729]{code} Based on the code and public docs, NULLs should not affect the result in aggregate functions. > List all cases where the golden result file is different from spark-sql > ----------------------------------------------------------------------- > > Key: SPARK-28619 > URL: https://issues.apache.org/jira/browse/SPARK-28619 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 3.0.0 > Reporter: Yuming Wang > Priority: Major > > List all cases where the golden result file is different from {{spark-sql}}. > Case 1 from {{pgSQL/aggregates_part1.sql}}: > {code:sql} > SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE)) > FROM (VALUES (7000000000005), (7000000000007)) v(x); > {code} > {noformat} > -- spark-sql > spark-sql> SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE)) > > FROM (VALUES (100000003), (100000004), (100000006), (100000007)) > v(x); > 1.00000005E8 2.5000000049670534 > -- Our golden result file > SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE)) > FROM (VALUES (100000003), (100000004), (100000006), (100000007)) v(x) > -- !query 33 schema > struct<avg(CAST(x AS DOUBLE)):double,var_pop(CAST(x AS DOUBLE)):double> > -- !query 33 output > 1.00000005E8 2.5 > {noformat} > Case 2 from {{group-by.sql}}: > {code:sql} > CREATE OR REPLACE TEMPORARY VIEW testData AS SELECT * FROM VALUES > (1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (3, 2), (null, 1), (3, null), (null, > null) > AS testData(a, b); > SELECT SKEWNESS(a), KURTOSIS(a), MIN(a), MAX(a), AVG(a), VARIANCE(a), > STDDEV(a), SUM(a), COUNT(a) > FROM testData; > {code} > {noformat} > -- spark-sql > spark-sql> SELECT SKEWNESS(a), KURTOSIS(a), MIN(a), MAX(a), AVG(a), > VARIANCE(a), STDDEV(a), SUM(a), COUNT(a) > > FROM testData; > -0.2723801058145728 -1.5069204152249136 1 3 > 2.142857142857143 0.8095238095238094 0.8997354108424372 15 > 7 > -- Our golden result file > SELECT SKEWNESS(a), KURTOSIS(a), MIN(a), MAX(a), AVG(a), VARIANCE(a), > STDDEV(a), SUM(a), COUNT(a) > FROM testData > -- !query 13 schema > struct<skewness(CAST(a AS DOUBLE)):double,kurtosis(CAST(a AS > DOUBLE)):double,min(a):int,max(a):int,avg(a):double,var_samp(CAST(a AS > DOUBLE)):double,stddev_samp(CAST(a AS > DOUBLE)):double,sum(a):bigint,count(a):bigint> > -- !query 13 output > -0.2723801058145729 -1.5069204152249134 1 3 > 2.142857142857143 0.8095238095238094 0.8997354108424372 15 > 7 > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org