[jira] [Commented] (SPARK-29699) Different answers in nested aggregates with window functions
[ https://issues.apache.org/jira/browse/SPARK-29699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17002122#comment-17002122 ] Takeshi Yamamuro commented on SPARK-29699: -- A root cause is the NULL order difference in order-by clauses; in Spark, NULLS FIRST is the default for an ascending order, but in PostgreSQL/Oracle, NULLS LAST is the default for that. So, if we explicitly set the order in the example query above, we can get the same answer with PostgreSQL; {code:java} sql("""select a, b, sum(c), sum(sum(c)) over (order by a asc nulls last, b asc nulls last) as rsum from gstest2 group by rollup (a,b) order by rsum asc nulls last, a asc nulls last, b asc nulls last""").show() +++--++ | a| b|sum(c)|rsum| +++--++ | 1| 1| 8| 8| | 1| 2| 2| 10| | 1|null|10| 20| | 2| 2| 2| 22| | 2|null| 2| 24| |null|null|12| 36| +++--++ {code} Currently, it seems we follow the MySQL/SQL Server behaviour and they have NULLS FIRST by default. Any historical reason for our default NULL order? cc: [~smilegator] [~cloud_fan] [~viirya] Changing the default behaivour in Spark has some impacts on test output in SQLQueryTestSuite: [https://github.com/apache/spark/compare/master...maropu:NullLastByDefault] References: Cited from the PostgreSQL doc: [https://www.postgresql.org/docs/current/queries-order.html] {code:java} By default, null values sort as if larger than any non-null value; that is, NULLS FIRST is the default for DESC order, and NULLS LAST otherwise. {code} Cited from the OracleDB doc: [https://docs.oracle.com/database/121/SQLRF/statements_10002.htm#SQLRF01702] {code:java} NULLS LAST is the default for ascending order, and NULLS FIRST is the default for descending order. {code} Cited from the SQL server: [https://docs.microsoft.com/en-us/sql/t-sql/queries/select-order-by-clause-transact-sql?redirectedfrom=MSDN&view=sql-server-ver15] {code:java} ASC is the default sort order. Null values are treated as the lowest possible values. {code} Cited from the MySQL: [https://dev.mysql.com/doc/refman/5.7/en/working-with-null.html] {code:java} When doing an ORDER BY, NULL values are presented first if you do ORDER BY ... ASC and last if you do ORDER BY ... DESC. {code} > Different answers in nested aggregates with window functions > > > Key: SPARK-29699 > URL: https://issues.apache.org/jira/browse/SPARK-29699 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Takeshi Yamamuro >Priority: Major > Labels: correctness > > A nested aggregate below with a window function seems to have different > answers in the `rsum` column between PgSQL and Spark; > {code:java} > postgres=# create table gstest2 (a integer, b integer, c integer, d integer, > e integer, f integer, g integer, h integer); > postgres=# insert into gstest2 values > postgres-# (1, 1, 1, 1, 1, 1, 1, 1), > postgres-# (1, 1, 1, 1, 1, 1, 1, 2), > postgres-# (1, 1, 1, 1, 1, 1, 2, 2), > postgres-# (1, 1, 1, 1, 1, 2, 2, 2), > postgres-# (1, 1, 1, 1, 2, 2, 2, 2), > postgres-# (1, 1, 1, 2, 2, 2, 2, 2), > postgres-# (1, 1, 2, 2, 2, 2, 2, 2), > postgres-# (1, 2, 2, 2, 2, 2, 2, 2), > postgres-# (2, 2, 2, 2, 2, 2, 2, 2); > INSERT 0 9 > postgres=# > postgres=# select a, b, sum(c), sum(sum(c)) over (order by a,b) as rsum > postgres-# from gstest2 group by rollup (a,b) order by rsum, a, b; > a | b | sum | rsum > ---+---+-+-- > 1 | 1 | 8 |8 > 1 | 2 | 2 | 10 > 1 | | 10 | 20 > 2 | 2 | 2 | 22 > 2 | | 2 | 24 >| | 12 | 36 > (6 rows) > {code} > {code:java} > scala> sql(""" > | select a, b, sum(c), sum(sum(c)) over (order by a,b) as rsum > | from gstest2 group by rollup (a,b) order by rsum, a, b > | """).show() > +++--++ > > | a| b|sum(c)|rsum| > +++--++ > |null|null|12| 12| > | 1|null|10| 22| > | 1| 1| 8| 30| > | 1| 2| 2| 32| > | 2|null| 2| 34| > | 2| 2| 2| 36| > +++--++ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29699) Different answers in nested aggregates with window functions
[ https://issues.apache.org/jira/browse/SPARK-29699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17021404#comment-17021404 ] Thomas Graves commented on SPARK-29699: --- this seems to be in the context of feature parity with postgres, we aren't doing that now and you say we do the same as mysql, if that is the case I would argue this doesn't seem like a correctness issue but a compatibility issue. thoughts? > Different answers in nested aggregates with window functions > > > Key: SPARK-29699 > URL: https://issues.apache.org/jira/browse/SPARK-29699 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Takeshi Yamamuro >Priority: Major > Labels: correctness > > A nested aggregate below with a window function seems to have different > answers in the `rsum` column between PgSQL and Spark; > {code:java} > postgres=# create table gstest2 (a integer, b integer, c integer, d integer, > e integer, f integer, g integer, h integer); > postgres=# insert into gstest2 values > postgres-# (1, 1, 1, 1, 1, 1, 1, 1), > postgres-# (1, 1, 1, 1, 1, 1, 1, 2), > postgres-# (1, 1, 1, 1, 1, 1, 2, 2), > postgres-# (1, 1, 1, 1, 1, 2, 2, 2), > postgres-# (1, 1, 1, 1, 2, 2, 2, 2), > postgres-# (1, 1, 1, 2, 2, 2, 2, 2), > postgres-# (1, 1, 2, 2, 2, 2, 2, 2), > postgres-# (1, 2, 2, 2, 2, 2, 2, 2), > postgres-# (2, 2, 2, 2, 2, 2, 2, 2); > INSERT 0 9 > postgres=# > postgres=# select a, b, sum(c), sum(sum(c)) over (order by a,b) as rsum > postgres-# from gstest2 group by rollup (a,b) order by rsum, a, b; > a | b | sum | rsum > ---+---+-+-- > 1 | 1 | 8 |8 > 1 | 2 | 2 | 10 > 1 | | 10 | 20 > 2 | 2 | 2 | 22 > 2 | | 2 | 24 >| | 12 | 36 > (6 rows) > {code} > {code:java} > scala> sql(""" > | select a, b, sum(c), sum(sum(c)) over (order by a,b) as rsum > | from gstest2 group by rollup (a,b) order by rsum, a, b > | """).show() > +++--++ > > | a| b|sum(c)|rsum| > +++--++ > |null|null|12| 12| > | 1|null|10| 22| > | 1| 1| 8| 30| > | 1| 2| 2| 32| > | 2|null| 2| 34| > | 2| 2| 2| 36| > +++--++ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29699) Different answers in nested aggregates with window functions
[ https://issues.apache.org/jira/browse/SPARK-29699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17039783#comment-17039783 ] Hyukjin Kwon commented on SPARK-29699: -- I lowered to Critical+ for now. > Different answers in nested aggregates with window functions > > > Key: SPARK-29699 > URL: https://issues.apache.org/jira/browse/SPARK-29699 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Takeshi Yamamuro >Priority: Critical > > A nested aggregate below with a window function seems to have different > answers in the `rsum` column between PgSQL and Spark; > {code:java} > postgres=# create table gstest2 (a integer, b integer, c integer, d integer, > e integer, f integer, g integer, h integer); > postgres=# insert into gstest2 values > postgres-# (1, 1, 1, 1, 1, 1, 1, 1), > postgres-# (1, 1, 1, 1, 1, 1, 1, 2), > postgres-# (1, 1, 1, 1, 1, 1, 2, 2), > postgres-# (1, 1, 1, 1, 1, 2, 2, 2), > postgres-# (1, 1, 1, 1, 2, 2, 2, 2), > postgres-# (1, 1, 1, 2, 2, 2, 2, 2), > postgres-# (1, 1, 2, 2, 2, 2, 2, 2), > postgres-# (1, 2, 2, 2, 2, 2, 2, 2), > postgres-# (2, 2, 2, 2, 2, 2, 2, 2); > INSERT 0 9 > postgres=# > postgres=# select a, b, sum(c), sum(sum(c)) over (order by a,b) as rsum > postgres-# from gstest2 group by rollup (a,b) order by rsum, a, b; > a | b | sum | rsum > ---+---+-+-- > 1 | 1 | 8 |8 > 1 | 2 | 2 | 10 > 1 | | 10 | 20 > 2 | 2 | 2 | 22 > 2 | | 2 | 24 >| | 12 | 36 > (6 rows) > {code} > {code:java} > scala> sql(""" > | select a, b, sum(c), sum(sum(c)) over (order by a,b) as rsum > | from gstest2 group by rollup (a,b) order by rsum, a, b > | """).show() > +++--++ > > | a| b|sum(c)|rsum| > +++--++ > |null|null|12| 12| > | 1|null|10| 22| > | 1| 1| 8| 30| > | 1| 2| 2| 32| > | 2|null| 2| 34| > | 2| 2| 2| 36| > +++--++ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29699) Different answers in nested aggregates with window functions
[ https://issues.apache.org/jira/browse/SPARK-29699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17253393#comment-17253393 ] jiaan.geng commented on SPARK-29699: The behavior of various mainstream databases is different, it seems that there is no need to force which one. > Different answers in nested aggregates with window functions > > > Key: SPARK-29699 > URL: https://issues.apache.org/jira/browse/SPARK-29699 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Takeshi Yamamuro >Priority: Critical > > A nested aggregate below with a window function seems to have different > answers in the `rsum` column between PgSQL and Spark; > {code:java} > postgres=# create table gstest2 (a integer, b integer, c integer, d integer, > e integer, f integer, g integer, h integer); > postgres=# insert into gstest2 values > postgres-# (1, 1, 1, 1, 1, 1, 1, 1), > postgres-# (1, 1, 1, 1, 1, 1, 1, 2), > postgres-# (1, 1, 1, 1, 1, 1, 2, 2), > postgres-# (1, 1, 1, 1, 1, 2, 2, 2), > postgres-# (1, 1, 1, 1, 2, 2, 2, 2), > postgres-# (1, 1, 1, 2, 2, 2, 2, 2), > postgres-# (1, 1, 2, 2, 2, 2, 2, 2), > postgres-# (1, 2, 2, 2, 2, 2, 2, 2), > postgres-# (2, 2, 2, 2, 2, 2, 2, 2); > INSERT 0 9 > postgres=# > postgres=# select a, b, sum(c), sum(sum(c)) over (order by a,b) as rsum > postgres-# from gstest2 group by rollup (a,b) order by rsum, a, b; > a | b | sum | rsum > ---+---+-+-- > 1 | 1 | 8 |8 > 1 | 2 | 2 | 10 > 1 | | 10 | 20 > 2 | 2 | 2 | 22 > 2 | | 2 | 24 >| | 12 | 36 > (6 rows) > {code} > {code:java} > scala> sql(""" > | select a, b, sum(c), sum(sum(c)) over (order by a,b) as rsum > | from gstest2 group by rollup (a,b) order by rsum, a, b > | """).show() > +++--++ > > | a| b|sum(c)|rsum| > +++--++ > |null|null|12| 12| > | 1|null|10| 22| > | 1| 1| 8| 30| > | 1| 2| 2| 32| > | 2|null| 2| 34| > | 2| 2| 2| 36| > +++--++ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29699) Different answers in nested aggregates with window functions
[ https://issues.apache.org/jira/browse/SPARK-29699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17371408#comment-17371408 ] shubhangi priya commented on SPARK-29699: - [~maropu] Hi I want to work on this issue. If it is unresolved can I work upon it? > Different answers in nested aggregates with window functions > > > Key: SPARK-29699 > URL: https://issues.apache.org/jira/browse/SPARK-29699 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Takeshi Yamamuro >Priority: Critical > > A nested aggregate below with a window function seems to have different > answers in the `rsum` column between PgSQL and Spark; > {code:java} > postgres=# create table gstest2 (a integer, b integer, c integer, d integer, > e integer, f integer, g integer, h integer); > postgres=# insert into gstest2 values > postgres-# (1, 1, 1, 1, 1, 1, 1, 1), > postgres-# (1, 1, 1, 1, 1, 1, 1, 2), > postgres-# (1, 1, 1, 1, 1, 1, 2, 2), > postgres-# (1, 1, 1, 1, 1, 2, 2, 2), > postgres-# (1, 1, 1, 1, 2, 2, 2, 2), > postgres-# (1, 1, 1, 2, 2, 2, 2, 2), > postgres-# (1, 1, 2, 2, 2, 2, 2, 2), > postgres-# (1, 2, 2, 2, 2, 2, 2, 2), > postgres-# (2, 2, 2, 2, 2, 2, 2, 2); > INSERT 0 9 > postgres=# > postgres=# select a, b, sum(c), sum(sum(c)) over (order by a,b) as rsum > postgres-# from gstest2 group by rollup (a,b) order by rsum, a, b; > a | b | sum | rsum > ---+---+-+-- > 1 | 1 | 8 |8 > 1 | 2 | 2 | 10 > 1 | | 10 | 20 > 2 | 2 | 2 | 22 > 2 | | 2 | 24 >| | 12 | 36 > (6 rows) > {code} > {code:java} > scala> sql(""" > | select a, b, sum(c), sum(sum(c)) over (order by a,b) as rsum > | from gstest2 group by rollup (a,b) order by rsum, a, b > | """).show() > +++--++ > > | a| b|sum(c)|rsum| > +++--++ > |null|null|12| 12| > | 1|null|10| 22| > | 1| 1| 8| 30| > | 1| 2| 2| 32| > | 2|null| 2| 34| > | 2| 2| 2| 36| > +++--++ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org