[jira] [Commented] (SPARK-29699) Different answers in nested aggregates with window functions

2019-12-22 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17002122#comment-17002122
 ] 

Takeshi Yamamuro commented on SPARK-29699:
--

A root cause is the NULL order difference in order-by clauses; in Spark, NULLS 
FIRST is the default for an ascending order, but in PostgreSQL/Oracle, NULLS 
LAST is the default for that. So, if we explicitly set the order in the example 
query above, we can get the same answer with PostgreSQL;
{code:java}
sql("""select a, b, sum(c), sum(sum(c)) over (order by a asc nulls last, b asc 
nulls last) as rsum
  from gstest2 group by rollup (a,b) order by rsum asc nulls last, a asc nulls 
last, b asc nulls last""").show()
+++--++ 
|   a|   b|sum(c)|rsum|
+++--++
|   1|   1| 8|   8|
|   1|   2| 2|  10|
|   1|null|10|  20|
|   2|   2| 2|  22|
|   2|null| 2|  24|
|null|null|12|  36|
+++--++
{code}
Currently, it seems we follow the MySQL/SQL Server behaviour and they have 
NULLS FIRST by default.
 Any historical reason for our default NULL order? cc: [~smilegator] 
[~cloud_fan] [~viirya]

Changing the default behaivour in Spark has some impacts on test output in 
SQLQueryTestSuite:
 [https://github.com/apache/spark/compare/master...maropu:NullLastByDefault]

 

References:
 Cited from the PostgreSQL doc: 
[https://www.postgresql.org/docs/current/queries-order.html]
{code:java}
By default, null values sort as if larger than any non-null value; that is, 
NULLS FIRST is the default for DESC order, and NULLS LAST otherwise.
{code}
Cited from the OracleDB doc: 
[https://docs.oracle.com/database/121/SQLRF/statements_10002.htm#SQLRF01702]
{code:java}
NULLS LAST is the default for ascending order, and NULLS FIRST is the default 
for descending order.
{code}
Cited from the SQL server: 
[https://docs.microsoft.com/en-us/sql/t-sql/queries/select-order-by-clause-transact-sql?redirectedfrom=MSDN&view=sql-server-ver15]
{code:java}
 ASC is the default sort order. Null values are treated as the lowest possible 
values.
{code}
Cited from the MySQL: 
[https://dev.mysql.com/doc/refman/5.7/en/working-with-null.html]
{code:java}
When doing an ORDER BY, NULL values are presented first if you do ORDER BY ... 
ASC and last if you do ORDER BY ... DESC.
{code}

> Different answers in nested aggregates with window functions
> 
>
> Key: SPARK-29699
> URL: https://issues.apache.org/jira/browse/SPARK-29699
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Takeshi Yamamuro
>Priority: Major
>  Labels: correctness
>
> A nested aggregate below with a window function seems to have different 
> answers in the `rsum` column  between PgSQL and Spark;
> {code:java}
> postgres=# create table gstest2 (a integer, b integer, c integer, d integer, 
> e integer, f integer, g integer, h integer);
> postgres=# insert into gstest2 values
> postgres-#   (1, 1, 1, 1, 1, 1, 1, 1),
> postgres-#   (1, 1, 1, 1, 1, 1, 1, 2),
> postgres-#   (1, 1, 1, 1, 1, 1, 2, 2),
> postgres-#   (1, 1, 1, 1, 1, 2, 2, 2),
> postgres-#   (1, 1, 1, 1, 2, 2, 2, 2),
> postgres-#   (1, 1, 1, 2, 2, 2, 2, 2),
> postgres-#   (1, 1, 2, 2, 2, 2, 2, 2),
> postgres-#   (1, 2, 2, 2, 2, 2, 2, 2),
> postgres-#   (2, 2, 2, 2, 2, 2, 2, 2);
> INSERT 0 9
> postgres=# 
> postgres=# select a, b, sum(c), sum(sum(c)) over (order by a,b) as rsum
> postgres-#   from gstest2 group by rollup (a,b) order by rsum, a, b;
>  a | b | sum | rsum 
> ---+---+-+--
>  1 | 1 |   8 |8
>  1 | 2 |   2 |   10
>  1 |   |  10 |   20
>  2 | 2 |   2 |   22
>  2 |   |   2 |   24
>|   |  12 |   36
> (6 rows)
> {code}
> {code:java}
> scala> sql("""
>  | select a, b, sum(c), sum(sum(c)) over (order by a,b) as rsum
>  |   from gstest2 group by rollup (a,b) order by rsum, a, b
>  | """).show()
> +++--++   
>   
> |   a|   b|sum(c)|rsum|
> +++--++
> |null|null|12|  12|
> |   1|null|10|  22|
> |   1|   1| 8|  30|
> |   1|   2| 2|  32|
> |   2|null| 2|  34|
> |   2|   2| 2|  36|
> +++--++
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29699) Different answers in nested aggregates with window functions

2020-01-22 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17021404#comment-17021404
 ] 

Thomas Graves commented on SPARK-29699:
---

this seems to be in the context of feature parity with postgres, we aren't 
doing that now and you say we do the same as mysql, if that is the case I would 
argue this doesn't seem like a correctness issue but a compatibility issue.  
thoughts?

> Different answers in nested aggregates with window functions
> 
>
> Key: SPARK-29699
> URL: https://issues.apache.org/jira/browse/SPARK-29699
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Takeshi Yamamuro
>Priority: Major
>  Labels: correctness
>
> A nested aggregate below with a window function seems to have different 
> answers in the `rsum` column  between PgSQL and Spark;
> {code:java}
> postgres=# create table gstest2 (a integer, b integer, c integer, d integer, 
> e integer, f integer, g integer, h integer);
> postgres=# insert into gstest2 values
> postgres-#   (1, 1, 1, 1, 1, 1, 1, 1),
> postgres-#   (1, 1, 1, 1, 1, 1, 1, 2),
> postgres-#   (1, 1, 1, 1, 1, 1, 2, 2),
> postgres-#   (1, 1, 1, 1, 1, 2, 2, 2),
> postgres-#   (1, 1, 1, 1, 2, 2, 2, 2),
> postgres-#   (1, 1, 1, 2, 2, 2, 2, 2),
> postgres-#   (1, 1, 2, 2, 2, 2, 2, 2),
> postgres-#   (1, 2, 2, 2, 2, 2, 2, 2),
> postgres-#   (2, 2, 2, 2, 2, 2, 2, 2);
> INSERT 0 9
> postgres=# 
> postgres=# select a, b, sum(c), sum(sum(c)) over (order by a,b) as rsum
> postgres-#   from gstest2 group by rollup (a,b) order by rsum, a, b;
>  a | b | sum | rsum 
> ---+---+-+--
>  1 | 1 |   8 |8
>  1 | 2 |   2 |   10
>  1 |   |  10 |   20
>  2 | 2 |   2 |   22
>  2 |   |   2 |   24
>|   |  12 |   36
> (6 rows)
> {code}
> {code:java}
> scala> sql("""
>  | select a, b, sum(c), sum(sum(c)) over (order by a,b) as rsum
>  |   from gstest2 group by rollup (a,b) order by rsum, a, b
>  | """).show()
> +++--++   
>   
> |   a|   b|sum(c)|rsum|
> +++--++
> |null|null|12|  12|
> |   1|null|10|  22|
> |   1|   1| 8|  30|
> |   1|   2| 2|  32|
> |   2|null| 2|  34|
> |   2|   2| 2|  36|
> +++--++
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29699) Different answers in nested aggregates with window functions

2020-02-19 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17039783#comment-17039783
 ] 

Hyukjin Kwon commented on SPARK-29699:
--

I lowered to Critical+ for now.

> Different answers in nested aggregates with window functions
> 
>
> Key: SPARK-29699
> URL: https://issues.apache.org/jira/browse/SPARK-29699
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Takeshi Yamamuro
>Priority: Critical
>
> A nested aggregate below with a window function seems to have different 
> answers in the `rsum` column  between PgSQL and Spark;
> {code:java}
> postgres=# create table gstest2 (a integer, b integer, c integer, d integer, 
> e integer, f integer, g integer, h integer);
> postgres=# insert into gstest2 values
> postgres-#   (1, 1, 1, 1, 1, 1, 1, 1),
> postgres-#   (1, 1, 1, 1, 1, 1, 1, 2),
> postgres-#   (1, 1, 1, 1, 1, 1, 2, 2),
> postgres-#   (1, 1, 1, 1, 1, 2, 2, 2),
> postgres-#   (1, 1, 1, 1, 2, 2, 2, 2),
> postgres-#   (1, 1, 1, 2, 2, 2, 2, 2),
> postgres-#   (1, 1, 2, 2, 2, 2, 2, 2),
> postgres-#   (1, 2, 2, 2, 2, 2, 2, 2),
> postgres-#   (2, 2, 2, 2, 2, 2, 2, 2);
> INSERT 0 9
> postgres=# 
> postgres=# select a, b, sum(c), sum(sum(c)) over (order by a,b) as rsum
> postgres-#   from gstest2 group by rollup (a,b) order by rsum, a, b;
>  a | b | sum | rsum 
> ---+---+-+--
>  1 | 1 |   8 |8
>  1 | 2 |   2 |   10
>  1 |   |  10 |   20
>  2 | 2 |   2 |   22
>  2 |   |   2 |   24
>|   |  12 |   36
> (6 rows)
> {code}
> {code:java}
> scala> sql("""
>  | select a, b, sum(c), sum(sum(c)) over (order by a,b) as rsum
>  |   from gstest2 group by rollup (a,b) order by rsum, a, b
>  | """).show()
> +++--++   
>   
> |   a|   b|sum(c)|rsum|
> +++--++
> |null|null|12|  12|
> |   1|null|10|  22|
> |   1|   1| 8|  30|
> |   1|   2| 2|  32|
> |   2|null| 2|  34|
> |   2|   2| 2|  36|
> +++--++
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29699) Different answers in nested aggregates with window functions

2020-12-22 Thread jiaan.geng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17253393#comment-17253393
 ] 

jiaan.geng commented on SPARK-29699:


The behavior of various mainstream databases is different, it seems that there 
is no need to force which one.

> Different answers in nested aggregates with window functions
> 
>
> Key: SPARK-29699
> URL: https://issues.apache.org/jira/browse/SPARK-29699
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Takeshi Yamamuro
>Priority: Critical
>
> A nested aggregate below with a window function seems to have different 
> answers in the `rsum` column  between PgSQL and Spark;
> {code:java}
> postgres=# create table gstest2 (a integer, b integer, c integer, d integer, 
> e integer, f integer, g integer, h integer);
> postgres=# insert into gstest2 values
> postgres-#   (1, 1, 1, 1, 1, 1, 1, 1),
> postgres-#   (1, 1, 1, 1, 1, 1, 1, 2),
> postgres-#   (1, 1, 1, 1, 1, 1, 2, 2),
> postgres-#   (1, 1, 1, 1, 1, 2, 2, 2),
> postgres-#   (1, 1, 1, 1, 2, 2, 2, 2),
> postgres-#   (1, 1, 1, 2, 2, 2, 2, 2),
> postgres-#   (1, 1, 2, 2, 2, 2, 2, 2),
> postgres-#   (1, 2, 2, 2, 2, 2, 2, 2),
> postgres-#   (2, 2, 2, 2, 2, 2, 2, 2);
> INSERT 0 9
> postgres=# 
> postgres=# select a, b, sum(c), sum(sum(c)) over (order by a,b) as rsum
> postgres-#   from gstest2 group by rollup (a,b) order by rsum, a, b;
>  a | b | sum | rsum 
> ---+---+-+--
>  1 | 1 |   8 |8
>  1 | 2 |   2 |   10
>  1 |   |  10 |   20
>  2 | 2 |   2 |   22
>  2 |   |   2 |   24
>|   |  12 |   36
> (6 rows)
> {code}
> {code:java}
> scala> sql("""
>  | select a, b, sum(c), sum(sum(c)) over (order by a,b) as rsum
>  |   from gstest2 group by rollup (a,b) order by rsum, a, b
>  | """).show()
> +++--++   
>   
> |   a|   b|sum(c)|rsum|
> +++--++
> |null|null|12|  12|
> |   1|null|10|  22|
> |   1|   1| 8|  30|
> |   1|   2| 2|  32|
> |   2|null| 2|  34|
> |   2|   2| 2|  36|
> +++--++
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29699) Different answers in nested aggregates with window functions

2021-06-29 Thread shubhangi priya (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17371408#comment-17371408
 ] 

shubhangi priya commented on SPARK-29699:
-

[~maropu] Hi I want to work on this issue.
If it is unresolved can I work upon it?

> Different answers in nested aggregates with window functions
> 
>
> Key: SPARK-29699
> URL: https://issues.apache.org/jira/browse/SPARK-29699
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Takeshi Yamamuro
>Priority: Critical
>
> A nested aggregate below with a window function seems to have different 
> answers in the `rsum` column  between PgSQL and Spark;
> {code:java}
> postgres=# create table gstest2 (a integer, b integer, c integer, d integer, 
> e integer, f integer, g integer, h integer);
> postgres=# insert into gstest2 values
> postgres-#   (1, 1, 1, 1, 1, 1, 1, 1),
> postgres-#   (1, 1, 1, 1, 1, 1, 1, 2),
> postgres-#   (1, 1, 1, 1, 1, 1, 2, 2),
> postgres-#   (1, 1, 1, 1, 1, 2, 2, 2),
> postgres-#   (1, 1, 1, 1, 2, 2, 2, 2),
> postgres-#   (1, 1, 1, 2, 2, 2, 2, 2),
> postgres-#   (1, 1, 2, 2, 2, 2, 2, 2),
> postgres-#   (1, 2, 2, 2, 2, 2, 2, 2),
> postgres-#   (2, 2, 2, 2, 2, 2, 2, 2);
> INSERT 0 9
> postgres=# 
> postgres=# select a, b, sum(c), sum(sum(c)) over (order by a,b) as rsum
> postgres-#   from gstest2 group by rollup (a,b) order by rsum, a, b;
>  a | b | sum | rsum 
> ---+---+-+--
>  1 | 1 |   8 |8
>  1 | 2 |   2 |   10
>  1 |   |  10 |   20
>  2 | 2 |   2 |   22
>  2 |   |   2 |   24
>|   |  12 |   36
> (6 rows)
> {code}
> {code:java}
> scala> sql("""
>  | select a, b, sum(c), sum(sum(c)) over (order by a,b) as rsum
>  |   from gstest2 group by rollup (a,b) order by rsum, a, b
>  | """).show()
> +++--++   
>   
> |   a|   b|sum(c)|rsum|
> +++--++
> |null|null|12|  12|
> |   1|null|10|  22|
> |   1|   1| 8|  30|
> |   1|   2| 2|  32|
> |   2|null| 2|  34|
> |   2|   2| 2|  36|
> +++--++
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org