[GitHub] [spark] dongjoon-hyun commented on issue #27233: [SPARK-29701][SQL] Correct behaviours of group analytical queries when empty input given
dongjoon-hyun commented on issue #27233: [SPARK-29701][SQL] Correct behaviours of group analytical queries when empty input given URL: https://github.com/apache/spark/pull/27233#issuecomment-578352825 cc @dbtsai This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on issue #27233: [SPARK-29701][SQL] Correct behaviours of group analytical queries when empty input given
dongjoon-hyun commented on issue #27233: [SPARK-29701][SQL] Correct behaviours of group analytical queries when empty input given
URL: https://github.com/apache/spark/pull/27233#issuecomment-578350299

When you run a `SELECT` query, how would you feel if the result set were always missing one row?

> To me correctness means you get the wrong data back out

Given the following statement, do you find the query below counter-intuitive? SQL should return `0` for `COUNT(*)` when there are no rows. If you agree that is correct, why not for the `GROUPING SETS` grand total?

> the 0 to me seems counter intuitive anyway

```
spark-sql> select sum(a), count(*) from (select 1 a where false);
NULL 0
```

We are discussing this now because of the following.

> If there are other people that disagree then it obviously needs discussion,

`GROUPING SETS` is a commonly used construct in analytics queries (including A/B testing). You may not hit this query if your workload doesn't look like that. I understand your situation, but please don't overlook other people's situations. We have such queries.

> if someone can give me a concrete example that this caused them $$ lost that might change my mind.

As for $$: as you can see in this PR, by definition, if Spark doesn't return a `grand total`, we need to run another separate query to get that value in this step. That means maintaining a full, complex query just for this one missing `() /*grand total*/`. It's a maintenance cost. (I won't argue about the computing resource cost because the current implementation of this PR is not optimized yet.)
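The contrast this comment draws (a scalar aggregate over empty input yields one row, a grouped aggregate yields none) can be checked outside Spark. The following is a minimal sketch, assuming Python's built-in `sqlite3` as a stand-in engine; it is not Spark, and the variable names are illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Scalar aggregate over an empty input: exactly one row comes back,
# with SUM as NULL and COUNT(*) as 0 (same shape as the spark-sql output above).
scalar_rows = conn.execute(
    "SELECT SUM(a), COUNT(*) FROM (SELECT 1 AS a WHERE 1 = 0)"
).fetchall()

# Grouped aggregate over the same empty input: there are no groups, so no rows.
grouped_rows = conn.execute(
    "SELECT a, SUM(a), COUNT(*) FROM (SELECT 1 AS a WHERE 1 = 0) GROUP BY a"
).fetchall()

print(scalar_rows)   # [(None, 0)]
print(grouped_rows)  # []
```

The question raised in the thread is which of these two shapes the empty grouping set `()` should follow.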
[GitHub] [spark] dongjoon-hyun commented on issue #27233: [SPARK-29701][SQL] Correct behaviours of group analytical queries when empty input given
dongjoon-hyun commented on issue #27233: [SPARK-29701][SQL] Correct behaviours of group analytical queries when empty input given
URL: https://github.com/apache/spark/pull/27233#issuecomment-577950755

@maropu . I use Docker. For example, the following is the `Oracle` Docker case.

```bash
~$ docker run --name oracle -d -p 8080:8080 -p 1521:1521 store/oracle/database-enterprise:12.2.0.1
~$ docker exec -it oracle /bin/bash
[oracle@e24398d6fa62 /]$ cd $ORACLE_HOME
[oracle@e24398d6fa62 dbhome_1]$ bin/sqlplus / as sysdba

SQL*Plus: Release 12.2.0.1.0 Production on Fri Jan 24 01:07:04 2020

Copyright (c) 1982, 2016, Oracle. All rights reserved.

Connected to:
Oracle Database 12c Enterprise Edition Release 12.2.0.1.0 - 64bit Production

SQL> create table gstest_empty (a integer, b integer, v integer);

Table created.

SQL> select a, b, sum(v), count(*) from gstest_empty group by grouping sets ((a,b),());

no rows selected
```

MsSQL is the same.

```bash
$ docker run --name mssql -e 'ACCEPT_EULA=Y' -e 'SA_PASSWORD=Sapass123' -p 1433:1433 -d mcr.microsoft.com/mssql/server:2017-GA-ubuntu
$ docker exec -it mssql /opt/mssql-tools/bin/sqlcmd -S localhost -U sa -P "Sapass123"
```
[GitHub] [spark] dongjoon-hyun commented on issue #27233: [SPARK-29701][SQL] Correct behaviours of group analytical queries when empty input given
dongjoon-hyun commented on issue #27233: [SPARK-29701][SQL] Correct behaviours of group analytical queries when empty input given
URL: https://github.com/apache/spark/pull/27233#issuecomment-577882809

Thanks, @tgravescs . I'm open to all advice, but is this the same as the "this is not a regression" argument? The reason I asked the community is to set a clearer boundary for correctness and data-loss issues.
[GitHub] [spark] dongjoon-hyun commented on issue #27233: [SPARK-29701][SQL] Correct behaviours of group analytical queries when empty input given
dongjoon-hyun commented on issue #27233: [SPARK-29701][SQL] Correct behaviours of group analytical queries when empty input given
URL: https://github.com/apache/spark/pull/27233#issuecomment-577490971

Hi, @maropu . Could you update the PR description by adding the DBMS comparison comments? @tgravescs 's comment raises a good point, and many people will ask the same question again.
[GitHub] [spark] dongjoon-hyun commented on issue #27233: [SPARK-29701][SQL] Correct behaviours of group analytical queries when empty input given
dongjoon-hyun commented on issue #27233: [SPARK-29701][SQL] Correct behaviours of group analytical queries when empty input given
URL: https://github.com/apache/spark/pull/27233#issuecomment-577487224

@tgravescs . This is indeed a SQL correctness issue. Technically, `PostgreSQL` is the only one that implements it according to the SQL standard. `SQL:1999` defines `()` as the grand total, and the query is translated roughly into the following.

```sql
... (other clause)
UNION ALL
SELECT CAST(NULL AS DTPC1) AS CNPC1, CAST(NULL AS DTPC2) AS CNPC2, ... -- grand total
FROM ...
```
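To make the `UNION ALL` translation concrete, here is a minimal sketch that writes the expansion of `GROUP BY GROUPING SETS ((a, b), ())` by hand and runs it on an empty table. It assumes Python's built-in `sqlite3` as a stand-in engine (SQLite has no `GROUPING SETS` syntax, which is why the expansion is spelled out); the table name `gstest_empty` is borrowed from the Oracle example in this thread, and the SQL is an illustration rather than the standard's exact wording.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gstest_empty (a INTEGER, b INTEGER, v INTEGER)")

# Hand-written expansion: one branch per grouping set, joined by UNION ALL.
# The second branch is the grand total `()`: grouping columns become typed NULLs
# and the aggregates run over the whole input with no GROUP BY.
expanded = """
SELECT a, b, SUM(v), COUNT(*) FROM gstest_empty GROUP BY a, b
UNION ALL
SELECT CAST(NULL AS INTEGER), CAST(NULL AS INTEGER), SUM(v), COUNT(*)
FROM gstest_empty
"""

rows = conn.execute(expanded).fetchall()
print(rows)  # [(None, None, None, 0)]
```

Under this expansion the `(a, b)` branch contributes no rows for empty input, while the grand-total branch is a scalar aggregate and therefore still contributes one row.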
[GitHub] [spark] dongjoon-hyun commented on issue #27233: [SPARK-29701][SQL] Correct behaviours of group analytical queries when empty input given
dongjoon-hyun commented on issue #27233: [SPARK-29701][SQL] Correct behaviours of group analytical queries when empty input given
URL: https://github.com/apache/spark/pull/27233#issuecomment-577465815

Thanks for the confirmation, @gatorsmile . However, IIRC, `regression or not` is not a valid reason when it comes to `correctness` or `data-loss` issues.
[GitHub] [spark] dongjoon-hyun commented on issue #27233: [SPARK-29701][SQL] Correct behaviours of group analytical queries when empty input given
dongjoon-hyun commented on issue #27233: [SPARK-29701][SQL] Correct behaviours of group analytical queries when empty input given
URL: https://github.com/apache/spark/pull/27233#issuecomment-576999456

Thank you, @maropu . And, ping @gatorsmile again since he cast a -1 last time.
[GitHub] [spark] dongjoon-hyun commented on issue #27233: [SPARK-29701][SQL] Correct behaviours of group analytical queries when empty input given
dongjoon-hyun commented on issue #27233: [SPARK-29701][SQL] Correct behaviours of group analytical queries when empty input given
URL: https://github.com/apache/spark/pull/27233#issuecomment-576573046

What do you think about this, @gatorsmile and @srowen ? Given the situation of 3.0.0 and 2.4.5, I believe we need to spend some time building a new consensus in the community because this has always been debatable. If someone still says `no` regarding correctness, or if this is merged to `master` during the RC2 vote, RC2 will fail again because we follow the written Apache Spark policy. I'm +1 for @cloud-fan 's suggestion.
[GitHub] [spark] dongjoon-hyun commented on issue #27233: [SPARK-29701][SQL] Correct behaviours of group analytical queries when empty input given
dongjoon-hyun commented on issue #27233: [SPARK-29701][SQL] Correct behaviours of group analytical queries when empty input given
URL: https://github.com/apache/spark/pull/27233#issuecomment-576513755

Thank you for continuing to work on this. I sent an email about the current status of RC2. I'll hold RC2 until we decide what to do with this PR.
[GitHub] [spark] dongjoon-hyun commented on issue #27233: [SPARK-29701][SQL] Correct behaviours of group analytical queries when empty input given
dongjoon-hyun commented on issue #27233: [SPARK-29701][SQL] Correct behaviours of group analytical queries when empty input given
URL: https://github.com/apache/spark/pull/27233#issuecomment-575439017

cc @gatorsmile and @cloud-fan