[GitHub] [spark] dongjoon-hyun edited a comment on issue #27233: [SPARK-29701][SQL] Correct behaviours of group analytical queries when empty input given

2020-01-24 Thread GitBox
dongjoon-hyun edited a comment on issue #27233: [SPARK-29701][SQL] Correct 
behaviours of group analytical queries when empty input given
URL: https://github.com/apache/spark/pull/27233#issuecomment-578350299
 
 
   When you run a `SELECT` query, how would you feel if the result set were 
sometimes missing one row?
   > To me correctness means you get the wrong data back out
   
   Regarding the following statement, do you find the query below 
counter-intuitive? SQL should return `0` for `COUNT(*)` when there are no rows. 
If you feel that's correct, why not for the `GROUPING SET grand total`?
   > the 0 to me seems counter intuitive anyway
   ```
   spark-sql> select sum(a), count(*) from (select 1 a where false);
   NULL 0
   ```
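To make the distinction concrete, here is a minimal sketch using Python's built-in `sqlite3` module as a stand-in engine (an assumption for illustration only; it is not Spark, and the table name `t` is hypothetical). It shows that a global aggregate over empty input yields one row, while a grouped aggregate yields none.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER)")  # hypothetical table, left empty on purpose

# Global aggregate (no GROUP BY): exactly one row, even on empty input.
global_rows = conn.execute("SELECT sum(a), count(*) FROM t").fetchall()

# Grouped aggregate: there are no groups, so there are no rows.
grouped_rows = conn.execute(
    "SELECT a, sum(a), count(*) FROM t GROUP BY a"
).fetchall()

print(global_rows)   # [(None, 0)]
print(grouped_rows)  # []
```

The grand total `()` of a `GROUPING SETS` query corresponds to the first case, which is why a row is expected for it.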
   
   We are discussing this now because of the following.
   > If there are other people that disagree then it obviously needs 
discussion, 
   
   `GROUPING SETS` is commonly used in analytics queries (including A/B 
testing). You may not hit this issue if your workload doesn't have such 
queries. I understand your situation, but please don't overlook other people's 
situations. We do have such queries.
   >  if someone can give me a concrete example that this caused them $$ lost 
that might change my mind.
   
   As for $$, as you can see in this PR, by definition, if Spark doesn't return a 
`grand total`, we need to run a separate query to get the value in this step. 
That means we need to check whether the grand total row exists and maintain a 
full, complex query just for this one missing `() /*grand total*/`. That is a 
maintenance cost. (I won't argue about the computing resource cost because the 
current implementation of this PR is not optimized yet.)
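The client-side workaround described above might look roughly like the sketch below. It is an illustration under stated assumptions, not code from this PR: `with_grand_total` and `GRAND_TOTAL_SQL` are hypothetical names, SQLite stands in for the engine, and a row whose grouping columns are both `NULL` is assumed to be the grand total (a real query would use `GROUPING()` to distinguish it from `NULL` keys in the data).

```python
import sqlite3

# Second query that must be maintained only to fill in the missing `()` row.
GRAND_TOTAL_SQL = "SELECT NULL, NULL, sum(v), count(*) FROM gstest_empty"


def with_grand_total(rows, conn):
    """Append the grand-total row when the engine did not return one.

    Assumption: a row whose grouping columns (a, b) are both NULL is the
    grand total.
    """
    has_total = any(a is None and b is None for a, b, _sum, _cnt in rows)
    if has_total:
        return list(rows)
    return list(rows) + [conn.execute(GRAND_TOTAL_SQL).fetchone()]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gstest_empty (a INTEGER, b INTEGER, v INTEGER)")

# What an engine that omits the grand total returns on empty input.
engine_rows = []
patched = with_grand_total(engine_rows, conn)
print(patched)  # [(None, None, None, 0)]
```

Every report that relies on the grand total would need this extra check and extra query, which is the maintenance cost in question.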
   
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun edited a comment on issue #27233: [SPARK-29701][SQL] Correct behaviours of group analytical queries when empty input given

2020-01-24 Thread GitBox
dongjoon-hyun edited a comment on issue #27233: [SPARK-29701][SQL] Correct 
behaviours of group analytical queries when empty input given
URL: https://github.com/apache/spark/pull/27233#issuecomment-578350299
 
 
   When you run a `SELECT` query, how would you feel if the result set were 
sometimes missing one row?
   > To me correctness means you get the wrong data back out
   
   Regarding the following statement, do you find the query below 
counter-intuitive? SQL should return `0` for `COUNT(*)` when there are no rows. 
If you feel that's correct, why not for the `GROUPING SET grand total`?
   > the 0 to me seems counter intuitive anyway
   ```
   spark-sql> select sum(a), count(*) from (select 1 a where false);
   NULL 0
   ```
   
   We are discussing this now because of the following.
   > If there are other people that disagree then it obviously needs 
discussion, 
   
   `GROUPING SETS` is commonly used in analytics queries (including A/B 
testing). You may not hit this issue if your workload isn't like that. I 
understand your situation, but please don't overlook other people's 
situations. We do have such queries.
   >  if someone can give me a concrete example that this caused them $$ lost 
that might change my mind.
   
   As for $$, as you can see in this PR, by definition, if Spark doesn't return a 
`grand total`, we need to run a separate query to get the value in this step. 
That means we need to check whether the grand total row exists and maintain a 
full, complex query just for this one missing `() /*grand total*/`. That is a 
maintenance cost. (I won't argue about the computing resource cost because the 
current implementation of this PR is not optimized yet.)
   
   





[GitHub] [spark] dongjoon-hyun edited a comment on issue #27233: [SPARK-29701][SQL] Correct behaviours of group analytical queries when empty input given

2020-01-24 Thread GitBox
dongjoon-hyun edited a comment on issue #27233: [SPARK-29701][SQL] Correct 
behaviours of group analytical queries when empty input given
URL: https://github.com/apache/spark/pull/27233#issuecomment-578350299
 
 
   When you run a `SELECT` query, how would you feel if the result set were 
always missing one row?
   > To me correctness means you get the wrong data back out
   
   Regarding the following statement, do you find the query below 
counter-intuitive? SQL should return `0` for `COUNT(*)` when there are no rows. 
If you feel that's correct, why not for the `GROUPING SET grand total`?
   > the 0 to me seems counter intuitive anyway
   ```
   spark-sql> select sum(a), count(*) from (select 1 a where false);
   NULL 0
   ```
   
   We are discussing this now because of the following.
   > If there are other people that disagree then it obviously needs 
discussion, 
   
   `GROUPING SETS` is commonly used in analytics queries (including A/B 
testing). You may not hit this issue if your workload isn't like that. I 
understand your situation, but please don't overlook other people's 
situations. We do have such queries.
   >  if someone can give me a concrete example that this caused them $$ lost 
that might change my mind.
   
   As for $$, as you can see in this PR, by definition, if Spark doesn't return a 
`grand total`, we need to run a separate query to get the value in this step. 
That means we need to check whether the grand total row exists and maintain a 
full, complex query just for this one missing `() /*grand total*/`. That is a 
maintenance cost. (I won't argue about the computing resource cost because the 
current implementation of this PR is not optimized yet.)
   
   





[GitHub] [spark] dongjoon-hyun edited a comment on issue #27233: [SPARK-29701][SQL] Correct behaviours of group analytical queries when empty input given

2020-01-22 Thread GitBox
dongjoon-hyun edited a comment on issue #27233: [SPARK-29701][SQL] Correct 
behaviours of group analytical queries when empty input given
URL: https://github.com/apache/spark/pull/27233#issuecomment-577490971
 
 
   Hi, @maropu.
   Could you update the PR description by adding the DBMS comparison comments 
(`MySQL/MsSQL/Oracle/PostgreSQL/Presto`)?
   @tgravescs's comment makes a good point, and many people will ask the same 
question again.





[GitHub] [spark] dongjoon-hyun edited a comment on issue #27233: [SPARK-29701][SQL] Correct behaviours of group analytical queries when empty input given

2020-01-22 Thread GitBox
dongjoon-hyun edited a comment on issue #27233: [SPARK-29701][SQL] Correct 
behaviours of group analytical queries when empty input given
URL: https://github.com/apache/spark/pull/27233#issuecomment-577487224
 
 
   @tgravescs. Yes, it does. However, this really is a **SQL correctness** 
issue.
   
   Technically, among them, `PostgreSQL` is the only one that implements it 
according to the SQL standard. `SQL:1999` defines `()` as the `grand total`, and 
the query is roughly translated into the following.
   
   ```sql
   ... (other clause)
   UNION ALL
   SELECT CAST(NULL AS DTPC1) AS CNPC1, CAST(NULL AS DTPC2) AS CNPC2, ...  -- grand total
   FROM ...
   ```
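The expansion above can be tried out on an empty table. The following is a minimal sketch that assumes SQLite (via Python's built-in `sqlite3`) as a stand-in engine, since it has no `GROUPING SETS` syntax and the `UNION ALL` form has to be written by hand; the `gstest_empty` schema is the one used in the test case, and `INTEGER` stands in for the placeholder types `DTPC1` and `DTPC2`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gstest_empty (a INTEGER, b INTEGER, v INTEGER)")

# Hand-written equivalent of GROUP BY GROUPING SETS ((a, b), ()):
# the first arm is the (a, b) grouping, the second arm is the grand total.
expanded = """
SELECT a, b, sum(v), count(*) FROM gstest_empty GROUP BY a, b
UNION ALL
SELECT CAST(NULL AS INTEGER), CAST(NULL AS INTEGER), sum(v), count(*)
FROM gstest_empty
"""

rows = conn.execute(expanded).fetchall()
print(rows)  # [(None, None, None, 0)]
```

The `(a, b)` arm contributes no rows on empty input, while the grand-total arm is a global aggregate and still contributes one row.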
   
   The test case in this PR description has been used by the PostgreSQL 
community since its `Support GROUPING SETS, CUBE and ROLLUP` commit (May 15, 
2015):
   - https://github.com/postgres/postgres/commit/f3d3118532175541a9a96ed78881a3b04a057128#diff-3472ecc50256022c66e79b1aad4075d2R78
   
   `Presto`, another DBMS, supports this in the same way as `PostgreSQL`.
   ```
   presto:default> create table gstest_empty (a integer, b integer, v integer);
   CREATE TABLE
   presto:default> select a, b, sum(v), count(*) from gstest_empty group by grouping sets ((a,b),());
     a   |  b   | _col2 | _col3
   ------+------+-------+-------
    NULL | NULL |  NULL |     0
   (1 row)
   ```





[GitHub] [spark] dongjoon-hyun edited a comment on issue #27233: [SPARK-29701][SQL] Correct behaviours of group analytical queries when empty input given

2020-01-22 Thread GitBox
dongjoon-hyun edited a comment on issue #27233: [SPARK-29701][SQL] Correct 
behaviours of group analytical queries when empty input given
URL: https://github.com/apache/spark/pull/27233#issuecomment-577487224
 
 
   @tgravescs. Yes, it does. However, this really is a **SQL correctness** 
issue.
   
   Technically, `PostgreSQL` is the only one that implements it according to the 
SQL standard. `SQL:1999` defines `()` as the `grand total`, and the query is 
roughly translated into the following.
   
   ```sql
   ... (other clause)
   UNION ALL
   SELECT CAST(NULL AS DTPC1) AS CNPC1, CAST(NULL AS DTPC2) AS CNPC2, ...  -- grand total
   FROM ...
   ```
   
   The test case in this PR description has been used by the PostgreSQL 
community since its `Support GROUPING SETS, CUBE and ROLLUP` commit (May 15, 
2015):
   - https://github.com/postgres/postgres/commit/f3d3118532175541a9a96ed78881a3b04a057128#diff-3472ecc50256022c66e79b1aad4075d2R78





[GitHub] [spark] dongjoon-hyun edited a comment on issue #27233: [SPARK-29701][SQL] Correct behaviours of group analytical queries when empty input given

2020-01-22 Thread GitBox
dongjoon-hyun edited a comment on issue #27233: [SPARK-29701][SQL] Correct 
behaviours of group analytical queries when empty input given
URL: https://github.com/apache/spark/pull/27233#issuecomment-577487224
 
 
   @tgravescs. Yes, it does. However, this really is a **SQL correctness** 
issue.
   
   Technically, `PostgreSQL` is the only one that implements it according to the 
SQL standard. `SQL:1999` defines `()` as the grand total, and the query is 
roughly translated into the following.
   
   ```sql
   ... (other clause)
   UNION ALL
   SELECT CAST(NULL AS DTPC1) AS CNPC1, CAST(NULL AS DTPC2) AS CNPC2, ...  -- grand total
   FROM ...
   ```
   
   The test case in this PR description has been used by the PostgreSQL 
community since its `Support GROUPING SETS, CUBE and ROLLUP` commit (May 15, 
2015):
   - https://github.com/postgres/postgres/commit/f3d3118532175541a9a96ed78881a3b04a057128#diff-3472ecc50256022c66e79b1aad4075d2R78





[GitHub] [spark] dongjoon-hyun edited a comment on issue #27233: [SPARK-29701][SQL] Correct behaviours of group analytical queries when empty input given

2020-01-22 Thread GitBox
dongjoon-hyun edited a comment on issue #27233: [SPARK-29701][SQL] Correct 
behaviours of group analytical queries when empty input given
URL: https://github.com/apache/spark/pull/27233#issuecomment-577487224
 
 
   @tgravescs. Yes, it does. However, this really is a SQL correctness 
issue.
   
   Technically, `PostgreSQL` is the only one that implements it according to the 
SQL standard. `SQL:1999` defines `()` as the grand total, and the query is 
roughly translated into the following.
   
   ```sql
   ... (other clause)
   UNION ALL
   SELECT CAST(NULL AS DTPC1) AS CNPC1, CAST(NULL AS DTPC2) AS CNPC2, ...  -- grand total
   FROM ...
   ```
   
   The test case in this PR description has been used by the PostgreSQL 
community since its `Support GROUPING SETS, CUBE and ROLLUP` commit (May 15, 
2015):
   - https://github.com/postgres/postgres/commit/f3d3118532175541a9a96ed78881a3b04a057128#diff-3472ecc50256022c66e79b1aad4075d2R78





[GitHub] [spark] dongjoon-hyun edited a comment on issue #27233: [SPARK-29701][SQL] Correct behaviours of group analytical queries when empty input given

2020-01-22 Thread GitBox
dongjoon-hyun edited a comment on issue #27233: [SPARK-29701][SQL] Correct 
behaviours of group analytical queries when empty input given
URL: https://github.com/apache/spark/pull/27233#issuecomment-577487224
 
 
   @tgravescs. Yes, it does. However, this really is a SQL **correctness** 
issue.
   
   Technically, `PostgreSQL` is the only one that implements it according to the 
SQL standard. `SQL:1999` defines `()` as the grand total, and the query is 
roughly translated into the following.
   
   ```sql
   ... (other clause)
   UNION ALL
   SELECT CAST(NULL AS DTPC1) AS CNPC1, CAST(NULL AS DTPC2) AS CNPC2, ...  -- grand total
   FROM ...
   ```
   
   The test case in this PR description has been used by the PostgreSQL 
community since its `Support GROUPING SETS, CUBE and ROLLUP` commit (May 15, 
2015):
   - https://github.com/postgres/postgres/commit/f3d3118532175541a9a96ed78881a3b04a057128#diff-3472ecc50256022c66e79b1aad4075d2R78





[GitHub] [spark] dongjoon-hyun edited a comment on issue #27233: [SPARK-29701][SQL] Correct behaviours of group analytical queries when empty input given

2020-01-22 Thread GitBox
dongjoon-hyun edited a comment on issue #27233: [SPARK-29701][SQL] Correct 
behaviours of group analytical queries when empty input given
URL: https://github.com/apache/spark/pull/27233#issuecomment-577487224
 
 
   @tgravescs. Yes, it does. However, this really is a SQL correctness 
issue.
   
   Technically, `PostgreSQL` is the only one that implements it according to the 
SQL standard. `SQL:1999` defines `()` as the grand total, and the query is 
roughly translated into the following.
   
   ```sql
   ... (other clause)
   UNION ALL
   SELECT CAST(NULL AS DTPC1) AS CNPC1, CAST(NULL AS DTPC2) AS CNPC2, ...  -- grand total
   FROM ...
   ```
   
   This test case has been used in PostgreSQL since its `Support GROUPING SETS, 
CUBE and ROLLUP` commit (May 15, 2015):
   - https://github.com/postgres/postgres/commit/f3d3118532175541a9a96ed78881a3b04a057128#diff-3472ecc50256022c66e79b1aad4075d2R78





[GitHub] [spark] dongjoon-hyun edited a comment on issue #27233: [SPARK-29701][SQL] Correct behaviours of group analytical queries when empty input given

2020-01-22 Thread GitBox
dongjoon-hyun edited a comment on issue #27233: [SPARK-29701][SQL] Correct 
behaviours of group analytical queries when empty input given
URL: https://github.com/apache/spark/pull/27233#issuecomment-577487224
 
 
   @tgravescs. Yes, it does. However, this really is a SQL correctness 
issue.
   
   Technically, `PostgreSQL` is the only one that implements it according to the 
SQL standard. `SQL:1999` defines `()` as the grand total, and the query is 
roughly translated into the following.
   
   ```sql
   ... (other clause)
   UNION ALL
   SELECT CAST(NULL AS DTPC1) AS CNPC1, CAST(NULL AS DTPC2) AS CNPC2, ...  -- grand total
   FROM ...
   ```
   
   The test case in this PR description has been used in PostgreSQL since its 
`Support GROUPING SETS, CUBE and ROLLUP` commit (May 15, 2015):
   - https://github.com/postgres/postgres/commit/f3d3118532175541a9a96ed78881a3b04a057128#diff-3472ecc50256022c66e79b1aad4075d2R78





[GitHub] [spark] dongjoon-hyun edited a comment on issue #27233: [SPARK-29701][SQL] Correct behaviours of group analytical queries when empty input given

2020-01-22 Thread GitBox
dongjoon-hyun edited a comment on issue #27233: [SPARK-29701][SQL] Correct 
behaviours of group analytical queries when empty input given
URL: https://github.com/apache/spark/pull/27233#issuecomment-577487224
 
 
   @tgravescs. Yes, it does. However, this really is a SQL correctness 
issue.
   
   Technically, `PostgreSQL` is the only one that implements it according to the 
SQL standard. `SQL:1999` defines `()` as the grand total, and the query is 
roughly translated into the following.
   
   ```sql
   ... (other clause)
   UNION ALL
   SELECT CAST(NULL AS DTPC1) AS CNPC1, CAST(NULL AS DTPC2) AS CNPC2, ...  -- grand total
   FROM ...
   ```





[GitHub] [spark] dongjoon-hyun edited a comment on issue #27233: [SPARK-29701][SQL] Correct behaviours of group analytical queries when empty input given

2020-01-20 Thread GitBox
dongjoon-hyun edited a comment on issue #27233: [SPARK-29701][SQL] Correct 
behaviours of group analytical queries when empty input given
URL: https://github.com/apache/spark/pull/27233#issuecomment-576513755
 
 
   Thank you for continuing to work on this. I sent an email (to dev@spark) 
about the current status of RC2. I'll hold RC2 until we decide what to do with 
this PR.





[GitHub] [spark] dongjoon-hyun edited a comment on issue #27233: [SPARK-29701][SQL] Correct behaviours of group analytical queries when empty input given

2020-01-20 Thread GitBox
dongjoon-hyun edited a comment on issue #27233: [SPARK-29701][SQL] Correct 
behaviours of group analytical queries when empty input given
URL: https://github.com/apache/spark/pull/27233#issuecomment-576513755
 
 
   Thank you for continuing to work on this. I sent an email (to dev@spark) 
about the current status of RC2. I'll hold RC2 until we decide what to do with 
this PR.

