[GitHub] [spark] cloud-fan commented on issue #27627: [WIP][SPARK-28067][SQL] Fix incorrect results for decimal aggregate sum by returning null on decimal overflow

2020-04-23 Thread GitBox


cloud-fan commented on issue #27627:
URL: https://github.com/apache/spark/pull/27627#issuecomment-618337440


   Then why can't we use a simple `isEmpty` flag? Nulls are skipped, so the 
`isEmpty` flag stays true if the inputs are all null.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on issue #27627: [WIP][SPARK-28067][SQL] Fix incorrect results for decimal aggregate sum by returning null on decimal overflow

2020-04-17 Thread GitBox
URL: https://github.com/apache/spark/pull/27627#issuecomment-615065652
 
 
   @skambha check the result in PostgreSQL:
   ```
   cloud0fan=# create table t(d decimal(20, 10));
   CREATE TABLE
   cloud0fan=# insert into t values (1.1);
   INSERT 0 1
   cloud0fan=# insert into t values (2.2);
   INSERT 0 1
   cloud0fan=# insert into t values (null);
   INSERT 0 1
   cloud0fan=# select sum(d) from t;
    sum  
   ------
    3.30
   (1 row)
   ```
   
   Nulls should be skipped when calculating aggregate functions.
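The null-skipping semantics above can be sketched in plain Python (a hypothetical `sql_sum` helper, not Spark or PostgreSQL code): null inputs are ignored, and the result is null only when there are no non-null inputs at all.

```python
from decimal import Decimal

def sql_sum(values):
    """SQL-style SUM: skip nulls; return null (None) only if no non-null inputs."""
    non_null = [v for v in values if v is not None]
    return sum(non_null) if non_null else None

print(sql_sum([Decimal("1.1"), Decimal("2.2"), None]))  # 3.3, the null row is skipped
print(sql_sum([None, None]))                            # None, no non-null inputs
```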





[GitHub] [spark] cloud-fan commented on issue #27627: [WIP][SPARK-28067][SQL] Fix incorrect results for decimal aggregate sum by returning null on decimal overflow

2020-04-15 Thread GitBox
URL: https://github.com/apache/spark/pull/27627#issuecomment-613912451
 
 
   > So if we have 2 null rows, the expected result should be null and not zero.
   > We need a way to distinguish between valid null rows and null values that 
come because of an overflow as well.
   
   AFAIK most aggregate functions just skip null values, and `SUM` only has a 
null value if: 1. overflow happens, or 2. there are no inputs.
   
   It's much easier to set the `isEmpty` flag, but hard to set the `isOverflow` 
flag (you added 2 new expressions for it).





[GitHub] [spark] cloud-fan commented on issue #27627: [WIP][SPARK-28067][SQL] Fix incorrect results for decimal aggregate sum by returning null on decimal overflow

2020-04-08 Thread GitBox
URL: https://github.com/apache/spark/pull/27627#issuecomment-611308104
 
 
   > A function implementor would need to know to use DecimalSum instead of Sum 
and special case it.
   
   My main point is to avoid a perf regression for other types. We can also put 
everything in `Sum` and use `if (is decimal type)` to generate different 
update/merge expressions for decimal versus other types.





[GitHub] [spark] cloud-fan commented on issue #27627: [WIP][SPARK-28067][SQL] Fix incorrect results for decimal aggregate sum by returning null on decimal overflow

2020-03-11 Thread GitBox
URL: https://github.com/apache/spark/pull/27627#issuecomment-598002388
 
 
   > Is the expectation that the user would now call this from their queries 
for their decimal sum operations?
   
   No, it's just a new expression. The analyzer is responsible for converting 
`Sum` with decimal-type input to `DecimalSum`.
   
   > Won't we also need to check for overflow in here as well.
   
   Ah yes, thanks for catching it!





[GitHub] [spark] cloud-fan commented on issue #27627: [WIP][SPARK-28067][SQL] Fix incorrect results for decimal aggregate sum by returning null on decimal overflow

2020-03-10 Thread GitBox
URL: https://github.com/apache/spark/pull/27627#issuecomment-597440834
 
 
   @skambha great analysis!
   
   I agree with you that we need another boolean flag in the sum aggregate 
buffer, but I'd like to make it simpler and only change it for decimals.
   
   How about we add a new expression `DecimalSum`? In it:
   1. the buffer attributes are [sum, isEmpty]
   2. the initial value is [0, true]
   3. the `updateExpression` should:
      3.1 update `isEmpty` to false
      3.2 set `sum` to null if it overflowed
      3.3 do nothing if `sum` is already null
   4. the `mergeExpression` should:
      4.1 update `isEmpty` to false
      4.2 if the input buffer's `isEmpty` is true, keep `sum` unchanged
      4.3 if the input buffer's `isEmpty` is false but its `sum` is null, set its own `sum` to null
      4.4 do nothing if `sum` is already null
      4.5 otherwise, add the input buffer's `sum`
   5. the `evaluateExpression` should:
      5.1 output null if `isEmpty` is true
      5.2 fail if `sum` is null and ANSI mode is on
      5.3 otherwise, output the sum
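The five steps above can be sketched as executable pseudocode. This is a hand-written Python model of the proposed buffer logic, not Spark's actual expressions, and the 38-digit overflow check is a simplification of Spark's `Decimal(38, 18)` bounds:

```python
from decimal import Decimal, getcontext

getcontext().prec = 60  # high enough that intermediate sums stay exact

MAX_PRECISION = 38
LIMIT = Decimal(10) ** MAX_PRECISION  # simplified overflow bound

def check_overflow(d):
    """Return d, or None (SQL null) once it exceeds the precision bound."""
    return d if abs(d) < LIMIT else None

class DecimalSum:
    def __init__(self):
        self.sum = Decimal(0)   # buffer attribute 1
        self.is_empty = True    # buffer attribute 2

    def update(self, value):    # step 3
        if value is None:       # null inputs are skipped entirely
            return
        self.is_empty = False                        # 3.1
        if self.sum is None:                         # 3.3: already overflowed
            return
        self.sum = check_overflow(self.sum + value)  # 3.2

    def merge(self, other):     # step 4
        if other.is_empty:                           # 4.2: keep sum unchanged
            return
        self.is_empty = False                        # 4.1
        if self.sum is None:                         # 4.4: already null
            return
        if other.sum is None:                        # 4.3: propagate the overflow
            self.sum = None
            return
        self.sum = check_overflow(self.sum + other.sum)  # 4.5

    def evaluate(self, ansi=False):  # step 5
        if self.is_empty:                            # 5.1
            return None
        if self.sum is None and ansi:                # 5.2
            raise ArithmeticError("decimal sum overflow")
        return self.sum                              # 5.3
```

With this model, a buffer that only ever saw nulls evaluates to null via `isEmpty`, while a buffer that overflowed evaluates to null (or fails under ANSI mode) via the null `sum`, so the two cases are no longer conflated.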





[GitHub] [spark] cloud-fan commented on issue #27627: [WIP][SPARK-28067][SQL] Fix incorrect results for decimal aggregate sum by returning null on decimal overflow

2020-03-10 Thread GitBox
URL: https://github.com/apache/spark/pull/27627#issuecomment-596918804
 
 
   @skambha thanks for your investigation! Very helpful.
   
   I'm trying to determine the scope of this bug. Your analysis covers global 
aggregate (no GROUP BY), where we only write out the sum value at the end of 
the partial aggregate. I tried with GROUP BY:
   ```
   df.select(expr(s"cast('$decimalStr' as decimal (38, 18)) as d"), lit(1).as("key"))
     .groupBy("key").agg(sum($"d")).show
   java.lang.ArithmeticException: Decimal precision 39 exceeds max precision 38
     at org.apache.spark.sql.types.Decimal.set(Decimal.scala:122)
     at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:574)
     at org.apache.spark.sql.types.Decimal.apply(Decimal.scala)
     at org.apache.spark.sql.catalyst.expressions.UnsafeRow.getDecimal(UnsafeRow.java:393)
     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doConsume_0$(Unknown Source)
     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithKeys_1$(Unknown Source)
     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithKeys_0$(Unknown Source)
     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
   ...
   ```
   
   Do you know what happens there? I'd expect the overflowed decimal to be 
written as null to the aggregate hash map, but that doesn't seem to be the case.





[GitHub] [spark] cloud-fan commented on issue #27627: [WIP][SPARK-28067][SQL] Fix incorrect results for decimal aggregate sum by returning null on decimal overflow

2020-03-05 Thread GitBox
URL: https://github.com/apache/spark/pull/27627#issuecomment-595600181
 
 
   > Addition of two decimal values (expression coming from sum) that results 
in a value that cannot be contained.
   
   You can try `Decimal.+(Decimal)` locally; it does return a value that is not 
null. We can't hold an overflowed decimal value in an unsafe row, but a `Decimal` 
object can be temporarily overflowed.
   
   In my repro, `spark.range(0, 12, 1, 1)` works fine and `spark.range(0, 1, 1, 
1).union(spark.range(0, 11, 1, 1))` gives a wrong result. I looked at the code 
again, and whole-stage codegen also stores the partial aggregate result in an 
unsafe row. Can someone investigate further and see why `spark.range(0, 12, 
1, 1)` works?
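The distinction can be sketched with Python's `decimal` module (a loose analogy, not Spark's `Decimal` class): an in-memory value may exceed the 38-digit storage bound, and the failure only surfaces when you try to store it into a fixed-precision slot, the way an unsafe row field would.

```python
from decimal import Decimal, getcontext

getcontext().prec = 60  # keep intermediate arithmetic exact

MAX_PRECISION = 38  # stored decimals are capped at 38 digits of precision

def store(d):
    """Model writing into a fixed-precision slot such as an unsafe row field."""
    if abs(d) >= Decimal(10) ** MAX_PRECISION:
        raise ArithmeticError(f"value exceeds max precision {MAX_PRECISION}")
    return d

a = Decimal(9) * Decimal(10) ** 37   # 9E+37: fits within 38 digits
b = a + a                            # 1.8E+38: fine as an in-memory value...
store(a)                             # ...and a still fits in a slot
# store(b) would raise: 1.8E+38 needs 39 digits of precision
```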





[GitHub] [spark] cloud-fan commented on issue #27627: [WIP][SPARK-28067][SQL] Fix incorrect results for decimal aggregate sum by returning null on decimal overflow

2020-03-05 Thread GitBox
URL: https://github.com/apache/spark/pull/27627#issuecomment-595330926
 
 
   Note: the semantics of most aggregate functions is to skip nulls. So sum 
skips null inputs, and overflow doesn't return null (this is also true for 
decimals), so `updateExpressions` is fine.





[GitHub] [spark] cloud-fan commented on issue #27627: [WIP][SPARK-28067][SQL] Fix incorrect results for decimal aggregate sum by returning null on decimal overflow

2020-03-05 Thread GitBox
URL: https://github.com/apache/spark/pull/27627#issuecomment-595319742
 
 
   > looks like the overflow happens during partial aggregate 
(updateExpressions and mergeExpressions)
   
   The partial aggregate runs `updateExpressions` and produces a buffer; the 
final aggregate runs `mergeExpressions` to merge buffers. The problem here is 
that `mergeExpressions` treats null as 0.
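A minimal sketch of that failure mode (hypothetical Python, roughly modeling the pre-fix merge rather than Spark's actual generated code): coalescing a null partial sum to 0 silently drops the overflow signal.

```python
# Each partition ships a partial sum; None stands for SQL null.
# The first partition overflowed, so its partial sum is None.
partials = [None, 10 ** 19]

# Pre-fix merge: a null buffer is treated as "nothing to add", i.e. 0.
total = 0
for p in partials:
    total += p if p is not None else 0  # the overflow signal is lost here

# total is now 10**19: a finite (wrong) answer instead of null.
```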





[GitHub] [spark] cloud-fan commented on issue #27627: [WIP][SPARK-28067][SQL] Fix incorrect results for decimal aggregate sum by returning null on decimal overflow

2020-03-05 Thread GitBox
URL: https://github.com/apache/spark/pull/27627#issuecomment-595112843
 
 
   Ideally, under ANSI mode, decimal sum should also fail if it overflows. It's 
hard to fail on the partial aggregate side, as we don't have a chance to check 
for overflow before shuffling the aggregate buffer row. We can fail on the final 
aggregate side: if the value is null and `isEmpty` is false, fail with the 
overflow exception.





[GitHub] [spark] cloud-fan commented on issue #27627: [WIP][SPARK-28067][SQL] Fix incorrect results for decimal aggregate sum by returning null on decimal overflow

2020-03-04 Thread GitBox
URL: https://github.com/apache/spark/pull/27627#issuecomment-595067684
 
 
   cc @viirya @maropu as well





[GitHub] [spark] cloud-fan commented on issue #27627: [WIP][SPARK-28067][SQL] Fix incorrect results for decimal aggregate sum by returning null on decimal overflow

2020-03-04 Thread GitBox
URL: https://github.com/apache/spark/pull/27627#issuecomment-595067541
 
 
   Found a way to reproduce without a join:
   ```
   scala> val decimalStr = "1" + "0" * 19
   decimalStr: String = 10000000000000000000
   
   scala> val df = spark.range(0, 12, 1, 1)
   df: org.apache.spark.sql.Dataset[Long] = [id: bigint]
   
   scala> df.select(expr(s"cast('$decimalStr' as decimal (38, 18)) as d")).agg(sum($"d")).show
   // This is correct
   +------+
   |sum(d)|
   +------+
   |  null|
   +------+
   
   scala> val df = spark.range(0, 1, 1, 1).union(spark.range(0, 11, 1, 1))
   df: org.apache.spark.sql.Dataset[Long] = [id: bigint]
   
   scala> df.select(expr(s"cast('$decimalStr' as decimal (38, 18)) as d")).agg(sum($"d")).show
   // This is wrong
   +--------------------+
   |              sum(d)|
   +--------------------+
   |                1...|
   +--------------------+
   ```
   
   I think the root cause is: `sum` in the partial aggregate overflows and 
writes null to the unsafe row. `sum` in the final aggregate reads that null from 
the unsafe row, mistakenly thinks it's caused by empty data, and converts it to 0.
   
   We should create a `DecimalSum`, which uses 2 buffer attributes: `sum` and 
`isEmpty`. Then in the final aggregate we can check the `isEmpty` flag to know if 
the null was caused by overflow or by empty data.
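A sketch of the proposed final-aggregate check (hypothetical Python, assuming each partial buffer is a `(sum, isEmpty)` pair as described above): a null `sum` with `isEmpty` false means overflow, while all-empty buffers mean no data at all.

```python
def final_merge(buffers):
    """Merge (sum, is_empty) partial buffers; None models SQL null."""
    total, all_empty = 0, True
    for part_sum, is_empty in buffers:
        if is_empty:
            continue                 # empty buffer: nothing to add
        all_empty = False
        if total is None or part_sum is None:
            total = None             # null here means overflow, keep it
            continue
        total += part_sum
    return None if all_empty else total

print(final_merge([(0, True), (0, True)]))       # None: no data at all
print(final_merge([(None, False), (5, False)]))  # None: an input overflowed
print(final_merge([(3, False), (5, False)]))     # 8: normal case
```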





[GitHub] [spark] cloud-fan commented on issue #27627: [WIP][SPARK-28067][SQL] Fix incorrect results for decimal aggregate sum by returning null on decimal overflow

2020-03-04 Thread GitBox
URL: https://github.com/apache/spark/pull/27627#issuecomment-594644908
 
 
   I tried some queries locally but can't reproduce. Is it only a problem with 
a join?





[GitHub] [spark] cloud-fan commented on issue #27627: [WIP][SPARK-28067][SQL] Fix incorrect results for decimal aggregate sum by returning null on decimal overflow

2020-03-03 Thread GitBox
URL: https://github.com/apache/spark/pull/27627#issuecomment-594010285
 
 
   > In case of whole-stage codegen, you can see the decimal '+' expressions 
will at some point be larger than what can be contained in decimal(38, 18), but 
it gets written out as null. This messes up the end result of the sum and you 
get wrong results.
   
   Still a bit confused. If it gets written out as null, then null + decimal 
always returns null, and the final result is null?





[GitHub] [spark] cloud-fan commented on issue #27627: [WIP][SPARK-28067][SQL] Fix incorrect results for decimal aggregate sum by returning null on decimal overflow

2020-03-03 Thread GitBox
URL: https://github.com/apache/spark/pull/27627#issuecomment-593977066
 
 
   OK to test





[GitHub] [spark] cloud-fan commented on issue #27627: [WIP][SPARK-28067][SQL] Fix incorrect results for decimal aggregate sum by returning null on decimal overflow

2020-02-20 Thread GitBox
URL: https://github.com/apache/spark/pull/27627#issuecomment-589530921
 
 
   > Sum does not take care of the possibility of overflow in the intermediate 
steps, i.e. the updateExpressions and mergeExpressions.
   
   I'm a little confused. These expressions are used in non-whole-stage-codegen 
as well, so why does only whole-stage codegen have the problem?

