[GitHub] [spark] cloud-fan commented on issue #27627: [WIP][SPARK-28067][SQL] Fix incorrect results for decimal aggregate sum by returning null on decimal overflow
cloud-fan commented on issue #27627: URL: https://github.com/apache/spark/pull/27627#issuecomment-618337440

Then why can't we use a simple `isEmpty` flag? Nulls are skipped, so the `isEmpty` flag is true if the data is all nulls.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
cloud-fan commented on issue #27627: [WIP][SPARK-28067][SQL] Fix incorrect results for decimal aggregate sum by returning null on decimal overflow URL: https://github.com/apache/spark/pull/27627#issuecomment-615065652

@skambha check the result in PostgreSQL:

```
cloud0fan=# create table t(d decimal(20, 10));
CREATE TABLE
cloud0fan=# insert into t values (1.1);
INSERT 0 1
cloud0fan=# insert into t values (2.2);
INSERT 0 1
cloud0fan=# insert into t values (null);
INSERT 0 1
cloud0fan=# select sum(d) from t;
 sum
------
 3.30
(1 row)
```

Nulls should be skipped when calculating aggregate functions.
cloud-fan commented on issue #27627: [WIP][SPARK-28067][SQL] Fix incorrect results for decimal aggregate sum by returning null on decimal overflow URL: https://github.com/apache/spark/pull/27627#issuecomment-613912451

> So if we have 2 null rows, the expected result should be null and not zero.
> We need a way to distinguish between valid null rows and null values that come from an overflow as well.

AFAIK most aggregate functions just skip null values, and `SUM` will only produce a null value if:
1. overflow happens, or
2. there are no inputs.

It's much easier to set the `isEmpty` flag, but hard to set an `isOverflow` flag (you added 2 new expressions).
cloud-fan commented on issue #27627: [WIP][SPARK-28067][SQL] Fix incorrect results for decimal aggregate sum by returning null on decimal overflow URL: https://github.com/apache/spark/pull/27627#issuecomment-611308104

> A function implementor would need to know to use DecimalSum instead of Sum and special case it.

My main point is to avoid a perf regression for other types. We can also put everything in `Sum`, and use `if (is decimal type)` to generate different update/merge expressions for decimal and other types.
cloud-fan commented on issue #27627: [WIP][SPARK-28067][SQL] Fix incorrect results for decimal aggregate sum by returning null on decimal overflow URL: https://github.com/apache/spark/pull/27627#issuecomment-598002388

> Is the expectation that the user would now call this from their queries for their decimal sum operations?

No, it's just a new expression. The analyzer is responsible for converting `Sum` with decimal-type input to `DecimalSum`.

> Won't we also need to check for overflow in here as well.

Ah yes, thanks for catching it!
cloud-fan commented on issue #27627: [WIP][SPARK-28067][SQL] Fix incorrect results for decimal aggregate sum by returning null on decimal overflow URL: https://github.com/apache/spark/pull/27627#issuecomment-597440834

@skambha great analysis! I agree with you that we need another boolean flag in the sum aggregate buffer, but I'd like to make it simpler and only change it for decimals. How about we add a new expression `DecimalSum`? In which:

1. the buffer attributes are [sum, isEmpty]
2. the initial value is [0, true]
3. the `updateExpression` should:
   3.1 update `isEmpty` to false
   3.2 set `sum` to null if it overflows
   3.3 do nothing if `sum` is already null
4. the `mergeExpression` should:
   4.1 update `isEmpty` to false
   4.2 if the input buffer's `isEmpty` is true, keep `sum` unchanged
   4.3 if the input buffer's `isEmpty` is false but its `sum` is null, set its own `sum` to null
   4.4 do nothing if `sum` is already null
   4.5 otherwise, add the input buffer's `sum`
5. the `evaluateExpression` should:
   5.1 output null if `isEmpty` is true
   5.2 fail if `sum` is null and ANSI mode is on
   5.3 otherwise, output the sum
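A plain-Scala model may make the buffer semantics above concrete. This is an illustrative sketch, not Spark's actual Catalyst expressions: `Buffer`, `checkOverflow`, `update`, `merge`, and `evaluate` are hypothetical names, `None` stands in for a null sum, and overflow is simulated by rejecting any result wider than 38 digits of precision.

```scala
import java.math.{BigDecimal => JBigDecimal}

// Sketch of the proposed [sum, isEmpty] buffer. A null sum is None.
case class Buffer(sum: Option[JBigDecimal], isEmpty: Boolean)

val MaxPrecision = 38 // assumed max decimal precision, as in Spark

// Simulated overflow check: reject results that exceed the max precision.
def checkOverflow(d: JBigDecimal): Option[JBigDecimal] =
  if (d.precision > MaxPrecision) None else Some(d)

// 2. initial value is [0, true]
val init = Buffer(Some(JBigDecimal.ZERO), isEmpty = true)

// 3. update (null inputs are assumed to be skipped by the caller):
//    set isEmpty to false; sum becomes None on overflow and stays None.
def update(buf: Buffer, input: JBigDecimal): Buffer =
  Buffer(buf.sum.flatMap(s => checkOverflow(s.add(input))), isEmpty = false)

// 4. merge two partial buffers.
def merge(left: Buffer, right: Buffer): Buffer =
  if (right.isEmpty) left // 4.2: keep sum (and isEmpty) unchanged
  else Buffer(
    // 4.3/4.4/4.5: None from either side propagates; otherwise add.
    for { l <- left.sum; r <- right.sum; s <- checkOverflow(l.add(r)) } yield s,
    isEmpty = false) // 4.1

// 5. evaluate (non-ANSI mode: overflow surfaces as null, i.e. None).
def evaluate(buf: Buffer): Option[JBigDecimal] =
  if (buf.isEmpty) None else buf.sum
```

Note how merging two empty buffers keeps `isEmpty` true, so the null produced by empty input stays distinguishable from the null produced by overflow.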
cloud-fan commented on issue #27627: [WIP][SPARK-28067][SQL] Fix incorrect results for decimal aggregate sum by returning null on decimal overflow URL: https://github.com/apache/spark/pull/27627#issuecomment-596918804

@skambha thanks for your investigation! Very helpful. I'm trying to determine the scope of this bug. Your analysis is for global aggregate (no GROUP BY), where we only write out the sum value at the end of the partial aggregate. I tried with GROUP BY:

```
df.select(expr(s"cast('$decimalStr' as decimal (38, 18)) as d"), lit(1).as("key"))
  .groupBy("key").agg(sum($"d")).show

java.lang.ArithmeticException: Decimal precision 39 exceeds max precision 38
  at org.apache.spark.sql.types.Decimal.set(Decimal.scala:122)
  at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:574)
  at org.apache.spark.sql.types.Decimal.apply(Decimal.scala)
  at org.apache.spark.sql.catalyst.expressions.UnsafeRow.getDecimal(UnsafeRow.java:393)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doConsume_0$(Unknown Source)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithKeys_1$(Unknown Source)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithKeys_0$(Unknown Source)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
  ...
```

Do you know what happens there? I'd expect we write the overflowed decimal as null to the aggregate hash map, but it seems that's not the case.
cloud-fan commented on issue #27627: [WIP][SPARK-28067][SQL] Fix incorrect results for decimal aggregate sum by returning null on decimal overflow URL: https://github.com/apache/spark/pull/27627#issuecomment-595600181

> Addition of two decimal values (expression coming from sum) that results in a value that cannot be contained.

You can try `Decimal.+(Decimal)` locally; it does return a value that is not null. We can't hold an overflowed decimal value in an unsafe row, but a `Decimal` object can temporarily overflow. In my repro, `spark.range(0, 12, 1, 1)` works fine and `spark.range(0, 1, 1, 1).union(spark.range(0, 11, 1, 1))` gives the wrong result. I looked at the code again and whole-stage codegen also stores the partial aggregate result in an unsafe row. Can someone investigate further and see why `spark.range(0, 12, 1, 1)` works?
cloud-fan commented on issue #27627: [WIP][SPARK-28067][SQL] Fix incorrect results for decimal aggregate sum by returning null on decimal overflow URL: https://github.com/apache/spark/pull/27627#issuecomment-595330926

Note: the semantics of most aggregate functions is to skip nulls. So sum will skip null inputs, and overflow doesn't return null (also true for decimal), so `updateExpressions` is fine.
cloud-fan commented on issue #27627: [WIP][SPARK-28067][SQL] Fix incorrect results for decimal aggregate sum by returning null on decimal overflow URL: https://github.com/apache/spark/pull/27627#issuecomment-595319742

> looks like the overflow happens during partial aggregate ( updateExpressions and mergeExpressions)

The partial aggregate runs `updateExpressions` and produces a buffer; the final aggregate runs `mergeExpressions` to merge buffers. The problem here is that `mergeExpressions` treats null as 0.
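To see why treating null as 0 loses the overflow signal, here is a minimal plain-Scala illustration (the `mergeNullAsZero` name is hypothetical; `None` stands in for a null partial sum):

```scala
// Buggy merge semantics: a null (None) partial sum -- which after an
// overflow is the only marker we have -- gets coalesced to 0, so the
// overflow silently disappears from the merged result.
def mergeNullAsZero(left: Option[BigDecimal],
                    right: Option[BigDecimal]): Option[BigDecimal] =
  Some(left.getOrElse(BigDecimal(0)) + right.getOrElse(BigDecimal(0)))

val overflowed: Option[BigDecimal] = None       // this partition overflowed
val healthy: Option[BigDecimal] = Some(BigDecimal(5))

// The merged result looks like a perfectly valid sum of 5 instead of null.
val merged = mergeNullAsZero(overflowed, healthy)
```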
cloud-fan commented on issue #27627: [WIP][SPARK-28067][SQL] Fix incorrect results for decimal aggregate sum by returning null on decimal overflow URL: https://github.com/apache/spark/pull/27627#issuecomment-595112843

Ideally, under ANSI mode, decimal sum should also fail if it overflows. It's hard to fail on the partial aggregate side, as we don't have a chance to check for overflow before shuffling the aggregate buffer row. We can fail on the final aggregate side: if the value is null and `isEmpty` is false, fail with the overflow exception.
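The final-side check could look like this plain-Scala sketch (illustrative names, not Spark code; `None` stands in for a null sum):

```scala
// Final aggregate under ANSI mode: a null sum in a non-empty buffer can
// only mean overflow, so we fail instead of returning null.
def evaluateAnsi(sum: Option[BigDecimal],
                 isEmpty: Boolean): Option[BigDecimal] =
  (sum, isEmpty) match {
    case (_, true)     => None // no inputs at all: result is null
    case (None, false) => throw new ArithmeticException("Decimal sum overflows")
    case (s, false)    => s    // normal case: output the sum
  }
```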
cloud-fan commented on issue #27627: [WIP][SPARK-28067][SQL] Fix incorrect results for decimal aggregate sum by returning null on decimal overflow URL: https://github.com/apache/spark/pull/27627#issuecomment-595067684

cc @viirya @maropu as well
cloud-fan commented on issue #27627: [WIP][SPARK-28067][SQL] Fix incorrect results for decimal aggregate sum by returning null on decimal overflow URL: https://github.com/apache/spark/pull/27627#issuecomment-595067541

Found a way to reproduce without a join:

```
scala> val decimalStr = "1" + "0" * 19
decimalStr: String = 10000000000000000000

scala> val df = spark.range(0, 12, 1, 1)
df: org.apache.spark.sql.Dataset[Long] = [id: bigint]

scala> df.select(expr(s"cast('$decimalStr' as decimal (38, 18)) as d")).agg(sum($"d")).show // This is correct
+------+
|sum(d)|
+------+
|  null|
+------+

scala> val df = spark.range(0, 1, 1, 1).union(spark.range(0, 11, 1, 1))
df: org.apache.spark.sql.Dataset[Long] = [id: bigint]

scala> df.select(expr(s"cast('$decimalStr' as decimal (38, 18)) as d")).agg(sum($"d")).show // This is wrong
+------+
|sum(d)|
+------+
|  1...|
+------+
```

I think the root cause is: `sum` in the partial aggregate overflows and writes null to the unsafe row. `sum` in the final aggregate reads null from the unsafe row, mistakenly thinks it's caused by empty data, and converts it to 0. We should create a `DecimalSum`, which uses two buffer attributes: `sum` and `isEmpty`. Then in the final aggregate we can check the `isEmpty` flag to know if the null is caused by overflow or empty data.
cloud-fan commented on issue #27627: [WIP][SPARK-28067][SQL] Fix incorrect results for decimal aggregate sum by returning null on decimal overflow URL: https://github.com/apache/spark/pull/27627#issuecomment-594644908

I tried some queries locally but can't reproduce. Is it only a problem with a join?
cloud-fan commented on issue #27627: [WIP][SPARK-28067][SQL] Fix incorrect results for decimal aggregate sum by returning null on decimal overflow URL: https://github.com/apache/spark/pull/27627#issuecomment-594010285

> In case of the whole stage codegen, you can see the decimal '+' expressions will at some point be larger than what can be contained in dec 38,18 but it gets written out as null. This messes up the end result of the sum and you get wrong results.

Still a bit confused. If it gets written out as null, then null + decimal always returns null, and the final result is null?
cloud-fan commented on issue #27627: [WIP][SPARK-28067][SQL] Fix incorrect results for decimal aggregate sum by returning null on decimal overflow URL: https://github.com/apache/spark/pull/27627#issuecomment-593977066

OK to test
cloud-fan commented on issue #27627: [WIP][SPARK-28067][SQL] Fix incorrect results for decimal aggregate sum by returning null on decimal overflow URL: https://github.com/apache/spark/pull/27627#issuecomment-589530921

> Sum does not take care of the possibility of overflow in the intermediate steps, i.e. the updateExpressions and mergeExpressions.

I'm a little confused. These expressions are used in non-whole-stage-codegen as well; why does only whole-stage-codegen have the problem?