[ https://issues.apache.org/jira/browse/SPARK-28067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16880700#comment-16880700 ]
Marco Gaido commented on SPARK-28067: ------------------------------------- I cannot reproduce in 2.4.0 either: {code} spark-2.4.0-bin-hadoop2.7 xxx$ ./bin/spark-shell 2019-07-08 22:52:11 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). Spark context Web UI available at http://xxx:4040 Spark context available as 'sc' (master = local[*], app id = local-1562619141279). Spark session available as 'spark'. Welcome to ____ __ / __/__ ___ _____/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 2.4.0 /_/ Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_152) Type in expressions to have them evaluated. Type :help for more information. scala> val df = Seq( | (BigDecimal("10000000000000000000"), 1), | (BigDecimal("10000000000000000000"), 1), | (BigDecimal("10000000000000000000"), 2), | (BigDecimal("10000000000000000000"), 2), | (BigDecimal("10000000000000000000"), 2), | (BigDecimal("10000000000000000000"), 2), | (BigDecimal("10000000000000000000"), 2), | (BigDecimal("10000000000000000000"), 2), | (BigDecimal("10000000000000000000"), 2), | (BigDecimal("10000000000000000000"), 2), | (BigDecimal("10000000000000000000"), 2), | (BigDecimal("10000000000000000000"), 2)).toDF("decNum", "intNum") df: org.apache.spark.sql.DataFrame = [decNum: decimal(38,18), intNum: int] scala> val df2 = df.withColumnRenamed("decNum", "decNum2").join(df, "intNum").agg(sum("decNum")) df2: org.apache.spark.sql.DataFrame = [sum(decNum): decimal(38,18)] scala> df2.show(40,false) +-----------+ |sum(decNum)| +-----------+ |null | +-----------+ {code} > Incorrect results in decimal aggregation with whole-stage code gen enabled > -------------------------------------------------------------------------- > > Key: SPARK-28067 > URL: https://issues.apache.org/jira/browse/SPARK-28067 > Project: Spark > Issue Type: Bug > Components: Spark Core > Affects Versions: 2.3.0, 2.4.0 > Environment: Ubuntu LTS 16.04 > Oracle Java 1.8.0_201 > spark-2.4.3-bin-without-hadoop > spark-shell > Reporter: Mark Sirek > Priority: Minor > Labels: correctness > > The following test case involving a join followed by a sum aggregation > returns the wrong answer for the sum: > > {code:java} > val df = Seq( > (BigDecimal("10000000000000000000"), 1), > (BigDecimal("10000000000000000000"), 1), > (BigDecimal("10000000000000000000"), 2), > (BigDecimal("10000000000000000000"), 2), > (BigDecimal("10000000000000000000"), 2), > (BigDecimal("10000000000000000000"), 2), > (BigDecimal("10000000000000000000"), 2), > (BigDecimal("10000000000000000000"), 2), > (BigDecimal("10000000000000000000"), 2), > (BigDecimal("10000000000000000000"), 2), > (BigDecimal("10000000000000000000"), 2), > (BigDecimal("10000000000000000000"), 2)).toDF("decNum", "intNum") > val df2 = df.withColumnRenamed("decNum", "decNum2").join(df, > "intNum").agg(sum("decNum")) > scala> df2.show(40,false) > --------------------------------------- > sum(decNum) > --------------------------------------- > 40000000000000000000.000000000000000000 > --------------------------------------- > > {code} > > The result should be 1040000000000000000000.0000000000000000. > It appears a partial sum is computed for each join key, as the result > returned would be the answer for all rows matching intNum === 1. > If only the rows with intNum === 2 are included, the answer given is null: > > {code:java} > scala> val df3 = df.filter($"intNum" === lit(2)) > df3: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [decNum: > decimal(38,18), intNum: int] > scala> val df4 = df3.withColumnRenamed("decNum", "decNum2").join(df3, > "intNum").agg(sum("decNum")) > df4: org.apache.spark.sql.DataFrame = [sum(decNum): decimal(38,18)] > scala> df4.show(40,false) > ----------- > sum(decNum) > ----------- > null > ----------- > > {code} > > The correct answer, 1000000000000000000000.0000000000000000, doesn't fit in > the DataType picked for the result, decimal(38,18), so an overflow occurs, > which Spark then converts to null. > The first example, which doesn't filter out the intNum === 1 values should > also return null, indicating overflow, but it doesn't. This may mislead the > user to think a valid sum was computed. > If whole-stage code gen is turned off: > spark.conf.set("spark.sql.codegen.wholeStage", false) > ... incorrect results are not returned because the overflow is caught as an > exception: > java.lang.IllegalArgumentException: requirement failed: Decimal precision 39 > exceeds max precision 38 > > > > > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org