[ https://issues.apache.org/jira/browse/SPARK-47134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dylan Walker updated SPARK-47134:
---------------------------------
    Description:

In specific cases, casting decimal values can result in `null` values where no overflow exists.

The cases appear very specific, and I don't have the depth of knowledge to generalize this issue, so here is a simple spark-shell reproduction:

*Setup:*
{code:scala}
scala> val ds = 0.to(23386).map(x => if (x > 13878) ("A", x) else ("B", x)).toDS
ds: org.apache.spark.sql.Dataset[(String, Int)] = [_1: string, _2: int]

scala> ds.createOrReplaceTempView("t")
{code}

*Spark 3.2.1 behaviour (correct):*
{code:scala}
scala> spark.sql("select CAST(SUM(1.00000000000000) AS DECIMAL(28,14)) as ct FROM t GROUP BY `_1` ORDER BY ct ASC").show()
+--------------------+
|                  ct|
+--------------------+
| 9508.00000000000000|
|13879.00000000000000|
+--------------------+
{code}

*Spark 3.4.1 / Spark 3.5.0 behaviour:*
{code:scala}
scala> spark.sql("select CAST(SUM(1.00000000000000) AS DECIMAL(28,14)) as ct FROM t GROUP BY `_1` ORDER BY ct ASC").show()
+-------------------+
|                 ct|
+-------------------+
|               null|
|9508.00000000000000|
+-------------------+
{code}

This is fairly delicate:
- removing the `ORDER BY` clause produces the correct result
- removing the `CAST` produces the correct result
- changing the number of 0s in the argument to `SUM` produces the correct result
- setting `spark.sql.ansi.enabled` to `true` produces the correct result (and does not throw an error)

Also, removing the `ORDER BY` but first writing `ds` to Parquet still results in the unexpected nulls.

Please let me know if you need additional information. We are also interested in understanding whether setting `spark.sql.ansi.enabled` to `true` can be considered a reliable workaround for this issue until a fix is released.
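A minimal spark-shell sketch of that workaround, assuming the standard Spark 3.x ANSI-mode configuration key `spark.sql.ansi.enabled`; this only restates the behaviour reported above and is not a confirmed fix:

{code:scala}
// Workaround sketch (assumption: the standard Spark 3.x ANSI flag, spark.sql.ansi.enabled).
// Intended to be pasted into spark-shell, which provides `spark` and the Dataset implicits.
spark.conf.set("spark.sql.ansi.enabled", "true")

val ds = 0.to(23386).map(x => if (x > 13878) ("A", x) else ("B", x)).toDS
ds.createOrReplaceTempView("t")

// With ANSI mode enabled, the reporter observes 9508 and 13879 rather than a null,
// and no overflow error is thrown.
spark.sql("select CAST(SUM(1.00000000000000) AS DECIMAL(28,14)) as ct FROM t GROUP BY `_1` ORDER BY ct ASC").show()
{code}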
> Unexpected nulls when casting decimal values in specific cases
> --------------------------------------------------------------
>
>                 Key: SPARK-47134
>                 URL: https://issues.apache.org/jira/browse/SPARK-47134
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.4.1, 3.5.0
>            Reporter: Dylan Walker
>            Priority: Major