[jira] [Commented] (SPARK-47134) Unexpected nulls when casting decimal values in specific cases
[ https://issues.apache.org/jira/browse/SPARK-47134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17844720#comment-17844720 ]

Dylan Walker commented on SPARK-47134:
--------------------------------------

This ticket can be withdrawn. I can confirm that it is not an issue with ASF's Spark distributions. I have not been permitted to provide further details, nor is there publicly available information to point to. Apologies for the misdirected request and the delayed follow-up.

> Unexpected nulls when casting decimal values in specific cases
> --------------------------------------------------------------
>
>                 Key: SPARK-47134
>                 URL: https://issues.apache.org/jira/browse/SPARK-47134
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.4.1, 3.5.0
>            Reporter: Dylan Walker
>            Priority: Major
>         Attachments: 321queryplan.txt, 341queryplan.txt
>
> In specific cases, casting decimal values can result in {{null}} values where no overflow exists.
> The cases appear very specific, and I don't have the depth of knowledge to generalize this issue, so here is a simple spark-shell reproduction:
>
> *Setup:*
> {code:scala}
> scala> val ds = 0.to(23386).map(x => if (x > 13878) ("A", x) else ("B", x)).toDS
> ds: org.apache.spark.sql.Dataset[(String, Int)] = [_1: string, _2: int]
>
> scala> ds.createOrReplaceTempView("t")
> {code}
>
> *Spark 3.2.1 behaviour (correct):*
> {code:scala}
> scala> spark.sql("select CAST(SUM(1.00) AS DECIMAL(28,14)) as ct FROM t GROUP BY `_1` ORDER BY ct ASC").show()
> +--------+
> |      ct|
> +--------+
> | 9508.00|
> |13879.00|
> +--------+
> {code}
>
> *Spark 3.4.1 / Spark 3.5.0 behaviour:*
> {code:scala}
> scala> spark.sql("select CAST(SUM(1.00) AS DECIMAL(28,14)) as ct FROM t GROUP BY `_1` ORDER BY ct ASC").show()
> +-------+
> |     ct|
> +-------+
> |   null|
> |9508.00|
> +-------+
> {code}
>
> This is fairly delicate:
> - removing the {{ORDER BY}} clause produces the correct result
> - removing the {{CAST}} produces the correct result
> - changing the number of 0s in the argument to {{SUM}} produces the correct result
> - setting {{spark.ansi.enabled}} to {{true}} produces the correct result (and does not throw an error)
>
> Also, removing the {{ORDER BY}} but writing {{ds}} to a parquet file first will also result in the unexpected nulls.
> Please let me know if you need additional information.
> We are also interested in understanding whether setting {{spark.ansi.enabled}} can be considered a reliable workaround to this issue prior to a fix being released.
> Text files that include {{explain()}} output are attached.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
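For background on the workaround question: with ANSI mode off, Spark's documented decimal cast semantics return null when a value genuinely cannot fit the target type, while ANSI mode (the full config key is {{spark.sql.ansi.enabled}}) raises an error instead. The fit rule for DECIMAL(precision, scale) can be modelled with Python's decimal module; the helper name below is hypothetical and this is an illustrative sketch, not Spark's actual implementation:

```python
from decimal import Decimal, InvalidOperation, ROUND_HALF_UP, localcontext

def cast_to_decimal(value: Decimal, precision: int, scale: int):
    """Rough model of Spark's *non-ANSI* decimal cast: a value that cannot be
    represented as DECIMAL(precision, scale) becomes None (SQL null) instead
    of raising. Hypothetical helper for illustration -- not Spark's code."""
    try:
        with localcontext() as ctx:
            ctx.prec = precision  # total significant digits allowed
            # Rescale to exactly `scale` fractional digits; the decimal module
            # raises InvalidOperation if the result needs more than `prec`
            # digits, which is exactly the DECIMAL(p, s) overflow condition.
            return value.quantize(Decimal(1).scaleb(-scale),
                                  rounding=ROUND_HALF_UP)
    except InvalidOperation:
        return None

# The sums in the reproduction (9508.00 and 13879.00) need only 4-5 integer
# digits, far within the 14 that DECIMAL(28,14) allows, so by this rule the
# nulls seen on 3.4.1/3.5.0 cannot be genuine overflows.
```

By this rule the reported values fit DECIMAL(28,14) with room to spare, which supports the claim in the description that the nulls appear where no overflow exists.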
[jira] [Commented] (SPARK-47134) Unexpected nulls when casting decimal values in specific cases
[ https://issues.apache.org/jira/browse/SPARK-47134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17819834#comment-17819834 ]

Dylan Walker commented on SPARK-47134:
--------------------------------------

[~bersprockets] Hmm, it's possible I made too many assumptions. I left out that this is on EMR, which has its own fork of Spark. If this refers to names that don't exist in the Apache Spark codebase, it may be an Amazon thing. I will reach out to AWS support to confirm, and I apologize if this turns out to be the case.
[jira] [Commented] (SPARK-47134) Unexpected nulls when casting decimal values in specific cases
[ https://issues.apache.org/jira/browse/SPARK-47134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17819789#comment-17819789 ]

Bruce Robbins commented on SPARK-47134:
---------------------------------------

Oddly, I cannot reproduce this on either 3.4.1 or 3.5.0. Also, my 3.4.1 plan doesn't look like your 3.4.1 plan: my plan uses {{sum}}, while your plan uses {{decimalsum}}. I can't find where {{decimalsum}} comes from in the code base, but maybe I am not looking hard enough.

{noformat}
scala> val ds = 0.to(23386).map(x => if (x > 13878) ("A", x) else ("B", x)).toDS
ds: org.apache.spark.sql.Dataset[(String, Int)] = [_1: string, _2: int]

scala> ds.createOrReplaceTempView("t")

scala> spark.sql("select CAST(SUM(1.00) AS DECIMAL(28,14)) as ct FROM t GROUP BY `_1` ORDER BY ct ASC").show()
+--------+
|      ct|
+--------+
| 9508.00|
|13879.00|
+--------+

scala> spark.sql("select CAST(SUM(1.00) AS DECIMAL(28,14)) as ct FROM t GROUP BY `_1` ORDER BY ct ASC").explain
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Sort [ct#19 ASC NULLS FIRST], true, 0
   +- Exchange rangepartitioning(ct#19 ASC NULLS FIRST, 200), ENSURE_REQUIREMENTS, [plan_id=68]
      +- HashAggregate(keys=[_1#2], functions=[sum(1.00)])
         +- Exchange hashpartitioning(_1#2, 200), ENSURE_REQUIREMENTS, [plan_id=65]
            +- HashAggregate(keys=[_1#2], functions=[partial_sum(1.00)])
               +- LocalTableScan [_1#2]

scala> sql("select version()").show(false)
+----------------------------------------------+
|version()                                     |
+----------------------------------------------+
|3.4.1 6b1ff22dde1ead51cbf370be6e48a802daae58b6|
+----------------------------------------------+
{noformat}
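When diffing the attached 321queryplan.txt and 341queryplan.txt, the aggregate expressions are the interesting part: the plan that shows {{decimalsum}} instead of {{sum}} is the outlier. The function names can be pulled out of an {{explain()}} dump mechanically; the helper below is a hypothetical illustration, not part of Spark:

```python
import re

def aggregate_functions(plan: str):
    """Extract function names from HashAggregate(..., functions=[...]) entries
    in a physical-plan string, in the order they appear. Hypothetical helper
    for comparing explain() dumps -- not part of Spark itself."""
    names = []
    for funcs in re.findall(r"functions=\[([^\]]*)\]", plan):
        for f in funcs.split(","):
            m = re.match(r"(\w+)\(", f.strip())
            if m:
                names.append(m.group(1))
    return names
```

Running this over each attachment should make the {{sum}} vs {{decimalsum}} discrepancy, and any other plan differences, easy to spot.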