[ https://issues.apache.org/jira/browse/SPARK-40439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
xsys updated SPARK-40439:
-------------------------
Description: 
h3. Describe the bug
We are trying to store the DECIMAL value {{33333333333.2222222222}}, which has more precision than the schema allows: {{DECIMAL(20,10)}}. If the table is created using DataFrames via {{spark-shell}}, a {{NULL}} value is silently stored. However, if the table is created via {{spark-sql}}, the same insert raises the following exception:
{code:java}
Failed in [insert into decimal_extra_precision select 33333333333.2222222222]
java.lang.ArithmeticException: Decimal(expanded,33333333333.2222222222,21,10}) cannot be represented as Decimal(20, 10){code}
h3. Steps to reproduce
On Spark 3.2.1 (commit {{4f25b3f712}}), execute the following in {{spark-sql}}:
{code:java}
create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC;
insert into decimal_extra_precision select 33333333333.2222222222;{code}
h3. Expected behavior
We expect the two Spark interfaces ({{spark-sql}} & {{spark-shell}}) to behave consistently for the same data type & input combination ({{DECIMAL(20,10)}} and {{33333333333.2222222222}}).
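The overflow itself can be confirmed independently of Spark with a short Python {{decimal}} check (an illustration, not Spark's code): the literal carries 21 significant digits at scale 10, so it requires {{DECIMAL(21,10)}} and cannot fit in {{DECIMAL(20,10)}}, which leaves room for only 10 integer digits.

```python
from decimal import Decimal

# The literal from the bug report: 11 integer digits + 10 fractional digits.
v = Decimal("33333333333.2222222222")
sign, digits, exponent = v.as_tuple()

precision = len(digits)  # total number of significant digits
scale = -exponent        # digits after the decimal point

# Needs DECIMAL(21,10); DECIMAL(20,10) is one digit short.
print(precision, scale)  # 21 10
```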
Here is a simplified example in {{spark-shell}}, where insertion of the same decimal value evaluates to a {{NULL}}:
{code:java}
scala> import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.{Row, SparkSession}

scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

scala> val rdd = sc.parallelize(Seq(Row(BigDecimal("33333333333.2222222222"))))
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = ParallelCollectionRDD[0] at parallelize at <console>:27

scala> val schema = new StructType().add(StructField("c1", DecimalType(20, 10), true))
schema: org.apache.spark.sql.types.StructType = StructType(StructField(c1,DecimalType(20,10),true))

scala> val df = spark.createDataFrame(rdd, schema)
df: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)]

scala> df.show()
+----+
|  c1|
+----+
|null|
+----+

scala> df.write.mode("overwrite").format("orc").saveAsTable("decimal_extra_precision")
22/08/29 10:33:47 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.

scala> spark.sql("select * from decimal_extra_precision;")
res2: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)]
{code}
h3. Root Cause
The exception is raised from [Decimal.toPrecision|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala#L358-L373]; {{nullOnOverflow}} is controlled by {{spark.sql.ansi.enabled}} in [SQLConf|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L2542-L2551]:
{code:java}
private[sql] def toPrecision(
    precision: Int,
    scale: Int,
    roundMode: BigDecimal.RoundingMode.Value = ROUND_HALF_UP,
    nullOnOverflow: Boolean = true,
    context: SQLQueryContext = null): Decimal = {
  val copy = clone()
  if (copy.changePrecision(precision, scale, roundMode)) {
    copy
  } else {
    if (nullOnOverflow) {
      null
    } else {
      throw QueryExecutionErrors.cannotChangeDecimalPrecisionError(
        this, precision, scale, context)
    }
  }
}{code}
This method is invoked from [Cast.scala|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L754-L756]. However, our attempt to insert {{33333333333.2222222222}} after setting {{spark.sql.ansi.enabled}} to {{false}} failed as well (which may be an independent issue).
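The divergence comes down to that {{nullOnOverflow}} branch. A hypothetical Python mirror of the logic (the function name and the digit-count check are our simplification, not Spark's actual implementation) reproduces both observed behaviors: a {{NULL}} when the flag is true, and an exception when it is false:

```python
from decimal import Decimal, ROUND_HALF_UP

def to_precision(value: Decimal, precision: int, scale: int,
                 null_on_overflow: bool = True):
    """Simplified mirror of Decimal.toPrecision: round to `scale`,
    then null out (or fail) if the result exceeds `precision` digits."""
    rounded = value.quantize(Decimal(1).scaleb(-scale), rounding=ROUND_HALF_UP)
    if len(rounded.as_tuple().digits) <= precision:
        return rounded
    if null_on_overflow:
        return None  # DataFrame path: the value silently becomes NULL
    # spark-sql path: analogous to the java.lang.ArithmeticException
    raise ArithmeticError(
        f"{value} cannot be represented as Decimal({precision}, {scale})")

v = Decimal("33333333333.2222222222")
print(to_precision(v, 20, 10))               # None
print(to_precision(Decimal("1.5"), 20, 10))  # 1.5000000000
```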
> DECIMAL value with more precision than what is defined in the schema raises
> exception in SparkSQL but evaluates to NULL for DataFrame
> -------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-40439
>                 URL: https://issues.apache.org/jira/browse/SPARK-40439
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.2.1
>            Reporter: xsys
>            Priority: Major
>


--
This message was sent by Atlassian Jira
(v8.20.10#820010)