[ https://issues.apache.org/jira/browse/SPARK-48965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17874352#comment-17874352 ]
Bruce Robbins commented on SPARK-48965:
---------------------------------------

It's not just decimals: {{toJSON}} is simply using the wrong schema. For example:
{noformat}
scala> case class Data(x: Int, y: String)
class Data

scala> sql("select 'Hey there' as y, 22 as x").as[Data].collect
val res0: Array[Data] = Array(Data(22,Hey there))

scala> sql("select 'Hey there' as y, 22 as x").as[Data].toJSON.collect
val res1: Array[String] = Array({"y":"\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0016\u0000\u0000\u0000\u0000\u0000\u0000\u0000\t\u0000\u0000\u0000\u0018\u0000","x":9})
{noformat}

> toJSON produces wrong values if DecimalType information is lost in as[Product]
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-48965
>                 URL: https://issues.apache.org/jira/browse/SPARK-48965
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.1.1, 3.5.1
>            Reporter: Dmitry Lapshin
>            Priority: Major
>
> Consider this example:
> {code:scala}
> package com.jetbrains.jetstat.etl
>
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.types.DecimalType
>
> object A {
>   case class Example(x: BigDecimal)
>
>   def main(args: Array[String]): Unit = {
>     val spark = SparkSession.builder()
>       .master("local[1]")
>       .getOrCreate()
>     import spark.implicits._
>
>     val originalRaw = BigDecimal("123.456")
>     val original = Example(originalRaw)
>
>     val ds1 = spark.createDataset(Seq(original))
>     val ds2 = ds1.withColumn("x", $"x" cast DecimalType(12, 6))
>     val ds3 = ds2.as[Example]
>
>     println(s"DS1: schema=${ds1.schema}, encoder.schema=${ds1.encoder.schema}")
>     println(s"DS2: schema=${ds2.schema}, encoder.schema=${ds2.encoder.schema}")
>     println(s"DS3: schema=${ds3.schema}, encoder.schema=${ds3.encoder.schema}")
>
>     val json1 = ds1.toJSON.collect().head
>     val json2 = ds2.toJSON.collect().head
>     val json3 = ds3.toJSON.collect().head
>
>     val collect1 = ds1.collect().head
>     val collect2_ = ds2.collect().head
>     val collect2 = collect2_.getDecimal(collect2_.fieldIndex("x"))
>     val collect3 = ds3.collect().head
>
>     println(s"Original: $original (scale = ${original.x.scale}, precision = ${original.x.precision})")
>     println(s"Collect1: $collect1 (scale = ${collect1.x.scale}, precision = ${collect1.x.precision})")
>     println(s"Collect2: $collect2 (scale = ${collect2.scale}, precision = ${collect2.precision})")
>     println(s"Collect3: $collect3 (scale = ${collect3.x.scale}, precision = ${collect3.x.precision})")
>     println(s"json1: $json1")
>     println(s"json2: $json2")
>     println(s"json3: $json3")
>   }
> }
> {code}
> Running it, you'll see that json3 contains badly corrupted data. After some debugging (apologies, I'm not well versed in Spark internals), I found that:
> * The in-memory representation of the data in this example uses {{UnsafeRow}}, which stores small decimal values compactly as unscaled longs; its {{.getDecimal}} must therefore be told the precision and scale, which the row itself does not record,
> * However, there are at least two sources of the precision and scale passed to that method: {{Dataset.schema}} (based on the query execution; always 38,18 for me) and {{Dataset.encoder.schema}} (updated to 12,6 in {{ds2}}, but reset in {{ds3}}). There is also a {{Dataset.deserializer}} that seems to combine the two non-trivially (see the sketch after this list),
> * This doesn't seem to affect the {{Dataset.collect()}} methods, since they use the {{deserializer}}, but {{Dataset.toJSON}} uses only the first schema.
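>
> To illustrate the first two points, here's a minimal sketch against the catalyst API ({{UnsafeProjection}}, {{InternalRow}}, and {{Decimal}} are internal, so exact behavior may differ between Spark versions): it writes a small decimal into an {{UnsafeRow}} and reads it back with whatever precision and scale the caller supplies:
> {code:scala}
> import org.apache.spark.sql.catalyst.InternalRow
> import org.apache.spark.sql.catalyst.expressions.UnsafeProjection
> import org.apache.spark.sql.types.{Decimal, DecimalType, StructField, StructType}
>
> object UnsafeDecimalSketch {
>   def main(args: Array[String]): Unit = {
>     // Write x as DecimalType(12, 6): precision <= 18, so UnsafeRow stores it
>     // compactly as the unscaled long 123456000, with no precision/scale attached.
>     val schema = StructType(Seq(StructField("x", DecimalType(12, 6))))
>     val toUnsafe = UnsafeProjection.create(schema)
>     val row = toUnsafe(InternalRow(Decimal(BigDecimal("123.456"), 12, 6)))
>
>     // The reader must supply precision and scale; the row cannot check them.
>     println(row.getDecimal(0, 12, 6)) // 123.456000 -- matches what was written
>     println(row.getDecimal(0, 12, 2)) // 1234560.00 -- same long, wrong scale
>     // Reading with precision > 18 would instead treat the stored long as an
>     // offset-and-size pointer into the row's variable-length section, which
>     // is how garbage like json3 below can arise.
>   }
> }
> {code}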
> Seems to me that either {{.toJSON}} should be more aware of what's going on, or {{.as[]}} should be doing something else.
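>
> Until that's resolved, one possible workaround (a sketch only; it costs an extra deserialize/serialize round-trip, and I haven't verified it across the affected versions) is to force the data through the typed deserializer before serializing to JSON, continuing the repro above:
> {code:scala}
> // .map(identity) deserializes each row via the Dataset's deserializer (which
> // reconciles the two schemas) and re-serializes it with the encoder's schema,
> // so toJSON then reads the data with a schema that matches how it was written.
> val ds3Fixed = ds2.as[Example].map(identity)
> println(ds3Fixed.toJSON.collect().head) // should print x = 123.456, not garbage
> {code}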