[ https://issues.apache.org/jira/browse/SPARK-48965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17874352#comment-17874352 ]

Bruce Robbins edited comment on SPARK-48965 at 8/16/24 6:52 PM:
----------------------------------------------------------------

It's not just decimals. {{toJSON}} is simply using the wrong schema. For 
example:

{noformat}
scala> case class Data(x: Int, y: String)
class Data

scala> sql("select 'Hey there' as y, 22 as x").as[Data].collect
warning: 1 deprecation (since 2.13.3); for details, enable `:setting -deprecation` or `:replay -deprecation`
val res0: Array[Data] = Array(Data(22,Hey there))

scala> sql("select 'Hey there' as y, 22 as x").as[Data].toJSON.collect
warning: 1 deprecation (since 2.13.3); for details, enable `:setting -deprecation` or `:replay -deprecation`
val res1: Array[String] = Array({"y":"\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0016\u0000\u0000\u0000\u0000\u0000\u0000\u0000\t\u0000\u0000\u0000\u0018\u0000","x":9})

scala>
{noformat}
Edit: Even more interesting, you can crash the JVM:
{noformat}
scala> case class Data(x: Array[Int], y: String)
class Data

scala> sql("select repeat('Hey there', 17) as y, array_repeat(22, 17) as x").as[Data].collect
warning: 1 deprecation (since 2.13.3); for details, enable `:setting -deprecation` or `:replay -deprecation`
val res0: Array[Data] = Array(Data([I@3ca370ae,Hey thereHey thereHey thereHey thereHey thereHey thereHey thereHey thereHey thereHey thereHey thereHey thereHey thereHey thereHey thereHey thereHey there))

scala> sql("select repeat('Hey there', 17) as y, array_repeat(22, 17) as x").as[Data].toJSON.collect
warning: 1 deprecation (since 2.13.3); for details, enable `:setting -deprecation` or `:replay -deprecation`
Exception in task 0.0 in stage 1.0 (TID 1)
java.lang.InternalError: a fault occurred in a recent unsafe memory access operation in compiled Java code
        at org.apache.spark.sql.catalyst.json.JacksonGenerator.$anonfun$makeWriter$5(JacksonGenerator.scala:129) ~[spark-catalyst_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
        at org.apache.spark.sql.catalyst.json.JacksonGenerator.$anonfun$makeWriter$5$adapted(JacksonGenerator.scala:128)
        ...
{noformat}
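
For anyone hitting this before a fix lands, a possible workaround is to force a full deserialize/serialize round trip through the encoder before calling {{toJSON}}. A minimal sketch, assuming {{map(identity)}} re-serializes each object with the case class encoder so that the row layout matches the schema {{toJSON}} uses (a guess at a workaround, not verified across Spark versions):

{code:scala}
import spark.implicits._

case class Data(x: Int, y: String)

// map(identity) deserializes each row into a Data instance via the encoder
// and serializes it back, so toJSON should see rows whose layout matches
// the encoder schema rather than the original query's column order.
sql("select 'Hey there' as y, 22 as x")
  .as[Data]
  .map(identity)
  .toJSON
  .collect()
{code}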



> toJSON produces wrong values if DecimalType information is lost in as[Product]
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-48965
>                 URL: https://issues.apache.org/jira/browse/SPARK-48965
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.1.1, 3.5.1
>            Reporter: Dmitry Lapshin
>            Priority: Major
>
> Consider this example:
> {code:scala}
> package com.jetbrains.jetstat.etl
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.types.DecimalType
> object A {
>   case class Example(x: BigDecimal)
>   def main(args: Array[String]): Unit = {
>     val spark = SparkSession.builder()
>       .master("local[1]")
>       .getOrCreate()
>     import spark.implicits._
>     val originalRaw = BigDecimal("123.456")
>     val original = Example(originalRaw)
>     val ds1 = spark.createDataset(Seq(original))
>     val ds2 = ds1
>       .withColumn("x", $"x" cast DecimalType(12, 6))
>     val ds3 = ds2
>       .as[Example]
>     println(s"DS1: schema=${ds1.schema}, 
> encoder.schema=${ds1.encoder.schema}")
>     println(s"DS2: schema=${ds1.schema}, 
> encoder.schema=${ds2.encoder.schema}")
>     println(s"DS3: schema=${ds1.schema}, 
> encoder.schema=${ds3.encoder.schema}")
>     val json1 = ds1.toJSON.collect().head
>     val json2 = ds2.toJSON.collect().head
>     val json3 = ds3.toJSON.collect().head
>     val collect1 = ds1.collect().head
>     val collect2_ = ds2.collect().head
>     val collect2 = collect2_.getDecimal(collect2_.fieldIndex("x"))
>     val collect3 = ds3.collect().head
>     println(s"Original: $original (scale = ${original.x.scale}, precision = 
> ${original.x.precision})")
>     println(s"Collect1: $collect1 (scale = ${collect1.x.scale}, precision = 
> ${collect1.x.precision})")
>     println(s"Collect2: $collect2 (scale = ${collect2.scale}, precision = 
> ${collect2.precision})")
>     println(s"Collect3: $collect3 (scale = ${collect3.x.scale}, precision = 
> ${collect3.x.precision})")
>     println(s"json1: $json1")
>     println(s"json2: $json2")
>     println(s"json3: $json3")
>   }
> }
> {code}
> Running it, you'll see that json3 contains badly wrong data. After a bit of debugging (apologies, I'm not well versed in Spark internals), I found that:
>  * The in-memory representation of the data in this example uses {{UnsafeRow}}, whose {{getDecimal}} stores small decimal values compactly as unscaled longs but does not record their precision and scale (see the sketch below),
>  * However, there are at least two sources for the precision and scale passed to that method: {{Dataset.schema}} (which is based on query execution and always contains (38, 18) for me) and {{Dataset.encoder.schema}} (which is updated to (12, 6) in {{ds2}} but then reset in {{ds3}}). Also, there is a {{Dataset.deserializer}} that seems to combine those two non-trivially,
>  * This doesn't seem to affect the {{Dataset.collect()}} methods, since they use the {{deserializer}}, but {{Dataset.toJSON}} uses only the first schema.
> It seems to me that either {{.toJSON}} should be more aware of what's going on or {{.as[]}} should be doing something else.
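>
> To make the first bullet concrete, here's a sketch that uses {{org.apache.spark.sql.types.Decimal}} directly (my assumption is that this mirrors the unscaled-long reinterpretation behind {{UnsafeRow.getDecimal}}; the point is only that the same unscaled long reads back as a different number under a different precision and scale):
> {code:scala}
> import org.apache.spark.sql.types.Decimal
>
> // 123.456 at precision 12, scale 6 has the unscaled long 123456000.
> val d = Decimal(BigDecimal("123.456"), 12, 6)
> val unscaled = d.toUnscaledLong // 123456000
>
> // Reinterpreting the same unscaled long at (38, 18) yields a very different value.
> Decimal(unscaled, 38, 18).toBigDecimal // 1.23456E-10
> {code}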


