[ https://issues.apache.org/jira/browse/SPARK-18484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16837269#comment-16837269 ]
Bill Schneider edited comment on SPARK-18484 at 5/14/19 9:09 PM:
-----------------------------------------------------------------

I agree with [~bonazzaf]. Where this becomes a real problem is when you read an existing Parquet file with a specific decimal precision/scale, cast it to a case class for a typed Dataset, and then write it back to Parquet. If I have a Parquet file with decimal(24,2), I should be able to do something like:

{code:java}
import spark.implicits._

case class WithDecimal(x: BigDecimal)

val input = spark.read.parquet("file_with_decimal")
// today this fails unless you first add:
//   .withColumn("x", $"x".cast("decimal(38,18)"))
val typedDs = input.as[WithDecimal]
val output = doSomeStuffWith(typedDs)
output.write.parquet("output_with_decimal")
{code}

As it stands, I have to cast `x` to decimal(38,18) before I can use a typed Dataset, and then I have to remember to cast it back if I don't want decimal(38,18) written to Parquet (a sketch of the full round trip follows below). Where this goes from annoying to blocking is in the unlikely event that I really do have more than 20 digits to the left of the decimal point and can't cast to (38,18) at all without truncating.

Note that in this case I'm not inferring a schema from a case class. I already have a DataFrame, it already has a schema from Parquet, and furthermore that schema is compatible with the target case class even without a cast, since BigDecimal can hold arbitrary precision/scale.
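For concreteness, here is a minimal sketch of the full workaround as it has to be written today. The paths, the decimal(24,2) source width, and the map body are illustrative assumptions, not part of the original report:

{code:java}
import org.apache.spark.sql.functions.col
import spark.implicits._ // assumes a SparkSession named `spark`

case class WithDecimal(x: BigDecimal)

// Cast up to the encoder's fixed decimal(38,18) so .as[] resolves.
val typedDs = spark.read.parquet("file_with_decimal")
  .withColumn("x", col("x").cast("decimal(38,18)"))
  .as[WithDecimal]

// Stand-in for real business logic; any such map re-encodes x as decimal(38,18).
val output = typedDs.map(r => r.copy(x = r.x * 2))

// Cast back down before writing, or the file's schema comes out as decimal(38,18).
output.toDF()
  .withColumn("x", col("x").cast("decimal(24,2)"))
  .write.parquet("output_with_decimal")
{code}

The cast back down is easy to forget precisely because nothing fails without it; the written schema just silently widens.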
> case class datasets - ability to specify decimal precision and scale
> --------------------------------------------------------------------
>
>                 Key: SPARK-18484
>                 URL: https://issues.apache.org/jira/browse/SPARK-18484
>             Project: Spark
>          Issue Type: Improvement
>    Affects Versions: 2.0.0, 2.0.1
>            Reporter: Damian Momot
>            Priority: Major
>
> Currently, when using the decimal type (BigDecimal in a Scala case class), there is no
> way to enforce precision and scale. This is quite critical when saving data,
> both for space usage and for compatibility with external systems (for example,
> Hive tables), because Spark saves the data as decimal(38,18):
> {code}
> case class TestClass(id: String, money: BigDecimal)
>
> val testDs = spark.createDataset(Seq(
>   TestClass("1", BigDecimal("22.50")),
>   TestClass("2", BigDecimal("500.66"))
> ))
> testDs.printSchema()
> {code}
> {code}
> root
>  |-- id: string (nullable = true)
>  |-- money: decimal(38,18) (nullable = true)
> {code}
> The workaround is to convert the Dataset to a DataFrame before saving and manually cast
> to the desired decimal precision/scale:
> {code}
> import org.apache.spark.sql.types.DecimalType
>
> val testDf = testDs.toDF()
> testDf
>   .withColumn("money", testDf("money").cast(DecimalType(10,2)))
>   .printSchema()
> {code}
> {code}
> root
>  |-- id: string (nullable = true)
>  |-- money: decimal(10,2) (nullable = true)
> {code}
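The quoted workaround covers the write side. The harder limit raised in the comment above can be demonstrated directly: decimal(38,18) keeps at most 38 - 18 = 20 digits to the left of the decimal point, so a value that is legal in decimal(22,0) overflows. A minimal sketch (the value is made up, and it assumes Spark's default non-ANSI cast behavior, where overflow yields null rather than an error):

{code:java}
import org.apache.spark.sql.functions.{col, lit}
import org.apache.spark.sql.types.DecimalType

// 22 integer digits: legal in decimal(22,0) but too wide for
// decimal(38,18), which holds at most 20 integer digits.
val tooWide = spark.range(1)
  .select(lit(BigDecimal("1234567890123456789012")).as("x"))

// Under default (non-ANSI) cast semantics the overflowing value
// becomes null instead of raising an error.
tooWide.select(col("x").cast(DecimalType(38, 18)).as("x")).show(false)
{code}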