[ https://issues.apache.org/jira/browse/SPARK-14231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-14231:
---------------------------------
    Description: 
Currently, the JSON data source supports the {{floatAsBigDecimal}} option, which reads floating-point values as {{DecimalType}}.

I noticed that Spark's {{DecimalType}} has the following restrictions:

1. The precision cannot be greater than 38.
2. The scale cannot be greater than the precision.

However, with the option above, values are read as {{BigDecimal}}, which does not necessarily satisfy these restrictions.
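
For illustration, a plain {{java.math.BigDecimal}} can easily violate both restrictions. This is a standalone sketch (not taken from the report) showing the values involved:

{code}
import java.math.BigDecimal

// "0.01" has unscaled value 1, so precision = 1 while scale = 2,
// i.e. scale > precision, which DecimalType rejects.
val small = new BigDecimal("0.01")
println(small.precision())  // 1
println(small.scale())      // 2

// A literal with more than 38 significant digits exceeds
// DecimalType's maximum precision of 38.
val wide = new BigDecimal("1." + "1" * 40)
println(wide.precision())   // 41
{code}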

This can be reproduced as below:

{code}
import org.apache.spark.rdd.RDD

// Floats like 0.01 yield a BigDecimal whose scale (2) is greater than its precision (1).
def simpleFloats: RDD[String] =
  sqlContext.sparkContext.parallelize(
    """{"a": 0.01}""" ::
    """{"a": 0.02}""" :: Nil)

val jsonDF = sqlContext.read
  .option("floatAsBigDecimal", "true")
  .json(simpleFloats)
jsonDF.printSchema()
{code}

which throws the exception below:

{code}
org.apache.spark.sql.AnalysisException: Decimal scale (2) cannot be greater than precision (1).;
        at org.apache.spark.sql.types.DecimalType.<init>(DecimalType.scala:44)
        at org.apache.spark.sql.execution.datasources.json.InferSchema$.org$apache$spark$sql$execution$datasources$json$InferSchema$$inferField(InferSchema.scala:144)
        at org.apache.spark.sql.execution.datasources.json.InferSchema$.org$apache$spark$sql$execution$datasources$json$InferSchema$$inferField(InferSchema.scala:108)
        at org.apache.spark.sql.execution.datasources.json.InferSchema$$anonfun$1$$anonfun$apply$1$$anonfun$apply$3.apply(InferSchema.scala:59)
        at org.apache.spark.sql.execution.datasources.json.InferSchema$$anonfun$1$$anonfun$apply$1$$anonfun$apply$3.apply(InferSchema.scala:57)
        at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2249)
        at org.apache.spark.sql.execution.datasources.json.InferSchema$$anonfun$1$$anonfun$apply$1.apply(InferSchema.scala:57)
        at org.apache.spark.sql.execution.datasources.json.InferSchema$$anonfun$1$$anonfun$apply$1.apply(InferSchema.scala:55)
        at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396)
        at scala.collection.Iterator$class.foreach(Iterator.scala:742)
...
{code}

Since the JSON data source infers a field as {{StringType}} when inference fails, such values should probably be inferred as {{StringType}}, or perhaps simply as {{DoubleType}}.
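
One possible shape for such a fallback, sketched as a hypothetical helper (the name {{decimalOrFallback}} and its placement are illustrative only, not the actual {{InferSchema}} code):

{code}
import java.math.BigDecimal
import org.apache.spark.sql.types.{DataType, DecimalType, DoubleType}

// Hypothetical helper: choose a type for an inferred BigDecimal,
// falling back when it cannot be represented as a valid DecimalType.
def decimalOrFallback(v: BigDecimal): DataType = {
  // Widen the precision so that scale <= precision always holds
  // (e.g. 0.01 has precision 1 and scale 2, so use precision 2).
  val precision = math.max(v.precision(), v.scale())
  if (precision <= DecimalType.MAX_PRECISION) {
    DecimalType(precision, v.scale())
  } else {
    // Too many digits for DecimalType; fall back to DoubleType
    // (StringType would be the other option mentioned above).
    DoubleType
  }
}
{code}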

> JSON data source fails to infer floats as decimal when precision is bigger 
> than 38 or scale is bigger than precision.
> ---------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-14231
>                 URL: https://issues.apache.org/jira/browse/SPARK-14231
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Hyukjin Kwon
>            Priority: Minor
>             Fix For: 2.0.0
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
