[ 
https://issues.apache.org/jira/browse/SPARK-19488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-19488.
---------------------------------
       Resolution: Fixed
    Fix Version/s: 2.2.0

Issue resolved by pull request 16834
[https://github.com/apache/spark/pull/16834]

> CSV infer schema does not take into account Inf,-Inf,NaN
> --------------------------------------------------------
>
>                 Key: SPARK-19488
>                 URL: https://issues.apache.org/jira/browse/SPARK-19488
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.0.2
>         Environment: Windows 10, SparkShell
>            Reporter: Shivam Dalmia
>            Assignee: Song Jun
>              Labels: easyfix, features
>             Fix For: 2.2.0
>
>
> I observed that while loading a CSV as a DataFrame, user-specified values for 
> nanValue, positiveInf and negativeInf are disregarded when inferSchema = 
> true. (They work if a user-specified schema is provided.) However, even the 
> Spark defaults for the infinities (Inf and -Inf) do not work with 
> inferSchema. 
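> For reference, the options in question are passed roughly like this (a 
> minimal repro sketch; the option names are the documented CSV reader 
> options, the file path is hypothetical):
> {code}
> // With an explicit schema these options are honored; with inferSchema = true
> // they are ignored during type inference (the behavior reported here).
> val df = spark.read
>   .option("header", "true")
>   .option("inferSchema", "true")
>   .option("nanValue", "NaN")
>   .option("positiveInf", "Inf")
>   .option("negativeInf", "-Inf")
>   .csv("/path/to/data.csv")  // hypothetical path
> {code}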
> Taking a look at the schema-inference code for CSV 
> (CSVInferSchema.scala), I found the following snippet:
> {code}
> 1.  private def tryParseDouble(field: String, options: CSVOptions): DataType = {
> 2.    if ((allCatch opt field.toDouble).isDefined) {
> 3.      DoubleType
> 4.    } else {
> 5.      tryParseTimestamp(field, options)
> 6.    }
> 7.  }
> 8.
> 9.  private def tryParseTimestamp(field: String, options: CSVOptions): DataType = {
> 10.   // This case infers that a custom `timestampFormat` is set.
> 11.   if ((allCatch opt options.timestampFormat.parse(field)).isDefined) {
> 12.     TimestampType
> 13.   } else if ((allCatch opt DateTimeUtils.stringToTime(field)).isDefined) {
> 14.     // We keep this for backwards compatibility.
> 15.     TimestampType
> 16.   } else {
> 17.     tryParseBoolean(field, options)
> 18.   }
> 19. }
> {code}
> Interestingly, the user-specified CSV options are not used at all when 
> determining whether the field is a double (as we can see in line 2). The 
> options are consulted for the timestamp type (line 11), which is why the 
> 'timestampFormat' option does work. 
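> For illustration, an options-aware check could compare the field against 
> the configured markers before falling back to toDouble. A minimal sketch, 
> assuming CSVOptions exposes the markers as nanValue, positiveInf and 
> negativeInf (this is not necessarily the fix adopted in the pull request):
> {code}
> private def tryParseDouble(field: String, options: CSVOptions): DataType = {
>   // Treat the user-configured NaN/Infinity markers as doubles, then fall
>   // back to the plain toDouble check used today.
>   val isSpecial = field == options.nanValue ||
>     field == options.positiveInf ||
>     field == options.negativeInf
>   if (isSpecial || (allCatch opt field.toDouble).isDefined) {
>     DoubleType
>   } else {
>     tryParseTimestamp(field, options)
>   }
> }
> {code}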
> However, when the field is NaN, it works, because Scala's toDouble does 
> convert the string "NaN" to the Double equivalent of NaN (I tried it in 
> the shell):
> {code}
> scala> var field = "8.0942";
> field: String = 8.0942
> scala> allCatch.opt(field.toDouble)
> res12: Option[Double] = Some(8.0942)
> scala> field = "NaN";
> field: String = NaN
> scala> allCatch.opt(field.toDouble)
> res13: Option[Double] = Some(NaN)
> scala> field = "Inf";
> field: String = Inf
> scala> allCatch.opt(field.toDouble)
> res14: Option[Double] = None
> {code}
> Interestingly, Scala does have Double equivalents of Infinity and -Infinity 
> (but the Spark defaults are Inf and -Inf, which is why they don't work):
> {code}
> scala> field = "Infinity";
> field: String = Infinity
> scala> allCatch.opt(field.toDouble)
> res15: Option[Double] = Some(Infinity)
> scala> field = "-Infinity";
> field: String = -Infinity
> scala> allCatch.opt(field.toDouble)
> res16: Option[Double] = Some(-Infinity)
> {code}
> The following CSV, when ingested with inferSchema = true, therefore has its 
> value column interpreted as a Double, regardless of the user-specified 
> options!
> {code}
> ID,name,value,irrational,prime,real
> 1,e,2.7,true,false,true
> 2,pi,3.14,true,false,true
> 3,inf,Infinity,false,false,true
> 4,-inf,-Infinity,false,false,true
> 5,i,NaN,false,false,false
> {code}
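> To reproduce in the shell (a sketch with a hypothetical path to the file 
> above):
> {code}
> val df = spark.read
>   .option("header", "true")
>   .option("inferSchema", "true")
>   .csv("/path/to/constants.csv")  // hypothetical path
> df.printSchema()  // value is inferred as double despite any user options
> {code}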


