[ https://issues.apache.org/jira/browse/SPARK-21263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16069911#comment-16069911 ]

Sean Owen commented on SPARK-21263:
-----------------------------------

CC [~falaki] as well for the original code

Yeah, tough one. The original code is trying to handle Locale, as I expected. 
The Spark version does not, because (for other good reasons) it is not sensitive 
to the machine's locale.
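
For reference, locale sensitivity here means the same digits-and-separators text 
maps to different numbers depending on the JVM's default locale. A minimal 
standalone sketch (not the Spark code path itself):

{code:scala}
import java.text.NumberFormat
import java.util.Locale

// '.' and ',' swap roles between locales, so a locale-sensitive parser
// reads the "same" number from differently formatted strings.
NumberFormat.getInstance(Locale.US).parse("1,234.5").doubleValue      // 1234.5
NumberFormat.getInstance(Locale.GERMANY).parse("1.234,5").doubleValue // 1234.5
{code}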

I think the right behavior is therefore to fail on this type of input. That's 
more a fix than a behavior change, IMHO, because silently getting "10" out of 
"10u000" doesn't sound like a good idea.

We could use {{.toDouble}}. Alternatively, we could keep the current code but 
verify that it consumed all the input by checking the {{ParsePosition}} 
afterwards. Note that, for example, the current code parses "10e3" as 10, 
whereas {{.toDouble}} parses it as 10000.0. So using the latter does introduce 
small behavior changes, but again, it seems less surprising to parse that 
correctly as scientific notation, the way standard JVM parsing routines would.
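
To make the two options concrete, a rough sketch (assuming a US locale; this is 
not the actual conversion code in Spark):

{code:scala}
import java.text.{NumberFormat, ParsePosition}
import java.util.Locale

val nf = NumberFormat.getInstance(Locale.US)

// Current behavior: parsing stops at the first bad character and silently
// returns the prefix, leaving the ParsePosition short of the end of the input.
val pos = new ParsePosition(0)
val n = nf.parse("10u000", pos)                    // 10
val consumedAll = pos.getIndex == "10u000".length  // false -> could be treated as an error

// NumberFormat also stops at 'e', so "10e3" comes back as 10.

// .toDouble fails fast on the bad value and handles scientific notation:
val sci = "10e3".toDouble                          // 10000.0
// "10u000".toDouble                               // throws java.lang.NumberFormatException
{code}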

> NumberFormatException is not thrown while converting an invalid string to 
> float/double
> --------------------------------------------------------------------------------------
>
>                 Key: SPARK-21263
>                 URL: https://issues.apache.org/jira/browse/SPARK-21263
>             Project: Spark
>          Issue Type: Bug
>          Components: Java API
>    Affects Versions: 2.1.1
>            Reporter: Navya Krishnappa
>
> When reading the below-mentioned data with a user-defined schema, no exception 
> is thrown for the invalid value. Details:
> *Data:* 
> 'PatientID','PatientName','TotalBill'
> '1000','Patient1','10u000'
> '1001','Patient2','30000'
> '1002','Patient3','40000'
> '1003','Patient4','50000'
> '1004','Patient5','60000'
> *Source code*: 
> Dataset dataset = sparkSession.read().schema(schema)
> .option(INFER_SCHEMA, "true")
> .option(DELIMITER, ",")
> .option(QUOTE, "\"")
> .option(MODE, Mode.PERMISSIVE)
> .csv(sourceFile);
> When we collect the dataset: 
> dataset.collectAsList();
> *Schema1*: 
> [StructField(PatientID,IntegerType,true), 
> StructField(PatientName,StringType,true), 
> StructField(TotalBill,IntegerType,true)]
> *Result*: Throws NumberFormatException 
> Caused by: java.lang.NumberFormatException: For input string: "10u000"
> *Schema2*: 
> [StructField(PatientID,IntegerType,true), 
> StructField(PatientName,StringType,true), 
> StructField(TotalBill,DoubleType,true)]
> *Actual Result*: 
> "PatientID": 1000,
> "NumberOfVisits": "400",
> "TotalBill": 10,
> *Expected Result*: Should throw NumberFormatException for input string 
> "10u000"


