[ https://issues.apache.org/jira/browse/SPARK-21263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16069911#comment-16069911 ]
Sean Owen commented on SPARK-21263:
-----------------------------------

CC [~falaki] as well for the original code

Yeah, tough one. The original code is trying to handle Locale, as I expected. The Spark version does not because (for other good reasons) it is not sensitive to the machine's locale. I think the right behavior is therefore to fail on this type of input. I think it's more a fix than a behavior change, IMHO, because silently getting "10" out of "10u000" doesn't sound like a good idea.

We could use {{.toDouble}}. We could also keep the current code but check whether it consumed all the input by inspecting the {{ParsePosition}} afterwards.

I note that, for example, the current code would parse "10e3" as "10", whereas {{.toDouble}} would parse it as 10000.0. So using the latter does introduce small behavior changes, but again, it seems less surprising to parse that correctly as scientific notation, like the standard JVM parsing routines would?

> NumberFormatException is not thrown while converting an invalid string to
> float/double
> --------------------------------------------------------------------------------------
>
>                 Key: SPARK-21263
>                 URL: https://issues.apache.org/jira/browse/SPARK-21263
>             Project: Spark
>          Issue Type: Bug
>          Components: Java API
>    Affects Versions: 2.1.1
>            Reporter: Navya Krishnappa
>
> When reading the data below with a user-defined schema, no exception is
> thrown.
> Details:
> *Data:*
> 'PatientID','PatientName','TotalBill'
> '1000','Patient1','10u000'
> '1001','Patient2','30000'
> '1002','Patient3','40000'
> '1003','Patient4','50000'
> '1004','Patient5','60000'
> *Source code*:
> Dataset dataset = sparkSession.read().schema(schema)
>     .option(INFER_SCHEMA, "true")
>     .option(DELIMITER, ",")
>     .option(QUOTE, "\"")
>     .option(MODE, Mode.PERMISSIVE)
>     .csv(sourceFile);
> When we collect the dataset data:
> dataset.collectAsList();
> *Schema1*:
> [StructField(PatientID,IntegerType,true),
> StructField(PatientName,StringType,true),
> StructField(TotalBill,IntegerType,true)]
> *Result*: Throws NumberFormatException
> Caused by: java.lang.NumberFormatException: For input string: "10u000"
> *Schema2*:
> [StructField(PatientID,IntegerType,true),
> StructField(PatientName,StringType,true),
> StructField(TotalBill,DoubleType,true)]
> *Actual Result*:
> "PatientID": 1000,
> "NumberOfVisits": "400",
> "TotalBill": 10,
> *Expected Result*: Should throw NumberFormatException for input string
> "10u000"
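
To make the two options from the comment concrete, here is a minimal standalone Java sketch. It is not Spark's actual code: the class and method names ({{StrictDoubleParsing}}, {{lenientParse}}, {{strictParse}}) are hypothetical, and using {{NumberFormat.getInstance(Locale.US)}} is an assumption made for illustration. It shows how a {{NumberFormat}} parse can stop at the first unparseable character, and how checking the {{ParsePosition}} afterwards turns that into a failure.

```java
import java.text.NumberFormat;
import java.text.ParsePosition;
import java.util.Locale;

// Hypothetical helper for illustration only; not taken from the Spark code base.
class StrictDoubleParsing {

    // Lenient: accepts whatever numeric prefix NumberFormat can read
    // and silently ignores any trailing characters.
    static double lenientParse(String s) {
        NumberFormat format = NumberFormat.getInstance(Locale.US); // assumed locale
        ParsePosition pos = new ParsePosition(0);
        Number n = format.parse(s, pos); // returns null if nothing could be parsed
        if (n == null) {
            throw new NumberFormatException("For input string: \"" + s + "\"");
        }
        return n.doubleValue();
    }

    // Strict: same parse, but fails unless the whole input was consumed.
    static double strictParse(String s) {
        NumberFormat format = NumberFormat.getInstance(Locale.US); // assumed locale
        ParsePosition pos = new ParsePosition(0);
        Number n = format.parse(s, pos);
        if (n == null || pos.getIndex() != s.length()) {
            throw new NumberFormatException("For input string: \"" + s + "\"");
        }
        return n.doubleValue();
    }

    public static void main(String[] args) {
        // The lenient parse stops at 'u' and reports only the leading digits.
        System.out.println(lenientParse("10u000"));

        // The strict variant rejects the same input.
        try {
            strictParse("10u000");
        } catch (NumberFormatException e) {
            System.out.println("rejected: " + e.getMessage());
        }

        // The standard JVM routine also rejects "10u000", and it reads
        // "10e3" as scientific notation.
        System.out.println(Double.parseDouble("10e3"));
    }
}
```

Either approach rejects "10u000". They differ on inputs such as "10e3": the position check would turn that into an error if the formatter stops before the "e", while {{Double.parseDouble}} (the JVM routine behind Scala's {{.toDouble}}) accepts it as 10000.0, which is the small behavior change mentioned in the comment.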