[ https://issues.apache.org/jira/browse/SPARK-14586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15243702#comment-15243702 ]
Suresh Thalamati commented on SPARK-14586:
------------------------------------------

Thanks for reporting this issue, Stephane. Which version of Hive are you using?

I took a quick look at the code; here is what I found: type decimal(4,2) maps to BigDecimal, not double, and BigDecimal parsing fails if the input string contains spaces.

{code}
scala> BigDecimal(" 2.0")
java.lang.NumberFormatException
  at java.math.BigDecimal.<init>(BigDecimal.java:494)
  at java.math.BigDecimal.<init>(BigDecimal.java:383)
{code}

Spark SQL also relies on HiveDecimal to convert the string to a BigDecimal value. Hive made a fix in the 2.0 release to trim the decimal input string:

https://issues.apache.org/jira/browse/HIVE-12343
https://issues.apache.org/jira/browse/HIVE-10799

Commit: https://github.com/apache/hive/commit/c178a6e9d12055e5bde634123ca58f243ae39477

{code}
common/src/java/org/apache/hadoop/hive/common/type/HiveDecimal.java

  public static HiveDecimal create(String dec) {
    BigDecimal bd;
    try {
-     bd = new BigDecimal(dec);
+     bd = new BigDecimal(dec.trim());
    } catch (NumberFormatException ex) {
      return null;
    }
{code}

When Spark moves to the 2.0 version of Hive, decimal parsing should behave the same as in Hive. I am not sure about the plans to upgrade the Hive version inside Spark. Copying Yin Huai.
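For anyone hitting this before the Hive upgrade lands, a minimal sketch of the trim-before-parse behavior in plain Scala (names here are illustrative, not Spark API):

{code}
// Mirrors the HiveDecimal.create fix: trim the string before constructing
// the BigDecimal, and return None on malformed input the way
// HiveDecimal.create returns null.
object TrimDecimal {
  def parseDecimal(s: String): Option[BigDecimal] =
    try Some(BigDecimal(s.trim))
    catch { case _: NumberFormatException => None }

  def main(args: Array[String]): Unit = {
    // " 2.0" fails with plain BigDecimal(...) but succeeds after trimming.
    println(parseDecimal(" 2.0"))
    println(parseDecimal("abc"))
  }
}
{code}

In Spark SQL itself the equivalent workaround is to trim the string column before casting it to decimal.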
[~yhuai]

> SparkSQL doesn't parse decimal like Hive
> ----------------------------------------
>
>                 Key: SPARK-14586
>                 URL: https://issues.apache.org/jira/browse/SPARK-14586
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.1
>            Reporter: Stephane Maarek
>
> Create a test_data.csv with the following:
> {code:none}
> a, 2.0
> ,3.0
> {code}
> (the space is intended before the 2)
> Copy test_data.csv to hdfs:///spark_testing_2, then go into Hive and run the following statements:
> {code:sql}
> CREATE SCHEMA IF NOT EXISTS spark_testing;
> DROP TABLE IF EXISTS spark_testing.test_csv_2;
> CREATE EXTERNAL TABLE `spark_testing.test_csv_2`(
>   column_1 varchar(10),
>   column_2 decimal(4,2))
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ','
> STORED AS TEXTFILE LOCATION '/spark_testing_2'
> TBLPROPERTIES('serialization.null.format'='');
> select * from spark_testing.test_csv_2;
> OK
> a	2
> NULL	3
> {code}
> As you can see, the value " 2" gets parsed correctly to 2.
> Now onto spark-shell:
> {code:java}
> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
> sqlContext.sql("select * from spark_testing.test_csv_2").show()
> +--------+--------+
> |column_1|column_2|
> +--------+--------+
> |       a|    null|
> |    null|    3.00|
> +--------+--------+
> {code}
> As you can see, the " 2" got parsed to null. Therefore Hive and Spark don't have similar parsing behavior for decimals. I wouldn't say it is a bug per se, but it looks like a necessary improvement for the two engines to converge. Spark version is 1.5.1.
> Not sure if relevant, but Scala does parse numbers with a leading space correctly:
> {code}
> scala> "2.0".toDouble
> res21: Double = 2.0
> scala> " 2.0".toDouble
> res22: Double = 2.0
> {code}