[jira] [Updated] (SPARK-14586) SparkSQL doesn't parse decimal like Hive

Stephane Maarek (JIRA) Tue, 12 Apr 2016 20:52:45 -0700

     [ https://issues.apache.org/jira/browse/SPARK-14586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Stephane Maarek updated SPARK-14586:
------------------------------------
    Description: 
Create a file test_data.csv with the following contents:
{code:none}
a, 2.0
,3.0
{code}

(The space before the 2 is intentional.)

Copy test_data.csv to hdfs:///spark_testing_2.

In Hive, run the following statements:
{code:sql}

CREATE SCHEMA IF NOT EXISTS spark_testing;
DROP TABLE IF EXISTS spark_testing.test_csv_2;
CREATE EXTERNAL TABLE `spark_testing.test_csv_2`(
  column_1 varchar(10),
  column_2 decimal(4,2))
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '/spark_testing_2'
TBLPROPERTIES('serialization.null.format'='');
select * from spark_testing.test_csv_2;
OK
a       2
NULL    3

{code}

As you can see, the value " 2" is parsed correctly as 2.

Now in spark-shell:

{code:java}

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("select * from spark_testing.test_csv_2").show()

+--------+--------+
|column_1|column_2|
+--------+--------+
|       a|    null|
|    null|    3.00|
+--------+--------+

{code}

As you can see, " 2" was parsed as null, so Hive and Spark do not parse 
decimals the same way. I wouldn't call it a bug per se, but it looks like a 
necessary improvement for the two engines to converge. 
Hive version is 1.5.1

Not sure if this is relevant, but Scala does parse numbers with a leading 
space correctly:

{code}
scala> "2.0".toDouble
res21: Double = 2.0

scala> " 2.0".toDouble
res22: Double = 2.0
{code}
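
For comparison, here is a minimal sketch of how two JVM parsers treat the same input. It is only an illustration: this report does not establish which parser Spark uses for decimal columns, so the link to the null above is an assumption. The object and method names (LeadingSpaceDemo, asDouble, asDecimalStrict, asDecimalTrimmed) are hypothetical and exist only for this example.

```scala
import scala.util.Try

// Hypothetical helper object, not part of Spark or Hive.
object LeadingSpaceDemo {
  // String.toDouble delegates to java.lang.Double.parseDouble,
  // which ignores leading and trailing whitespace.
  def asDouble(s: String): Option[Double] = Try(s.toDouble).toOption

  // The java.math.BigDecimal(String) constructor does not accept
  // whitespace and throws NumberFormatException, captured here as None.
  def asDecimalStrict(s: String): Option[java.math.BigDecimal] =
    Try(new java.math.BigDecimal(s)).toOption

  // Trimming before parsing makes the decimal parse tolerate the space.
  def asDecimalTrimmed(s: String): Option[java.math.BigDecimal] =
    asDecimalStrict(s.trim)

  def main(args: Array[String]): Unit = {
    val raw = " 2.0" // same value as in test_data.csv, leading space included
    println(asDouble(raw))         // Some(2.0)
    println(asDecimalStrict(raw))  // None
    println(asDecimalTrimmed(raw)) // Some(2.0)
  }
}
```

If Spark's decimal path behaves like the strict variant, trimming the field before conversion would be one way to match the Hive result shown above.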

  was:
create a test_data.csv with the following
{code:none}
a, 2.0
,3.0
{code}

(the space is intended before the 2)

copy the test_data.csv to hdfs:///spark_testing_2

go in hive, run the following statements
{code:sql}

CREATE SCHEMA IF NOT EXISTS spark_testing;
DROP TABLE IF EXISTS spark_testing.test_csv_2;
CREATE EXTERNAL TABLE `spark_testing.test_csv_2`(
  column_1 varchar(10),
  column_2 decimal(4,2))
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '/spark_testing_2'
TBLPROPERTIES('serialization.null.format'='');
select * from spark_testing.test_csv_2;
OK
a       2
NULL    3

{code}

As you can see, the value " 2" gets parsed correctly to 2

Now onto Spark-shell:

{code:java}

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("select * from spark_testing.test_csv_2").show()

+--------+--------+
|column_1|column_2|
+--------+--------+
|       a|    null|
|    null|    3.00|
+--------+--------+

{code}

As you can see, the " 2" got parsed to null. Therefore Hive and Spark have a 
similar parsing behavior for decimals. I wouldn't say it is a bug per se, but 
it looks like a necessary improvement for the two engines to converge. Hive 
version is 1.5.1

Not sure if relevant, but Scala does parse numbers with leading space correctly

{code}
scala> "2.0".toDouble
res21: Double = 2.0

scala> " 2.0".toDouble
res22: Double = 2.0
{code}


> SparkSQL doesn't parse decimal like Hive
> ----------------------------------------
>
>                 Key: SPARK-14586
>                 URL: https://issues.apache.org/jira/browse/SPARK-14586
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.1
>            Reporter: Stephane Maarek


