[jira] [Created] (SPARK-15125) CSV data source recognizes empty quoted strings in the input as null.

Suresh Thalamati (JIRA) Wed, 04 May 2016 10:57:29 -0700

Suresh Thalamati created SPARK-15125:
----------------------------------------


             Summary: CSV data source recognizes empty quoted strings in the input as null.
                 Key: SPARK-15125
                 URL: https://issues.apache.org/jira/browse/SPARK-15125
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.0.0
            Reporter: Suresh Thalamati


The CSV data source does not differentiate between empty quoted strings and 
empty fields; it reads both as null. In some scenarios users want to 
distinguish these values, especially in the context of SQL, where NULL and the 
empty string have different meanings. If the input data is a dump from a 
traditional relational data source, users will see different results for their 
SQL queries. 

{code}
Repro:

Test Data: (test.csv)
year,make,model,comment,price
2017,Tesla,Mode 3,looks nice.,35000.99
2016,Chevy,Bolt,"",29000.00
2015,Porsche,"",,

scala> val df = sqlContext.read.format("csv").option("header", "true").option("inferSchema", "true").option("nullValue", null).load("/tmp/test.csv")
df: org.apache.spark.sql.DataFrame = [year: int, make: string ... 3 more fields]

scala> df.show
+----+-------+------+-----------+--------+
|year|   make| model|    comment|   price|
+----+-------+------+-----------+--------+
|2017|  Tesla|Mode 3|looks nice.|35000.99|
|2016|  Chevy|  Bolt|       null| 29000.0|
|2015|Porsche|  null|       null|    null|
+----+-------+------+-----------+--------+

Expected:
+----+-------+------+-----------+--------+
|year|   make| model|    comment|   price|
+----+-------+------+-----------+--------+
|2017|  Tesla|Mode 3|looks nice.|35000.99|
|2016|  Chevy|  Bolt|           | 29000.0|
|2015|Porsche|      |       null|    null|
+----+-------+------+-----------+--------+

{code}
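
The distinction requested above can be illustrated with a minimal plain-Scala sketch. It is neither Spark's CSV parser nor the fix under test: the `EmptyVsNull` object and its `parseLine` helper are hypothetical names introduced only for this example, and the sketch assumes comma-separated fields, double-quote quoting with `""` as the escape, and no embedded newlines. An unquoted empty field becomes `None` (standing in for SQL NULL), while a quoted empty string becomes `Some("")`.

```scala
// Hypothetical helper, not Spark's CSV parser: splits one CSV line and keeps
// the distinction between an unquoted empty field (None, i.e. SQL NULL) and
// a quoted empty string (Some("")).
object EmptyVsNull {
  def parseLine(line: String): Vector[Option[String]] = {
    val fields = Vector.newBuilder[Option[String]]
    val buf = new StringBuilder
    var inQuotes = false   // currently inside a quoted section
    var wasQuoted = false  // the current field contained a quoted section
    var i = 0

    // Emit the current field and reset the per-field state.
    def flush(): Unit = {
      val text = buf.toString
      fields += (if (text.isEmpty && !wasQuoted) None else Some(text))
      buf.clear()
      wasQuoted = false
    }

    while (i < line.length) {
      val c = line.charAt(i)
      if (inQuotes) {
        if (c == '"' && i + 1 < line.length && line.charAt(i + 1) == '"') {
          buf.append('"') // doubled quote inside a quoted field is a literal quote
          i += 1
        } else if (c == '"') {
          inQuotes = false
        } else {
          buf.append(c)
        }
      } else {
        c match {
          case '"' =>
            inQuotes = true
            wasQuoted = true
          case ',' => flush()
          case _   => buf.append(c)
        }
      }
      i += 1
    }
    flush() // the last field has no trailing comma
    fields.result()
  }
}
```

Applied to the third data row of test.csv, `EmptyVsNull.parseLine` returns `Some("")` for model and `None` for comment and price, which matches the Expected table.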

I am testing a fix for this issue and will try to submit a PR for it soon. 



