Suresh Thalamati created SPARK-15125:
----------------------------------------
Summary: CSV data source recognizes empty quoted strings in the input as null.
Key: SPARK-15125
URL: https://issues.apache.org/jira/browse/SPARK-15125
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.0.0
Reporter: Suresh Thalamati

The CSV data source does not differentiate between empty quoted strings and empty fields; it reads both as null. In some scenarios users want to tell these values apart, especially in the context of SQL, where NULL and the empty string have different meanings. If the input data is a dump from a traditional relational data source, users will see different results for their SQL queries.

{code}
Repro:

Test Data: (test.csv)
year,make,model,comment,price
2017,Tesla,Mode 3,looks nice.,35000.99
2016,Chevy,Bolt,"",29000.00
2015,Porsche,"",,

scala> val df = sqlContext.read.format("csv").option("header", "true").option("inferSchema", "true").option("nullValue", null).load("/tmp/test.csv")
df: org.apache.spark.sql.DataFrame = [year: int, make: string ... 3 more fields]

scala> df.show
+----+-------+------+-----------+--------+
|year|   make| model|    comment|   price|
+----+-------+------+-----------+--------+
|2017|  Tesla|Mode 3|looks nice.|35000.99|
|2016|  Chevy|  Bolt|       null| 29000.0|
|2015|Porsche|  null|       null|    null|
+----+-------+------+-----------+--------+

Expected:
+----+-------+------+-----------+--------+
|year|   make| model|    comment|   price|
+----+-------+------+-----------+--------+
|2017|  Tesla|Mode 3|looks nice.|35000.99|
|2016|  Chevy|  Bolt|           | 29000.0|
|2015|Porsche|      |       null|    null|
+----+-------+------+-----------+--------+
{code}

I am testing a fix for this issue and will try to submit a PR for it soon.
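
To make the requested distinction concrete, here is a minimal sketch in plain Scala. It is independent of Spark and is not the proposed fix; `QuotedEmptySketch` and `parseLine` are hypothetical names used only for illustration. It assumes a comma delimiter, double-quote quoting with `""` as the escaped quote, and no embedded newlines. An empty field that was quoted is kept as `Some("")`, while an unquoted empty field becomes `None`, which stands in for SQL NULL here.

```scala
// Hypothetical helper, not part of Spark: parses one CSV line and keeps
// empty quoted fields ("") as Some(""), while unquoted empty fields become None.
object QuotedEmptySketch {
  def parseLine(line: String): Vector[Option[String]] = {
    val fields = Vector.newBuilder[Option[String]]
    val current = new StringBuilder
    var inQuotes = false   // true while scanning inside a quoted field
    var wasQuoted = false  // true if the current field contained an opening quote

    // Emit the field collected so far and reset the per-field state.
    def flush(): Unit = {
      val text = current.toString
      fields += (if (text.isEmpty && !wasQuoted) None else Some(text))
      current.clear()
      wasQuoted = false
    }

    var i = 0
    while (i < line.length) {
      val c = line.charAt(i)
      if (inQuotes) {
        if (c == '"') {
          if (i + 1 < line.length && line.charAt(i + 1) == '"') {
            current.append('"') // doubled quote is an escaped quote character
            i += 1
          } else {
            inQuotes = false    // closing quote
          }
        } else {
          current.append(c)
        }
      } else {
        c match {
          case '"' =>
            inQuotes = true
            wasQuoted = true
          case ',' =>
            flush()
          case other =>
            current.append(other)
        }
      }
      i += 1
    }
    flush() // last field on the line
    fields.result()
  }

  def main(args: Array[String]): Unit = {
    // The two rows from test.csv that contain empty values.
    println(parseLine("2016,Chevy,Bolt,\"\",29000.00"))
    println(parseLine("2015,Porsche,\"\",,"))
  }
}
```

With this sketch, the `comment` field of the Chevy row and the `model` field of the Porsche row come back as `Some("")`, and the two trailing empty fields of the Porsche row come back as `None`, which matches the Expected table in the repro.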