Marcel Boldt created SPARK-16460:
------------------------------------

             Summary: Spark 2.0 CSV ignores NULL value in Date format
                 Key: SPARK-16460
                 URL: https://issues.apache.org/jira/browse/SPARK-16460
             Project: Spark
          Issue Type: Bug
          Components: Input/Output
    Affects Versions: 2.0.0
         Environment: SparkR
            Reporter: Marcel Boldt
            Priority: Critical
Trying to read a CSV file into Spark (using SparkR) containing just this data row:

{code}
1|1998-01-01||
{code}

Spark 1.6.2 (Hadoop 2.6) reads it correctly:

{code}
> head(sdf)
  id          d dtwo
1  1 1998-01-01   NA
{code}

The Spark 2.0 preview (Hadoop 2.7, Rev. 14308) fails with this error:

{panel}
> Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
  org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.text.ParseException: Unparseable date: ""
	at java.text.DateFormat.parse(DateFormat.java:357)
	at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:289)
	at org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:98)
	at org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:74)
	at org.apache.spark.sql.execution.datasources.csv.DefaultSource$$anonfun$buildReader$1$$anonfun$apply$1.apply(DefaultSource.scala:124)
	at org.apache.spark.sql.execution.datasources.csv.DefaultSource$$anonfun$buildReader$1$$anonfun$apply$1.apply(DefaultSource.scala:124)
	at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
	at scala.collection.Iterator$$anon$12.hasNext(Itera...
{panel}

The problem does indeed seem to be the NULL value: with a valid date in the third CSV column, the same read succeeds. Judging from the stack trace, the empty field is apparently handed to DateFormat.parse instead of being mapped to NULL via the nullValue option.

R code:

{code}
# Sys.setenv(SPARK_HOME = 'c:/spark/spark-1.6.2-bin-hadoop2.6')
Sys.setenv(SPARK_HOME = 'C:/spark/spark-2.0.0-preview-bin-hadoop2.7')
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)

sc <- sparkR.init(
    master = "local",
    sparkPackages = "com.databricks:spark-csv_2.11:1.4.0"
)
sqlContext <- sparkRSQL.init(sc)

st <- structType(
    structField("id", "integer"),
    structField("d", "date"),
    structField("dtwo", "date")
)

sdf <- read.df(
    sqlContext,
    path = "d:/date_test.csv",
    source = "com.databricks.spark.csv",
    schema = st,
    inferSchema = "false",
    delimiter = "|",
    dateFormat = "yyyy-MM-dd",
    nullValue = "",
    mode = "PERMISSIVE"
)

head(sdf)

sparkR.stop()
{code}
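As a possible workaround until this is fixed (a minimal sketch, not verified against the 2.0 preview): read the affected column as a string, which survives the empty field, and cast it to date afterwards. The cast path does not go through the CSV date parser, so both an empty string and NULL should come out as NULL:

{code}
# Workaround sketch: declare 'dtwo' as string instead of date,
# reusing the same sqlContext and file as above.
st2 <- structType(
    structField("id", "integer"),
    structField("d", "date"),
    structField("dtwo", "string")   # string survives the empty field
)

sdf <- read.df(
    sqlContext,
    path = "d:/date_test.csv",
    source = "com.databricks.spark.csv",
    schema = st2,
    inferSchema = "false",
    delimiter = "|",
    dateFormat = "yyyy-MM-dd",
    nullValue = "",
    mode = "PERMISSIVE"
)

# cast() converts the column after parsing; an empty/NULL value
# becomes NULL instead of throwing ParseException
sdf$dtwo <- cast(sdf$dtwo, "date")
head(sdf)
{code}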