[jira] [Commented] (SPARK-17039) cannot read null dates from csv file
[ https://issues.apache.org/jira/browse/SPARK-17039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421688#comment-15421688 ]

Barry Becker commented on SPARK-17039:

I did notice that https://github.com/databricks/spark-csv/issues/308 was not ported to Spark 2.x. In other words, when you specify the dateFormat option when writing with format("csv"), the dates are written as longs instead of as dates in the specified dateFormat. I will open a separate issue for it.

> cannot read null dates from csv file
>
>                 Key: SPARK-17039
>                 URL: https://issues.apache.org/jira/browse/SPARK-17039
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 2.0.0
>            Reporter: Barry Becker
>
> I see this exact same bug as reported in this [stack overflow post|http://stackoverflow.com/questions/38265640/spark-2-0-pre-csv-parsing-error-if-missing-values-in-date-column]
> using Spark 2.0.0 (released version).
> In scala, I read a csv using
> sqlContext.read
>   .format("csv")
>   .option("header", "false")
>   .option("inferSchema", "false")
>   .option("nullValue", "?")
>   .option("dateFormat", "yyyy-MM-dd'T'HH:mm:ss")
>   .schema(dfSchema)
>   .csv(dataFile)
> The data contains some null dates (represented with ?).
> The error I get is:
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 8.0 failed 1 times, most recent failure: Lost task 0.0 in stage 8.0 (TID 10, localhost): java.text.ParseException: Unparseable date: "?"
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
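As a rough illustration of the difference described in the comment above (a date rendered with a dateFormat pattern versus the same date rendered as a raw long), here is a minimal Scala sketch that uses only java.text.SimpleDateFormat. It is not Spark's CSV writer. The object DateRendering, its two methods, the fixed UTC time zone, and the choice of epoch milliseconds for the "long" are assumptions made for illustration; the thread does not say which unit the written longs use.

```scala
import java.text.SimpleDateFormat
import java.util.{Date, TimeZone}

// Hypothetical helper object, not part of Spark or spark-csv.
object DateRendering {
  // Render a date using an explicit pattern, in UTC so the output is reproducible.
  def formatted(d: Date, pattern: String): String = {
    val fmt = new SimpleDateFormat(pattern)
    fmt.setTimeZone(TimeZone.getTimeZone("UTC"))
    fmt.format(d)
  }

  // Render the same date as a bare number (epoch milliseconds, by assumption).
  def asLong(d: Date): String = d.getTime.toString

  def main(args: Array[String]): Unit = {
    val d = new Date(1425859200000L) // 2015-03-09T00:00:00 UTC
    println(formatted(d, "yyyy-MM-dd'T'HH:mm:ss")) // what a writer honouring the pattern would emit
    println(asLong(d))                             // the kind of numeric output the comment complains about
  }
}
```

A CSV consumer expecting the first form would not be able to parse the second with the same pattern, which is why the comment treats it as a separate issue.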
[jira] [Commented] (SPARK-17039) cannot read null dates from csv file
[ https://issues.apache.org/jira/browse/SPARK-17039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421611#comment-15421611 ]

Barry Becker commented on SPARK-17039:

I was able to pull the patch (https://github.com/apache/spark/pull/14118) and verify that it fixes my issue with reading null values. I hope that this patch will make it in for 2.0.1. Thanks.
[jira] [Commented] (SPARK-17039) cannot read null dates from csv file
[ https://issues.apache.org/jira/browse/SPARK-17039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15419768#comment-15419768 ]

Barry Becker commented on SPARK-17039:

I read the comments in SPARK-16462. It looks like it would fix this issue, but I have not tried it yet. I will try to build the tip of 2.0.1 locally later next week and try it out.
[jira] [Commented] (SPARK-17039) cannot read null dates from csv file
[ https://issues.apache.org/jira/browse/SPARK-17039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15419730#comment-15419730 ]

Liwei Lin commented on SPARK-17039:

Thanks [~barrybecker4] for reporting this. Please also see https://issues.apache.org/jira/browse/SPARK-16462.
[jira] [Commented] (SPARK-17039) cannot read null dates from csv file
[ https://issues.apache.org/jira/browse/SPARK-17039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15419073#comment-15419073 ]

Barry Becker commented on SPARK-17039:

There are literal ?'s in the datafile. The "nullValue" option indicates that those ?'s should be read as null values. I also added the "dateFormat" option, which describes how the dates in the file should be read.
Let me try to provide more information so you can reproduce. Here is the schema that I am specifying (dfSchema above):
{code}
StructType(StructField(string normal,StringType,true),
  StructField(Years,TimestampType,true),
  StructField(Months,TimestampType,true),
  StructField(WeekDays,TimestampType,true),
  StructField(Days,TimestampType,true),
  StructField(DaysWithNull,TimestampType,true),
  StructField(Hours,TimestampType,true),
  StructField(Minutes,TimestampType,true),
  StructField(normal dates,TimestampType,true),
  StructField(Wide Range Dates,TimestampType,true),
  StructField(Narrow,TimestampType,true),
  StructField(Far Future,TimestampType,true),
  StructField(Mostly Null,TimestampType,true),
  StructField(All Same Date,TimestampType,true),
  StructField(Past/Future,TimestampType,true),
  StructField(All nulls,TimestampType,true),
  StructField(Seconds,TimestampType,true))
{code}
and here are the contents of the csv datafile (note that there are lots of nulls). This worked in Spark 1.6.2 with the Databricks spark-csv library as a dependency.
{code}
foo 2015-03-09T00:00:00 2015-03-09T00:00:00 2015-03-09T00:00:00 2015-03-09T00:00:00 2015-03-09T00:00:00 2015-03-09T00:00:00 2015-03-09T00:01:00 2007-11-09T00:00:00 1967-11-09T00:00:00 2015-03-09T12:00:00 2700-01-01T00:00:00 2015-03-09T00:00:00 2015-03-09T00:00:00 1983-03-09T00:00:00 ? 2015-03-09T12:01:00
bar 2016-03-09T00:00:00 2015-04-09T00:00:00 2015-03-10T00:00:00 2015-03-10T00:00:00 ? 2015-03-09T01:00:00 2015-03-09T00:03:00 2007-10-02T00:00:00 1987-10-02T00:00:00 2015-03-09T12:03:00 3701-01-01T00:00:00 2015-04-09T00:00:00 2015-03-09T00:00:00 1865-04-09T00:00:00 ? 2015-03-09T12:01:01
baz 2017-03-09T00:00:00 2015-05-09T00:00:00 2015-03-11T00:00:00 2015-03-11T00:00:00 2015-03-11T00:00:00 2015-03-09T02:00:00 2015-03-09T00:05:00 1999-04-04T03:00:00 1999-02-03T00:00:00 2015-03-09T12:08:00 4702-01-01T00:00:00 ? 2015-03-09T00:00:00 1777-05-09T00:00:00 ? 2015-03-09T12:01:03
but 2018-03-09T00:00:00 2015-06-09T00:00:00 2015-03-12T00:00:00 2015-03-12T00:00:00 2015-03-12T00:00:00 2015-03-09T03:00:00 2015-03-09T00:08:00 2025-10-10T00:00:00 2025-10-10T00:00:00 2015-03-09T12:10:00 4103-01-01T00:00:00 2015-06-09T00:00:00 2015-03-09T00:00:00 2089-06-09T00:00:00 ? 2015-03-09T12:01:05
fooo 2019-03-09T00:00:00 2015-07-09T00:00:00 2015-03-13T00:00:00 2015-03-13T00:00:00 2015-03-13T00:00:00 2015-03-09T04:00:00 2015-03-09T00:09:00 2004-02-23T00:00:00 2004-02-23T00:00:00 2015-03-09T12:15:00 4204-01-01T00:00:00 ? 2015-03-09T00:00:00 2125-07-09T00:00:00 ? 2015-03-09T12:01:07
bar 2020-03-09T00:00:00 2015-08-09T00:00:00 2015-03-16T00:00:00 2015-03-14T00:00:00 2015-03-14T00:00:00 2015-03-09T05:00:00 2015-03-09T00:12:00 2019-03-04T00:00:00 3019-03-04T00:00:00 2015-03-09T12:20:00 4305-01-01T00:00:00 2015-08-09T00:00:00 2015-03-09T00:00:00 2215-08-09T00:00:00 ? 2015-03-09T12:01:09
baz 2021-03-09T00:00:00 2015-09-09T00:00:00 2015-03-17T00:00:00 2015-03-15T00:00:00 2015-03-15T00:00:00 2015-03-09T06:00:00 2015-03-09T00:20:00 1999-04-04T02:34:00 ? 2015-03-09T12:25:00 4406-01-01T00:00:00 2015-09-09T00:00:00 2015-03-09T00:00:00 1754-09-09T00:00:00 ? 2015-03-09T12:01:11
but 2022-03-09T00:00:00 2015-10-09T00:00:00 2015-03-18T00:00:00 2015-03-16T00:00:00 ? 2015-03-09T07:00:00 2015-03-09T00:30:00 1999-03-01T00:00:00 1909-03-01T00:00:00 2015-03-09T12:30:00 4507-01-01T00:00:00 ? 2015-03-09T00:00:00 1958-10-09T00:00:00 ? 2015-03-09T12:01:00
bar 2023-03-09T00:00:00 2015-11-09T00:00:00 2015-03-19T00:00:00 2015-03-17T00:00:00 2015-03-17T00:00:00 2015-03-09T08:00:00 2015-03-09T00:35:00 2001-02-12T00:00:00 ? 2015-03-09T12:35:00 4608-01-01T00:00:00 2015-11-09T00:00:00 2015-03-09T00:00:00 3000-11-09T00:00:00 ? 2015-03-09T12:01:00
here is a really really really long string value 2024-03-09T00:00:00 2015-12-09T00:00:00 2015-03-20T00:00:00 2015-03-18T00:00:00 2015-03-18T00:00:00 2015-03-09T09:00:00
[jira] [Commented] (SPARK-17039) cannot read null dates from csv file
[ https://issues.apache.org/jira/browse/SPARK-17039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15419062#comment-15419062 ]

Sean Owen commented on SPARK-17039:

Oh right, I looked right past that. But if a date is null, converted to "?" per config, that wouldn't be a valid date, right?

> cannot read null dates from csv file
>
>                 Key: SPARK-17039
>                 URL: https://issues.apache.org/jira/browse/SPARK-17039
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 2.0.0
>            Reporter: Barry Becker
>
> I see this exact same bug as reported in this [stack overflow post|http://stackoverflow.com/questions/38265640/spark-2-0-pre-csv-parsing-error-if-missing-values-in-date-column]
> using Spark 2.0.0 (released version).
> In scala, I read a csv using
> sqlContext.read
>   .format("csv")
>   .option("header", "false")
>   .option("inferSchema", "false")
>   .option("nullValue", "?")
>   .schema(dfSchema)
>   .csv(dataFile)
> The data contains some null dates (represented with ?).
> The error I get is:
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 8.0 failed 1 times, most recent failure: Lost task 0.0 in stage 8.0 (TID 10, localhost): java.text.ParseException: Unparseable date: "?"
> {code}
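The behaviour the reporter expects (a cell equal to the configured nullValue becomes null before any date parsing is attempted) can be sketched in plain Scala without Spark. This is only an illustration of the ordering of the two checks, not the code in Spark or in the patch mentioned in this thread. The object NullAwareDateParsing, the helper parseCell, the UTC time zone, and the pattern yyyy-MM-dd'T'HH:mm:ss (inferred from sample values such as 2015-03-09T00:00:00) are assumptions.

```scala
import java.text.SimpleDateFormat
import java.util.{Date, TimeZone}
import scala.util.Try

// Hypothetical helper object, not part of Spark.
object NullAwareDateParsing {
  // A fresh formatter per call, since SimpleDateFormat is not thread-safe.
  private def newFormat(): SimpleDateFormat = {
    val fmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss") // assumed pattern
    fmt.setTimeZone(TimeZone.getTimeZone("UTC"))             // fixed zone for reproducible results
    fmt.setLenient(false)
    fmt
  }

  // Compare against the null token first; only non-null cells reach the date parser.
  def parseCell(raw: String, nullValue: String): Option[Date] =
    if (raw == nullValue) None else Some(newFormat().parse(raw))

  def main(args: Array[String]): Unit = {
    // Handing the null token straight to the parser throws a ParseException,
    // which is the failure reported in this issue.
    println(Try(newFormat().parse("?")).isFailure)

    // With the null check in front, the same token yields None instead.
    println(parseCell("?", "?"))

    // Ordinary values still parse as dates.
    println(parseCell("2015-03-09T00:00:00", "?").map(_.getTime))
  }
}
```

Under this ordering "?" never needs to be a valid date, because it is never treated as one.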
[jira] [Commented] (SPARK-17039) cannot read null dates from csv file
[ https://issues.apache.org/jira/browse/SPARK-17039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15419040#comment-15419040 ]

Barry Becker commented on SPARK-17039:

I do specify a schema (.schema(dfSchema)), and it says that the column is a date column. I left it out because there were lots of other columns, and I need to spend some time to simplify the example. This is from a unit test that worked fine using Spark 1.6.2, but fails using Spark 2.0.0. I'm pretty sure it's a real bug. The example in the stack overflow post may provide a better reproducible case.
[jira] [Commented] (SPARK-17039) cannot read null dates from csv file
[ https://issues.apache.org/jira/browse/SPARK-17039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15419017#comment-15419017 ]

Sean Owen commented on SPARK-17039:

Hm, how are they being parsed as dates -- or is that the issue? You don't infer or specify a schema, but say the column is indeed a date column. If it's a date column, "?" is not valid, and the error is correct.