[jira] [Commented] (SPARK-17039) cannot read null dates from csv file

2016-08-15 Thread Barry Becker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421688#comment-15421688
 ] 

Barry Becker commented on SPARK-17039:
--

I did notice that https://github.com/databricks/spark-csv/issues/308 was not 
ported to Spark 2.x. In other words, when you specify dateFormat while writing 
with format("csv"), the dates get written as longs instead of strings in the 
specified dateFormat. I will open a separate issue for it.
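To illustrate the difference (a stdlib Python sketch, not Spark code; the timestamp value is made up), the expected cell is the pattern-formatted string, while the reported behavior emits the raw epoch value:

```python
from datetime import datetime, timezone

# Hypothetical timestamp value; the Java pattern yyyy-MM-dd'T'HH:mm:ss
# corresponds to Python's %Y-%m-%dT%H:%M:%S.
ts = datetime(2015, 3, 9, 12, 1, 0, tzinfo=timezone.utc)

expected_cell = ts.strftime("%Y-%m-%dT%H:%M:%S")  # what dateFormat should produce
actual_cell = str(int(ts.timestamp() * 1000))     # epoch millis, as reportedly written

print(expected_cell)  # 2015-03-09T12:01:00
print(actual_cell)
```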

> cannot read null dates from csv file
> 
>
> Key: SPARK-17039
> URL: https://issues.apache.org/jira/browse/SPARK-17039
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
>Reporter: Barry Becker
>
> I see this exact same bug as reported in this [stack overflow post|http://stackoverflow.com/questions/38265640/spark-2-0-pre-csv-parsing-error-if-missing-values-in-date-column] using Spark 2.0.0 (released version).
> In Scala, I read a CSV using:
> {code}
> sqlContext.read
>   .format("csv")
>   .option("header", "false")
>   .option("inferSchema", "false")
>   .option("nullValue", "?")
>   .option("dateFormat", "yyyy-MM-dd'T'HH:mm:ss")
>   .schema(dfSchema)
>   .csv(dataFile)
> {code}
> The data contains some null dates (represented with ?).
> The error I get is:
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 8.0 failed 1 times, most recent failure: Lost task 0.0 in stage 8.0 
> (TID 10, localhost): java.text.ParseException: Unparseable date: "?"
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17039) cannot read null dates from csv file

2016-08-15 Thread Barry Becker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421611#comment-15421611
 ] 

Barry Becker commented on SPARK-17039:
--

I was able to pull the patch (https://github.com/apache/spark/pull/14118) and 
verify that it fixes my issue with reading null values. I hope that this patch 
will make it in for 2.0.1. Thanks.




[jira] [Commented] (SPARK-17039) cannot read null dates from csv file

2016-08-12 Thread Barry Becker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15419768#comment-15419768
 ] 

Barry Becker commented on SPARK-17039:
--

I read the comments in SPARK-16462. It looks like it would fix this issue, but 
I have not tried it yet. I will look into building the tip of the 2.0.1 branch 
locally later next week and trying it out.




[jira] [Commented] (SPARK-17039) cannot read null dates from csv file

2016-08-12 Thread Liwei Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15419730#comment-15419730
 ] 

Liwei Lin commented on SPARK-17039:
---

Thanks [~barrybecker4] for reporting this. Please also see 
https://issues.apache.org/jira/browse/SPARK-16462.




[jira] [Commented] (SPARK-17039) cannot read null dates from csv file

2016-08-12 Thread Barry Becker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15419073#comment-15419073
 ] 

Barry Becker commented on SPARK-17039:
--

There are literal ?'s in the data file. The "nullValue" option indicates that 
those ?'s should be read as null values. I also added the "dateFormat" option, 
which specifies how the dates in the file should be parsed.
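A minimal sketch of the intended order of operations in the reader: check the nullValue token before attempting date parsing (stdlib Python analog; parse_cell is a hypothetical helper, not Spark's actual code):

```python
from datetime import datetime

NULL_VALUE = "?"                    # the reader's nullValue option
DATE_FORMAT = "%Y-%m-%dT%H:%M:%S"   # stdlib analog of yyyy-MM-dd'T'HH:mm:ss

def parse_cell(raw):
    """Return None for the null token; otherwise parse as a timestamp."""
    if raw == NULL_VALUE:           # null check must happen before parsing
        return None
    return datetime.strptime(raw, DATE_FORMAT)

print(parse_cell("?"))                    # None
print(parse_cell("2015-03-09T00:00:00"))  # 2015-03-09 00:00:00
```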

Let me try to provide more information so you can reproduce.

Here is the schema that I am specifying (dfSchema above):
{code}
StructType(
  StructField(string normal,StringType,true),
  StructField(Years,TimestampType,true),
  StructField(Months,TimestampType,true),
  StructField(WeekDays,TimestampType,true),
  StructField(Days,TimestampType,true),
  StructField(DaysWithNull,TimestampType,true),
  StructField(Hours,TimestampType,true),
  StructField(Minutes,TimestampType,true),
  StructField(normal dates,TimestampType,true),
  StructField(Wide Range Dates,TimestampType,true),
  StructField(Narrow,TimestampType,true),
  StructField(Far Future,TimestampType,true),
  StructField(Mostly Null,TimestampType,true),
  StructField(All Same Date,TimestampType,true),
  StructField(Past/Future,TimestampType,true),
  StructField(All nulls,TimestampType,true),
  StructField(Seconds,TimestampType,true))
{code}

And here are the contents of the CSV data file (note that there are lots of 
nulls). This worked using the Databricks spark-csv library as a dependency in 
Spark 1.6.2:
{code}
foo 2015-03-09T00:00:00 2015-03-09T00:00:00 2015-03-09T00:00:00 2015-03-09T00:00:00 2015-03-09T00:00:00 2015-03-09T00:00:00 2015-03-09T00:01:00 2007-11-09T00:00:00 1967-11-09T00:00:00 2015-03-09T12:00:00 2700-01-01T00:00:00 2015-03-09T00:00:00 2015-03-09T00:00:00 1983-03-09T00:00:00 ?   2015-03-09T12:01:00
bar 2016-03-09T00:00:00 2015-04-09T00:00:00 2015-03-10T00:00:00 2015-03-10T00:00:00 ?   2015-03-09T01:00:00 2015-03-09T00:03:00 2007-10-02T00:00:00 1987-10-02T00:00:00 2015-03-09T12:03:00 3701-01-01T00:00:00 2015-04-09T00:00:00 2015-03-09T00:00:00 1865-04-09T00:00:00 ?   2015-03-09T12:01:01
baz 2017-03-09T00:00:00 2015-05-09T00:00:00 2015-03-11T00:00:00 2015-03-11T00:00:00 2015-03-11T00:00:00 2015-03-09T02:00:00 2015-03-09T00:05:00 1999-04-04T03:00:00 1999-02-03T00:00:00 2015-03-09T12:08:00 4702-01-01T00:00:00 ?   2015-03-09T00:00:00 1777-05-09T00:00:00 ?   2015-03-09T12:01:03
but 2018-03-09T00:00:00 2015-06-09T00:00:00 2015-03-12T00:00:00 2015-03-12T00:00:00 2015-03-12T00:00:00 2015-03-09T03:00:00 2015-03-09T00:08:00 2025-10-10T00:00:00 2025-10-10T00:00:00 2015-03-09T12:10:00 4103-01-01T00:00:00 2015-06-09T00:00:00 2015-03-09T00:00:00 2089-06-09T00:00:00 ?   2015-03-09T12:01:05
fooo 2019-03-09T00:00:00 2015-07-09T00:00:00 2015-03-13T00:00:00 2015-03-13T00:00:00 2015-03-13T00:00:00 2015-03-09T04:00:00 2015-03-09T00:09:00 2004-02-23T00:00:00 2004-02-23T00:00:00 2015-03-09T12:15:00 4204-01-01T00:00:00 ?   2015-03-09T00:00:00 2125-07-09T00:00:00 ?   2015-03-09T12:01:07
bar 2020-03-09T00:00:00 2015-08-09T00:00:00 2015-03-16T00:00:00 2015-03-14T00:00:00 2015-03-14T00:00:00 2015-03-09T05:00:00 2015-03-09T00:12:00 2019-03-04T00:00:00 3019-03-04T00:00:00 2015-03-09T12:20:00 4305-01-01T00:00:00 2015-08-09T00:00:00 2015-03-09T00:00:00 2215-08-09T00:00:00 ?   2015-03-09T12:01:09
baz 2021-03-09T00:00:00 2015-09-09T00:00:00 2015-03-17T00:00:00 2015-03-15T00:00:00 2015-03-15T00:00:00 2015-03-09T06:00:00 2015-03-09T00:20:00 1999-04-04T02:34:00 ?   2015-03-09T12:25:00 4406-01-01T00:00:00 2015-09-09T00:00:00 2015-03-09T00:00:00 1754-09-09T00:00:00 ?   2015-03-09T12:01:11
but 2022-03-09T00:00:00 2015-10-09T00:00:00 2015-03-18T00:00:00 2015-03-16T00:00:00 ?   2015-03-09T07:00:00 2015-03-09T00:30:00 1999-03-01T00:00:00 1909-03-01T00:00:00 2015-03-09T12:30:00 4507-01-01T00:00:00 ?   2015-03-09T00:00:00 1958-10-09T00:00:00 ?   2015-03-09T12:01:00
bar 2023-03-09T00:00:00 2015-11-09T00:00:00 2015-03-19T00:00:00 2015-03-17T00:00:00 2015-03-17T00:00:00 2015-03-09T08:00:00 2015-03-09T00:35:00 2001-02-12T00:00:00 ?   2015-03-09T12:35:00 4608-01-01T00:00:00 2015-11-09T00:00:00 2015-03-09T00:00:00 3000-11-09T00:00:00 ?   2015-03-09T12:01:00
here is a really really really long string value 2024-03-09T00:00:00 2015-12-09T00:00:00 2015-03-20T00:00:00 2015-03-18T00:00:00 2015-03-18T00:00:00 2015-03-09T09:00:00 

[jira] [Commented] (SPARK-17039) cannot read null dates from csv file

2016-08-12 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15419062#comment-15419062
 ] 

Sean Owen commented on SPARK-17039:
---

Oh right, I looked right past that. But if a null date is written as "?" per 
the config, that wouldn't be a valid date, right?




[jira] [Commented] (SPARK-17039) cannot read null dates from csv file

2016-08-12 Thread Barry Becker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15419040#comment-15419040
 ] 

Barry Becker commented on SPARK-17039:
--

I do specify a schema (.schema(dfSchema)), and it says that the column is a 
date column. I left it out because there were lots of other columns, and I need 
to spend some time to simplify the example. This is from a unit test that 
worked fine using Spark 1.6.2, but fails using Spark 2.0.0. I'm pretty sure it's 
a real bug. The example in the stack overflow post may provide a better 
reproducible case.




[jira] [Commented] (SPARK-17039) cannot read null dates from csv file

2016-08-12 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15419017#comment-15419017
 ] 

Sean Owen commented on SPARK-17039:
---

Hm, how are they being parsed as dates -- or is that the issue? You don't infer 
or specify a schema, but you say the column is indeed a date column. If it's a 
date column, "?" is not valid, and the error is correct.
