[jira] [Updated] (SPARK-19228) inferSchema function processed csv date column as string and "dateFormat" DataSource option is ignored

2018-05-20 Thread Sergey Rubtsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Rubtsov updated SPARK-19228:
---
Description: 
The current FastDateFormat parser can't properly parse dates and timestamps and
does not conform to ISO 8601.
For example, I need to process a user.csv file like this:
{code:java}
id,project,started,ended
sergey.rubtsov,project0,12/12/2012,10/10/2015
{code}
When I add date format options:
{code:java}
Dataset<Row> users = spark.read()
    .format("csv")
    .option("mode", "PERMISSIVE")
    .option("header", "true")
    .option("inferSchema", "true")
    .option("dateFormat", "dd/MM/")
    .load("src/main/resources/user.csv");
users.printSchema();
{code}
the expected schema is:
{code:java}
root
 |-- id: string (nullable = true)
 |-- project: string (nullable = true)
 |-- started: date (nullable = true)
 |-- ended: date (nullable = true)
{code}
but the actual result is:
{code:java}
root
 |-- id: string (nullable = true)
 |-- project: string (nullable = true)
 |-- started: string (nullable = true)
 |-- ended: string (nullable = true)
{code}
This means that the date columns are processed as strings and the "dateFormat" option is ignored.
If I add the option
{code:java}
.option("timestampFormat", "dd/MM/")
{code}
the result is:
{code:java}
root
 |-- id: string (nullable = true)
 |-- project: string (nullable = true)
 |-- started: timestamp (nullable = true)
 |-- ended: timestamp (nullable = true)
{code}
 


> inferSchema function processed csv date column as string and "dateFormat" 
> DataSource option is ignored
> --
>
> Key: SPARK-19228
> URL: https://issues.apache.org/jira/browse/SPARK-19228
> Project: Spark
> Issue Type: Bug
> Components: Input/Output, SQL
> Affects Versions: 2.1.0
> Reporter: Sergey Rubtsov
> Priority: Major
> Labels: easyfix
> Original Estimate: 6h
> Remaining Estimate: 6h

[jira] [Updated] (SPARK-19228) inferSchema function processed csv date column as string and "dateFormat" DataSource option is ignored

2018-05-18 Thread Sergey Rubtsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Rubtsov updated SPARK-19228:
---
Description: 
The current FastDateFormat can't properly parse dates and timestamps and does not
conform to ISO 8601.

That is why there is currently no support for inferring DateType or for a custom
"dateFormat" option in CSV parsing.
For example, I need to process a user.csv file like this:
{code:java}
id,project,started,ended
sergey.rubtsov,project0,12/12/2012,10/10/2015
{code}
When I add date format options:
{code:java}
Dataset<Row> users = spark.read()
    .format("csv")
    .option("mode", "PERMISSIVE")
    .option("header", "true")
    .option("inferSchema", "true")
    .option("dateFormat", "dd/MM/")
    .load("src/main/resources/user.csv");
users.printSchema();
{code}
the expected schema is:
{code:java}
root
 |-- id: string (nullable = true)
 |-- project: string (nullable = true)
 |-- started: date (nullable = true)
 |-- ended: date (nullable = true)
{code}
but the actual result is:
{code:java}
root
 |-- id: string (nullable = true)
 |-- project: string (nullable = true)
 |-- started: string (nullable = true)
 |-- ended: string (nullable = true)
{code}
This means that the date columns are processed as strings and the "dateFormat" option is ignored.
If I add the option
{code:java}
.option("timestampFormat", "dd/MM/")
{code}
the result is:
{code:java}
root
 |-- id: string (nullable = true)
 |-- project: string (nullable = true)
 |-- started: timestamp (nullable = true)
 |-- ended: timestamp (nullable = true)
{code}
I think the issue is somewhere in the object CSVInferSchema, function inferField,
lines 80-97: either a "tryParseDate" method needs to be added before/after
"tryParseTimestamp", or the date/timestamp processing logic needs to be changed.
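
The ordering suggested above can be illustrated outside Spark. The following is only a minimal sketch and not Spark's actual CSVInferSchema code: the class InferSketch, the bodies of tryParseDate, tryParseTimestamp and inferField, and the two patterns are all assumptions made for illustration. It uses java.time and places the date check before the timestamp check, which is one of the two orderings mentioned in the description.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.format.ResolverStyle;

public class InferSketch {
    // Hypothetical patterns; "uuuu" is the year field that works with STRICT resolution.
    static final DateTimeFormatter DATE_FORMAT =
        DateTimeFormatter.ofPattern("dd/MM/uuuu").withResolverStyle(ResolverStyle.STRICT);
    static final DateTimeFormatter TIMESTAMP_FORMAT =
        DateTimeFormatter.ofPattern("dd/MM/uuuu HH:mm:ss").withResolverStyle(ResolverStyle.STRICT);

    // True only if the whole field matches the date pattern.
    static boolean tryParseDate(String field) {
        try {
            LocalDate.parse(field, DATE_FORMAT);
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    // True only if the whole field matches the timestamp pattern.
    static boolean tryParseTimestamp(String field) {
        try {
            LocalDateTime.parse(field, TIMESTAMP_FORMAT);
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    // Try the narrower type (date) first, then timestamp, then fall back to string.
    static String inferField(String field) {
        if (tryParseDate(field)) return "date";
        if (tryParseTimestamp(field)) return "timestamp";
        return "string";
    }

    public static void main(String[] args) {
        System.out.println(inferField("12/12/2012"));
        System.out.println(inferField("12/12/2012 10:15:30"));
        System.out.println(inferField("project0"));
    }
}
```

Because java.time requires the whole input to match the pattern, a date-only value is not silently accepted by the timestamp formatter in this sketch, so the order of the two checks does not change the result here; with a more lenient parser the order would matter.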


[jira] [Commented] (SPARK-19228) inferSchema function processed csv date column as string and "dateFormat" DataSource option is ignored

2018-05-17 Thread Sergey Rubtsov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16478811#comment-16478811
 ] 

Sergey Rubtsov commented on SPARK-19228:


Java 8 contains the new java.time module, which could also fix an old bug with
parsing a string to an SQL timestamp value with microsecond accuracy:

https://issues.apache.org/jira/browse/SPARK-10681.x
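
As an illustration of the java.time point, the sketch below parses a timestamp string with six fractional digits and converts it to java.sql.Timestamp while keeping the microseconds. The class name MicrosSketch, the helper parseMicros and the pattern are assumptions made for this example; it is not code from Spark or from the linked issue.

```java
import java.sql.Timestamp;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class MicrosSketch {
    // Hypothetical pattern: six 'S' letters read exactly six fractional-second digits.
    static final DateTimeFormatter FORMAT =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSSSSS");

    // Parse the text with java.time, then convert to the SQL timestamp type.
    static Timestamp parseMicros(String text) {
        LocalDateTime parsed = LocalDateTime.parse(text, FORMAT);
        return Timestamp.valueOf(parsed);
    }

    public static void main(String[] args) {
        Timestamp ts = parseMicros("2015-10-10 12:34:56.123456");
        // getNanos() returns the fractional second in nanoseconds.
        System.out.println(ts.getNanos());
    }
}
```

In java.time the letter S denotes a fraction of a second, whereas in java.text.SimpleDateFormat it denotes milliseconds, which is why patterns with more than three fractional digits behave differently between the two APIs.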




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19228) inferSchema function processed csv date column as string and "dateFormat" DataSource option is ignored

2017-03-15 Thread Sergey Rubtsov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15925775#comment-15925775
 ] 

Sergey Rubtsov commented on SPARK-19228:


Hi [~hyukjin.kwon],
I have updated the pull request:
https://github.com/apache/spark/pull/16735
Please take a look.
I couldn't run the tests in CSVSuite locally on Windows, so I apologize for any
possible test failures.




[jira] [Commented] (SPARK-19228) inferSchema function processed csv date column as string and "dateFormat" DataSource option is ignored

2017-01-17 Thread Sergey Rubtsov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15825860#comment-15825860
 ] 

Sergey Rubtsov commented on SPARK-19228:


Okay, I will do it.




[jira] [Updated] (SPARK-19228) inferSchema function processed csv date column as string and "dateFormat" DataSource option is ignored

2017-01-17 Thread Sergey Rubtsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Rubtsov updated SPARK-19228:
---
Description: 
I need to process user.csv like this:
{code}
id,project,started,ended
sergey.rubtsov,project0,12/12/2012,10/10/2015
{code}
When I add date format options:
{code}
Dataset<Row> users = spark.read()
    .format("csv")
    .option("mode", "PERMISSIVE")
    .option("header", "true")
    .option("inferSchema", "true")
    .option("dateFormat", "dd/MM/")
    .load("src/main/resources/user.csv");
users.printSchema();
{code}
the expected schema is:
{code}
root
 |-- id: string (nullable = true)
 |-- project: string (nullable = true)
 |-- started: date (nullable = true)
 |-- ended: date (nullable = true)
{code}
but the actual result is: 
{code}
root
 |-- id: string (nullable = true)
 |-- project: string (nullable = true)
 |-- started: string (nullable = true)
 |-- ended: string (nullable = true)
{code}
This means that the date columns are processed as strings and the "dateFormat" option is ignored.
If I add the option
{code}
.option("timestampFormat", "dd/MM/")
{code}
the result is:
{code}
root
 |-- id: string (nullable = true)
 |-- project: string (nullable = true)
 |-- started: timestamp (nullable = true)
 |-- ended: timestamp (nullable = true)
{code}

I think the issue is somewhere in the object CSVInferSchema, function inferField,
lines 80-97: either a "tryParseDate" method needs to be added before/after
"tryParseTimestamp", or the date/timestamp processing logic needs to be changed.


[jira] [Updated] (SPARK-19228) inferSchema function processed csv date column as string and "dateFormat" DataSource option is ignored

2017-01-15 Thread Sergey Rubtsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Rubtsov updated SPARK-19228:
---
Description: 
I need to process user.csv like this:
{code}
id,project,started,ended
sergey.rubtsov,project0,12/12/2012,10/10/2015
{code}
When I add date format options:
{code}
Dataset<Row> users = spark.read()
    .format("csv")
    .option("mode", "PERMISSIVE")
    .option("header", "true")
    .option("inferSchema", "true")
    .option("dateFormat", "dd/MM/")
    .load("src/main/resources/user.csv");
users.printSchema();
{code}
the expected schema is:
{code}
root
 |-- id: string (nullable = true)
 |-- project: string (nullable = true)
 |-- started: date (nullable = true)
 |-- ended: date (nullable = true)
{code}
but the actual result is: 
{code}
root
 |-- id: string (nullable = true)
 |-- project: string (nullable = true)
 |-- started: string (nullable = true)
 |-- ended: string (nullable = true)
{code}
This means that the date columns are processed as strings and the "dateFormat" option is ignored.
If I add the option
{code}
.option("timestampFormat", "dd/MM/")
{code}
the result is:
{code}
root
 |-- id: string (nullable = true)
 |-- project: string (nullable = true)
 |-- started: timestamp (nullable = true)
 |-- ended: timestamp (nullable = true)
{code}

I think the issue is somewhere in the object CSVInferSchema, function inferField,
lines 80-97: either a "tryParseDate" method needs to be added before/after
"tryParseTimestamp", or the date/timestamp processing logic needs to be changed.


[jira] [Created] (SPARK-19228) inferSchema function processed csv date column as string and "dateFormat" DataSource option is ignored

2017-01-15 Thread Sergey Rubtsov (JIRA)
Sergey Rubtsov created SPARK-19228:
--

Summary: inferSchema function processed csv date column as string and "dateFormat" DataSource option is ignored
Key: SPARK-19228
URL: https://issues.apache.org/jira/browse/SPARK-19228
Project: Spark
Issue Type: Bug
Components: Input/Output, SQL
Affects Versions: 2.1.0
Reporter: Sergey Rubtsov


I need to process user.csv like this:
{code}
id,project,started,ended
sergey.rubtsov,project0,12/12/2012,10/10/2015
{code}
When I add date format options:
{code}
Dataset<Row> users = spark.read()
    .format("csv")
    .option("mode", "PERMISSIVE")
    .option("header", "true")
    .option("inferSchema", "true")
    .option("dateFormat", "dd/MM/")
    .load("src/main/resources/user.csv");
users.printSchema();
{code}
the expected schema is:
{code}
root
 |-- id: string (nullable = true)
 |-- project: string (nullable = true)
 |-- started: date (nullable = true)
 |-- ended: date (nullable = true)
{code}
but the actual result is: 
{code}
root
 |-- id: string (nullable = true)
 |-- project: string (nullable = true)
 |-- started: string (nullable = true)
 |-- ended: string (nullable = true)
{code}
This means that the date columns are processed as strings and the "dateFormat" option is ignored.
If I add the option
{code}
.option("timestampFormat", "dd/MM/")
{code}
the result is:
{code}
root
 |-- id: string (nullable = true)
 |-- project: string (nullable = true)
 |-- started: timestamp (nullable = true)
 |-- ended: timestamp (nullable = true)
{code}

I think the issue is somewhere in the object CSVInferSchema, function inferField,
lines 80-97: either a "tryParseDate" method needs to be added before/after
"tryParseTimestamp", or the date/timestamp processing logic needs to be changed.


