[GitHub] spark pull request #23202: [SPARK-26248][SQL] Infer date type from CSV

2018-12-04 Thread MaxGekk
Github user MaxGekk commented on a diff in the pull request:

https://github.com/apache/spark/pull/23202#discussion_r238696317
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala
 ---
@@ -98,6 +100,7 @@ class CSVInferSchema(options: CSVOptions) extends 
Serializable {
   compatibleType(typeSoFar, 
tryParseDecimal(field)).getOrElse(StringType)
 case DoubleType => tryParseDouble(field)
 case TimestampType => tryParseTimestamp(field)
+case DateType => tryParseDate(field)
--- End diff --

Done. Please, have a look at the changes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23202: [SPARK-26248][SQL] Infer date type from CSV

2018-12-03 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23202#discussion_r238199960
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala
 ---
@@ -98,6 +100,7 @@ class CSVInferSchema(options: CSVOptions) extends 
Serializable {
   compatibleType(typeSoFar, 
tryParseDecimal(field)).getOrElse(StringType)
 case DoubleType => tryParseDouble(field)
 case TimestampType => tryParseTimestamp(field)
+case DateType => tryParseDate(field)
--- End diff --

I see. Can we try date first above? was wondering if there was a reason to 
try date first.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23202: [SPARK-26248][SQL] Infer date type from CSV

2018-12-03 Thread MaxGekk
Github user MaxGekk commented on a diff in the pull request:

https://github.com/apache/spark/pull/23202#discussion_r238194142
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala
 ---
@@ -98,6 +100,7 @@ class CSVInferSchema(options: CSVOptions) extends 
Serializable {
   compatibleType(typeSoFar, 
tryParseDecimal(field)).getOrElse(StringType)
 case DoubleType => tryParseDouble(field)
 case TimestampType => tryParseTimestamp(field)
+case DateType => tryParseDate(field)
--- End diff --

Just in case, I did exact match here as well, see 
https://github.com/apache/spark/pull/23202/files#diff-17719da188b2c15129f848f654a0e6feR174
 . If date parser didn't consume all input (`pos.getIndex != field.length`), it 
fails. If I move it up in type inferring pipeline, it should work. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23202: [SPARK-26248][SQL] Infer date type from CSV

2018-12-02 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23202#discussion_r238141702
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala
 ---
@@ -98,6 +100,7 @@ class CSVInferSchema(options: CSVOptions) extends 
Serializable {
   compatibleType(typeSoFar, 
tryParseDecimal(field)).getOrElse(StringType)
 case DoubleType => tryParseDouble(field)
 case TimestampType => tryParseTimestamp(field)
+case DateType => tryParseDate(field)
--- End diff --

I mean, IIRC, if the pattern is, for instance, `-MM-dd`, 2010-10-10 and 
also 2018-12-02T21:04:00.123567 are parsed as dates because the current parsing 
library checks if the string is matched and ignore the rest of them.

So, if we try date first, it will works for its default value but it we do 
some weird patterns, it wouldn't work again.

I was thinking we can fix it if we use `DateTimeFormatter`.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23202: [SPARK-26248][SQL] Infer date type from CSV

2018-12-02 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23202#discussion_r238141062
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala
 ---
@@ -98,6 +100,7 @@ class CSVInferSchema(options: CSVOptions) extends 
Serializable {
   compatibleType(typeSoFar, 
tryParseDecimal(field)).getOrElse(StringType)
 case DoubleType => tryParseDouble(field)
 case TimestampType => tryParseTimestamp(field)
+case DateType => tryParseDate(field)
--- End diff --

The problem here is it looks a bit odd that we try date type later. IIRC 
the root cause is related with date paring library. Couldn't we try date first 
if we switch the parsing library? I thought that's in progress.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23202: [SPARK-26248][SQL] Infer date type from CSV

2018-12-02 Thread MaxGekk
GitHub user MaxGekk opened a pull request:

https://github.com/apache/spark/pull/23202

[SPARK-26248][SQL] Infer date type from CSV

## What changes were proposed in this pull request?

The `CSVInferSchema` class is extended to support inferring of `DateType` 
from CSV input. The attempt to infer `DateType` is performed after inferring 
`TimestampType`.

## How was this patch tested?

Added new test for inferring date types from CSV . It was also tested by 
existing suites like `CSVInferSchemaSuite`, `CsvExpressionsSuite`, 
`CsvFunctionsSuite` and `CsvSuite`.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/MaxGekk/spark-1 csv-date-inferring

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23202.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23202


commit a8d27d64d45ca4ea207f0d403dba472b2f606968
Author: Maxim Gekk 
Date:   2018-12-02T21:57:02Z

Test for inferring the date type

commit fa915fd3fcab37be2c53bc23a77b55e3024b3994
Author: Maxim Gekk 
Date:   2018-12-02T21:57:21Z

Inferring date type




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org