[GitHub] spark pull request #23202: [SPARK-26248][SQL] Infer date type from CSV
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23202#discussion_r238696317 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala --- @@ -98,6 +100,7 @@ class CSVInferSchema(options: CSVOptions) extends Serializable { compatibleType(typeSoFar, tryParseDecimal(field)).getOrElse(StringType) case DoubleType => tryParseDouble(field) case TimestampType => tryParseTimestamp(field) +case DateType => tryParseDate(field) --- End diff -- Done. Please, have a look at the changes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23202: [SPARK-26248][SQL] Infer date type from CSV
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23202#discussion_r238199960 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala --- @@ -98,6 +100,7 @@ class CSVInferSchema(options: CSVOptions) extends Serializable { compatibleType(typeSoFar, tryParseDecimal(field)).getOrElse(StringType) case DoubleType => tryParseDouble(field) case TimestampType => tryParseTimestamp(field) +case DateType => tryParseDate(field) --- End diff -- I see. Can we try date first above? was wondering if there was a reason to try date first. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23202: [SPARK-26248][SQL] Infer date type from CSV
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23202#discussion_r238194142 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala --- @@ -98,6 +100,7 @@ class CSVInferSchema(options: CSVOptions) extends Serializable { compatibleType(typeSoFar, tryParseDecimal(field)).getOrElse(StringType) case DoubleType => tryParseDouble(field) case TimestampType => tryParseTimestamp(field) +case DateType => tryParseDate(field) --- End diff -- Just in case, I did exact match here as well, see https://github.com/apache/spark/pull/23202/files#diff-17719da188b2c15129f848f654a0e6feR174 . If date parser didn't consume all input (`pos.getIndex != field.length`), it fails. If I move it up in type inferring pipeline, it should work. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23202: [SPARK-26248][SQL] Infer date type from CSV
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23202#discussion_r238141702 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala --- @@ -98,6 +100,7 @@ class CSVInferSchema(options: CSVOptions) extends Serializable { compatibleType(typeSoFar, tryParseDecimal(field)).getOrElse(StringType) case DoubleType => tryParseDouble(field) case TimestampType => tryParseTimestamp(field) +case DateType => tryParseDate(field) --- End diff -- I mean, IIRC, if the pattern is, for instance, `-MM-dd`, 2010-10-10 and also 2018-12-02T21:04:00.123567 are parsed as dates because the current parsing library checks if the string is matched and ignore the rest of them. So, if we try date first, it will works for its default value but it we do some weird patterns, it wouldn't work again. I was thinking we can fix it if we use `DateTimeFormatter`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23202: [SPARK-26248][SQL] Infer date type from CSV
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23202#discussion_r238141062 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala --- @@ -98,6 +100,7 @@ class CSVInferSchema(options: CSVOptions) extends Serializable { compatibleType(typeSoFar, tryParseDecimal(field)).getOrElse(StringType) case DoubleType => tryParseDouble(field) case TimestampType => tryParseTimestamp(field) +case DateType => tryParseDate(field) --- End diff -- The problem here is it looks a bit odd that we try date type later. IIRC the root cause is related with date paring library. Couldn't we try date first if we switch the parsing library? I thought that's in progress. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23202: [SPARK-26248][SQL] Infer date type from CSV
GitHub user MaxGekk opened a pull request: https://github.com/apache/spark/pull/23202 [SPARK-26248][SQL] Infer date type from CSV ## What changes were proposed in this pull request? The `CSVInferSchema` class is extended to support inferring of `DateType` from CSV input. The attempt to infer `DateType` is performed after inferring `TimestampType`. ## How was this patch tested? Added new test for inferring date types from CSV . It was also tested by existing suites like `CSVInferSchemaSuite`, `CsvExpressionsSuite`, `CsvFunctionsSuite` and `CsvSuite`. You can merge this pull request into a Git repository by running: $ git pull https://github.com/MaxGekk/spark-1 csv-date-inferring Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23202.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23202 commit a8d27d64d45ca4ea207f0d403dba472b2f606968 Author: Maxim Gekk Date: 2018-12-02T21:57:02Z Test for inferring the date type commit fa915fd3fcab37be2c53bc23a77b55e3024b3994 Author: Maxim Gekk Date: 2018-12-02T21:57:21Z Inferring date type --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org