[GitHub] [spark] Jonathancui123 commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference
Jonathancui123 commented on code in PR #36871: URL: https://github.com/apache/spark/pull/36871#discussion_r924930337 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala: ## @@ -148,7 +148,28 @@ class CSVOptions( // A language tag in IETF BCP 47 format val locale: Locale = parameters.get("locale").map(Locale.forLanguageTag).getOrElse(Locale.US) - val dateFormatInRead: Option[String] = parameters.get("dateFormat") + /** + * Infer columns with all valid date entries as date type (otherwise inferred as timestamp type). + * Disabled by default for backwards compatibility and performance. When enabled, date entries in + * timestamp columns will be cast to timestamp upon parsing. Not compatible with + * legacyTimeParserPolicy == LEGACY since legacy date parser will accept extra trailing characters + */ + val inferDate = { +val inferDateFlag = getBool("inferDate") +if (SQLConf.get.legacyTimeParserPolicy == LegacyBehaviorPolicy.LEGACY && inferDateFlag) { Review Comment: All tests have passed after the rollback - I think this is ready to merge -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
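The guard being discussed in this diff can be sketched in isolation. This is a hedged, standalone illustration: `TimeParserPolicy` and `resolveInferDate` are made-up stand-ins for Spark's `LegacyBehaviorPolicy` and the `CSVOptions` logic, not the real API.

```scala
// Sketch of rejecting inferDate=true under the legacy time parser policy.
// The legacy parser is lenient about trailing characters, so date inference
// under LEGACY could mis-classify timestamp entries as dates.
object InferDateOption {
  sealed trait TimeParserPolicy
  case object Legacy extends TimeParserPolicy
  case object Corrected extends TimeParserPolicy

  // Returns the effective inferDate flag, rejecting the unsupported combination.
  def resolveInferDate(params: Map[String, String], policy: TimeParserPolicy): Boolean = {
    val flag = params.get("inferDate").exists(_.toBoolean)
    if (flag && policy == Legacy) {
      throw new IllegalArgumentException(
        "inferDate=true is not supported with legacyTimeParserPolicy=LEGACY")
    }
    flag
  }
}
```

The option defaults to false when absent, matching the "disabled by default" behavior described in the doc comment.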
[GitHub] [spark] Jonathancui123 commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference
Jonathancui123 commented on code in PR #36871: URL: https://github.com/apache/spark/pull/36871#discussion_r924026412 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala: ## @@ -148,7 +148,28 @@ class CSVOptions( // A language tag in IETF BCP 47 format val locale: Locale = parameters.get("locale").map(Locale.forLanguageTag).getOrElse(Locale.US) - val dateFormatInRead: Option[String] = parameters.get("dateFormat") + /** + * Infer columns with all valid date entries as date type (otherwise inferred as timestamp type). + * Disabled by default for backwards compatibility and performance. When enabled, date entries in + * timestamp columns will be cast to timestamp upon parsing. Not compatible with + * legacyTimeParserPolicy == LEGACY since legacy date parser will accept extra trailing characters + */ + val inferDate = { +val inferDateFlag = getBool("inferDate") +if (SQLConf.get.legacyTimeParserPolicy == LegacyBehaviorPolicy.LEGACY && inferDateFlag) { Review Comment: @HyukjinKwon Do you have a suggestion for whether we can merge with this latest change or if we should roll it back?
[GitHub] [spark] Jonathancui123 commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference
Jonathancui123 commented on code in PR #36871: URL: https://github.com/apache/spark/pull/36871#discussion_r919291362 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala: ## @@ -148,7 +148,28 @@ class CSVOptions( // A language tag in IETF BCP 47 format val locale: Locale = parameters.get("locale").map(Locale.forLanguageTag).getOrElse(Locale.US) - val dateFormatInRead: Option[String] = parameters.get("dateFormat") + /** + * Infer columns with all valid date entries as date type (otherwise inferred as timestamp type). + * Disabled by default for backwards compatibility and performance. When enabled, date entries in + * timestamp columns will be cast to timestamp upon parsing. Not compatible with + * legacyTimeParserPolicy == LEGACY since legacy date parser will accept extra trailing characters + */ + val inferDate = { +val inferDateFlag = getBool("inferDate") +if (SQLConf.get.legacyTimeParserPolicy == LegacyBehaviorPolicy.LEGACY && inferDateFlag) { Review Comment: In the [most recent commit](https://github.com/apache/spark/pull/36871/commits/e1170d0ee2027d810d2b23243602de147748838b), I implemented the suggestion from @cloud-fan to always use the non-legacy parser for inference and to allow `inferDate=true` with `legacyTimeParserPolicy = LEGACY`. What do you think?
[GitHub] [spark] Jonathancui123 commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference
Jonathancui123 commented on code in PR #36871: URL: https://github.com/apache/spark/pull/36871#discussion_r918153782 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala: ## @@ -148,7 +148,28 @@ class CSVOptions( // A language tag in IETF BCP 47 format val locale: Locale = parameters.get("locale").map(Locale.forLanguageTag).getOrElse(Locale.US) - val dateFormatInRead: Option[String] = parameters.get("dateFormat") + /** + * Infer columns with all valid date entries as date type (otherwise inferred as timestamp type). + * Disabled by default for backwards compatibility and performance. When enabled, date entries in + * timestamp columns will be cast to timestamp upon parsing. Not compatible with + * legacyTimeParserPolicy == LEGACY since legacy date parser will accept extra trailing characters + */ + val inferDate = { +val inferDateFlag = getBool("inferDate") +if (SQLConf.get.legacyTimeParserPolicy == LegacyBehaviorPolicy.LEGACY && inferDateFlag) { Review Comment: Does the `Iso8601DateFormatter` properly parse legacy date types? I'm inclined to believe it doesn't, since `getFormatter` returns a special `LegacyDateFormatter` when `legacyTimeParserPolicy == LEGACY`: https://github.com/apache/spark/blob/9c5c21ccc9c7f41a204a738cd540c14793ffc8cb/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateFormatter.scala#L174-L188 cc @bersprockets @HyukjinKwon
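The trailing-characters concern behind this thread can be demonstrated without Spark: java.time's `DateTimeFormatter` rejects unparsed trailing text, while `java.text.SimpleDateFormat` (the kind of parser the legacy path wraps) happily parses a date prefix out of a timestamp string. This is only an assumed analogy to Spark's `Iso8601DateFormatter` vs. `LegacyDateFormatter`, not their actual code.

```scala
// Strict vs. lenient date parsing: why a lenient parser would classify
// timestamp strings as valid dates during inference.
import java.text.SimpleDateFormat
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import scala.util.Try

object ParserLeniency {
  // java.time: the whole input must be consumed, so trailing text fails.
  def strictParses(s: String): Boolean =
    Try(LocalDate.parse(s, DateTimeFormatter.ofPattern("yyyy-MM-dd"))).isSuccess

  // java.text: parse(String) stops at the first unmatched character and
  // silently ignores the rest of the input.
  def legacyParses(s: String): Boolean =
    Try(new SimpleDateFormat("yyyy-MM-dd").parse(s)).isSuccess
}
```

A timestamp string such as "2001-09-08 12:30:00" fails the strict parse but succeeds under the lenient one, which is exactly the mis-classification the strict matcher avoids.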
[GitHub] [spark] Jonathancui123 commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference
Jonathancui123 commented on code in PR #36871: URL: https://github.com/apache/spark/pull/36871#discussion_r918142012 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala: ## @@ -117,7 +123,10 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable { case LongType => tryParseLong(field) case _: DecimalType => tryParseDecimal(field) case DoubleType => tryParseDouble(field) +case DateType => tryParseDateTime(field) +case TimestampNTZType if options.inferDate => tryParseDateTime(field) Review Comment: Our expected behavior is that in a column with `TimestampType` entries followed by `DateType` entries, the column will be inferred as `TimestampType`. Here, `tryParseTimestampNTZ` and `tryParseTimestamp` would not be able to parse the `DateType` entries that show up later in the column, and the column would be promoted to string type. So we must use `tryParseDateTime`, which gives the date entries a chance to be parsed.
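The promotion behavior described above can be sketched with a toy inferencer. This is a minimal, self-contained illustration, not Spark's `CSVInferSchema`; the formats, type names, and `merge` rule are assumptions for the sketch.

```scala
// Toy per-field inference plus type merging: a column mixing timestamps and
// dates is promoted to "timestamp", not demoted to "string".
import java.time.{LocalDate, LocalDateTime}
import java.time.format.DateTimeFormatter
import scala.util.Try

object TypeMerge {
  private val dateFmt = DateTimeFormatter.ofPattern("yyyy-MM-dd")
  private val tsFmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm")

  // Infer the narrowest type a single field satisfies.
  def inferField(field: String): String =
    if (Try(LocalDate.parse(field, dateFmt)).isSuccess) "date"
    else if (Try(LocalDateTime.parse(field, tsFmt)).isSuccess) "timestamp"
    else "string"

  // Promote date -> timestamp when a column mixes both; anything else -> string.
  def merge(a: String, b: String): String = (a, b) match {
    case (x, y) if x == y => x
    case ("date", "timestamp") | ("timestamp", "date") => "timestamp"
    case _ => "string"
  }

  def inferColumn(values: Seq[String]): String =
    values.map(inferField).reduce(merge)
}
```

Without the date/timestamp promotion case in `merge`, the mixed column would fall through to "string", which is the regression the review comment is guarding against.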
[GitHub] [spark] Jonathancui123 commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference
Jonathancui123 commented on code in PR #36871: URL: https://github.com/apache/spark/pull/36871#discussion_r909084899 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala: ## @@ -197,34 +198,46 @@ class UnivocityParser( Decimal(decimalParser(datum), dt.precision, dt.scale) } -case _: TimestampType => (d: String) => +case _: DateType => (d: String) => nullSafeDatum(d, name, nullable, options) { datum => try { - timestampFormatter.parse(datum) + dateFormatter.parse(datum) } catch { case NonFatal(e) => // If fails to parse, then tries the way used in 2.0 and 1.x for backwards // compatibility. val str = DateTimeUtils.cleanLegacyTimestampStr(UTF8String.fromString(datum)) -DateTimeUtils.stringToTimestamp(str, options.zoneId).getOrElse(throw e) +DateTimeUtils.stringToDate(str).getOrElse(throw e) } } -case _: TimestampNTZType => (d: String) => - nullSafeDatum(d, name, nullable, options) { datum => -timestampNTZFormatter.parseWithoutTimeZone(datum, false) - } - -case _: DateType => (d: String) => +case _: TimestampType => (d: String) => nullSafeDatum(d, name, nullable, options) { datum => try { - dateFormatter.parse(datum) + timestampFormatter.parse(datum) } catch { case NonFatal(e) => // If fails to parse, then tries the way used in 2.0 and 1.x for backwards // compatibility. 
val str = DateTimeUtils.cleanLegacyTimestampStr(UTF8String.fromString(datum)) -DateTimeUtils.stringToDate(str).getOrElse(throw e) +DateTimeUtils.stringToTimestamp(str, options.zoneId).getOrElse { + // There may be date type entries in timestamp column due to schema inference + if (options.inferDate) { +daysToMicros(dateFormatter.parse(datum), options.zoneId) + } else { +throw(e) + } +} +} + } + +case _: TimestampNTZType => (d: String) => + nullSafeDatum(d, name, nullable, options) { datum => +try { + timestampNTZFormatter.parseWithoutTimeZone(datum, false) +} catch { + case NonFatal(e) if (options.inferDate) => +daysToMicros(dateFormatter.parse(datum), options.zoneId) Review Comment: Wow, great catch! I hadn't fully considered the effect of the user's timeZone on TimestampNTZ parsing. I've fixed the error and modified this test to use a PST user and check that the parsed date is converted to a timestamp in UTC. https://github.com/Jonathancui123/spark/blob/8df7ebbd6c1c0d5bb875ce554026f9b9aeae148e/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/csv/UnivocityParserSuite.scala#L368-L374
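The `daysToMicros(dateFormatter.parse(datum), options.zoneId)` fallback in the diff converts a parsed date to microseconds since the epoch at a given zone. A standalone java.time equivalent (an assumed re-implementation of the role `DateTimeUtils.daysToMicros` plays, not Spark's code) looks like this:

```scala
// Parse a bare date and convert midnight at the given zone to epoch micros,
// so date entries can live in a microsecond-based timestamp column.
import java.time.{LocalDate, ZoneId}
import java.time.format.DateTimeFormatter

object DateFallback {
  def dateToMicros(datum: String, pattern: String, zone: ZoneId): Long = {
    val day = LocalDate.parse(datum, DateTimeFormatter.ofPattern(pattern))
    val instant = day.atStartOfDay(zone).toInstant
    instant.getEpochSecond * 1000000L + instant.getNano / 1000L
  }
}
```

The zone matters: midnight in "-08:00" is eight hours later on the epoch timeline than midnight UTC, which is exactly the discrepancy the review thread caught for TimestampNTZ.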
[GitHub] [spark] Jonathancui123 commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference
Jonathancui123 commented on code in PR #36871: URL: https://github.com/apache/spark/pull/36871#discussion_r909085510 ## sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/csv/UnivocityParserSuite.scala: ## @@ -358,4 +359,26 @@ class UnivocityParserSuite extends SparkFunSuite with SQLHelper { Map("timestampFormat" -> "invalid", "dateFormat" -> "invalid"), false, "UTC") check(new UnivocityParser(StructType(Seq.empty), optionsWithPattern)) } + + test("SPARK-39469: dates should be parsed correctly in a timestamp column when inferDate=true") { +def checkDate(dataType: DataType): Unit = { + val timestampsOptions = +new CSVOptions(Map("inferDate" -> "true", "timestampFormat" -> "dd/MM/ HH:mm", + "timestampNTZFormat" -> "dd-MM- HH:mm", "dateFormat" -> "dd_MM_"), + false, DateTimeUtils.getZoneId("-08:00").toString) + // Use CSVOption ZoneId="-08:00" (PST) to test that Dates in TimestampNTZ column are always + // converted to their equivalent UTC timestamp + val dateString = "08_09_2001" + val expected = dataType match { +case TimestampType => date(2001, 9, 8, 0, 0, 0, 0, ZoneOffset.of("-08:00")) +case TimestampNTZType => date(2001, 9, 8, 0, 0, 0, 0, ZoneOffset.UTC) Review Comment: > I think zoneId should probably be UTC for timestamp_ntz. Otherwise, you end up with oddities like this... @bersprockets I've modified this test to have the user in PST and check that the parsed date is converted to a timestamp in UTC. This checks for the error you caught in your [previous comment](https://github.com/apache/spark/pull/36871#discussion_r906513077). Thanks! I think the PR should be ready to merge. Please let me know if there's anything else we need to fix. I'll keep an eye on the GitHub Actions results.
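The distinction this test pins down can be shown in a few lines of plain java.time (a sketch of the expected values, not the suite's actual helpers): for a zoned timestamp column, a date is converted at the session zone, while for TIMESTAMP_NTZ the date must always be converted as if in UTC, so the session zone ("-08:00" in the test) must not leak into the NTZ result.

```scala
// Zoned vs. NTZ conversion of the same date: the session zone shifts the
// zoned result by its offset, while the NTZ result is fixed at UTC.
import java.time.{LocalDate, ZoneId, ZoneOffset}

object NtzConversion {
  def toMicros(day: LocalDate, zone: ZoneId): Long = {
    val instant = day.atStartOfDay(zone).toInstant
    instant.getEpochSecond * 1000000L + instant.getNano / 1000L
  }

  val day = LocalDate.of(2001, 9, 8)
  val sessionZone = ZoneId.of("-08:00")
  val zonedMicros = toMicros(day, sessionZone)   // TimestampType: session zone
  val ntzMicros = toMicros(day, ZoneOffset.UTC)  // TimestampNTZType: always UTC
}
```

If the NTZ branch used the session zone instead of UTC, the two results would coincide and the test's PST/UTC split would not catch the bug.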
[GitHub] [spark] Jonathancui123 commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference
Jonathancui123 commented on code in PR #36871: URL: https://github.com/apache/spark/pull/36871#discussion_r909035305 ## docs/sql-data-sources-csv.md: ## @@ -108,6 +108,12 @@ Data source options of CSV can be set via: Infers the input schema automatically from data. It requires one extra pass over the data. CSV built-in functions ignore this option. read + +inferDate +false +Whether or not to infer columns that satisfy the dateFormat option as Date. Requires inferSchema to be true. When false, columns with dates will be inferred as String (or as Timestamp if it fits the timestampFormat). Legacy date formats in Timestamp columns cannot be parsed with this option. Review Comment: This was in reference to the fact that only the non-legacy dateParser is used for parsing in timestamp columns, as explained in [the first comment of this thread](https://github.com/apache/spark/pull/36871/files#r903187641). I'll remove it; this sentence no longer makes sense since we don't support date inference for legacy date formats.
[GitHub] [spark] Jonathancui123 commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference
Jonathancui123 commented on code in PR #36871: URL: https://github.com/apache/spark/pull/36871#discussion_r908942583 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala: ## @@ -197,34 +199,50 @@ class UnivocityParser( Decimal(decimalParser(datum), dt.precision, dt.scale) } -case _: TimestampType => (d: String) => +case _: DateType => (d: String) => nullSafeDatum(d, name, nullable, options) { datum => try { - timestampFormatter.parse(datum) + dateFormatter.parse(datum) } catch { case NonFatal(e) => // If fails to parse, then tries the way used in 2.0 and 1.x for backwards // compatibility. val str = DateTimeUtils.cleanLegacyTimestampStr(UTF8String.fromString(datum)) -DateTimeUtils.stringToTimestamp(str, options.zoneId).getOrElse(throw e) +DateTimeUtils.stringToDate(str).getOrElse(throw e) } } -case _: TimestampNTZType => (d: String) => - nullSafeDatum(d, name, nullable, options) { datum => -timestampNTZFormatter.parseWithoutTimeZone(datum, false) - } - -case _: DateType => (d: String) => +case _: TimestampType => (d: String) => nullSafeDatum(d, name, nullable, options) { datum => try { - dateFormatter.parse(datum) + timestampFormatter.parse(datum) } catch { case NonFatal(e) => // If fails to parse, then tries the way used in 2.0 and 1.x for backwards // compatibility. val str = DateTimeUtils.cleanLegacyTimestampStr(UTF8String.fromString(datum)) -DateTimeUtils.stringToDate(str).getOrElse(throw e) +DateTimeUtils.stringToTimestamp(str, options.zoneId).getOrElse { + // There may be date type entries in timestamp column due to schema inference + if (options.inferDate) { +daysToMicros(dateFormatter.parse(datum), options.zoneId) Review Comment: Thanks Bruce! This is great context! This will definitely be necessary if we want to support inference along with legacy date formats. 
Users on legacy dates will be unaffected by this change - how about we open another ticket for date inference with legacy formats if the demand exists (and merge this PR without legacy date inference support)?
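The `daysToMicros(dateFormatter.parse(datum), options.zoneId)` fallback in the diff above turns a bare date found in a timestamp column into microseconds since the epoch at midnight in the session time zone. A rough standalone equivalent (hypothetical helper name, not Spark's `DateTimeUtils`):

```java
import java.time.LocalDate;
import java.time.ZoneId;

public class DateToMicros {
    // A date entry in a TimestampType column becomes midnight in the given
    // zone, expressed as microseconds since the epoch (midnight has zero
    // sub-second part, so seconds * 1_000_000 is exact).
    static long dateToMicros(LocalDate date, ZoneId zone) {
        return date.atStartOfDay(zone).toInstant().getEpochSecond() * 1_000_000L;
    }

    public static void main(String[] args) {
        // One day after the epoch, in UTC.
        System.out.println(dateToMicros(LocalDate.of(1970, 1, 2), ZoneId.of("UTC")));
    }
}
```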
Jonathancui123 commented on code in PR #36871: URL: https://github.com/apache/spark/pull/36871#discussion_r906439733 ## core/src/main/resources/error/error-classes.json: ## @@ -23,6 +23,12 @@ ], "sqlState" : "22005" }, + "CANNOT_INFER_DATE" : { +"message" : [ + "Cannot infer date in schema inference when LegacyTimeParserPolicy is \"LEGACY\". Legacy Date formatter does not support strict date format matching which is required to avoid inferring timestamps and other non-date entries to date." +], +"sqlState" : "22005" Review Comment: Changed to 22007 for date format error
Jonathancui123 commented on code in PR #36871: URL: https://github.com/apache/spark/pull/36871#discussion_r906233209 ## core/src/main/resources/error/error-classes.json: ## @@ -23,6 +23,12 @@ ], "sqlState" : "22005" }, + "CANNOT_INFER_DATE" : { +"message" : [ + "Cannot infer date in schema inference when LegacyTimeParserPolicy is \"LEGACY\". Legacy Date formatter does not support strict date format matching which is required to avoid inferring timestamps and other non-date entries to date." +], +"sqlState" : "22005" Review Comment: Is this the right sqlState code? Tests would not pass until I added a sqlState code
Jonathancui123 commented on code in PR #36871: URL: https://github.com/apache/spark/pull/36871#discussion_r906232351 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala: ## @@ -197,34 +199,50 @@ class UnivocityParser( Decimal(decimalParser(datum), dt.precision, dt.scale) } -case _: TimestampType => (d: String) => +case _: DateType => (d: String) => nullSafeDatum(d, name, nullable, options) { datum => try { - timestampFormatter.parse(datum) + dateFormatter.parse(datum) } catch { case NonFatal(e) => // If fails to parse, then tries the way used in 2.0 and 1.x for backwards // compatibility. val str = DateTimeUtils.cleanLegacyTimestampStr(UTF8String.fromString(datum)) -DateTimeUtils.stringToTimestamp(str, options.zoneId).getOrElse(throw e) +DateTimeUtils.stringToDate(str).getOrElse(throw e) } } -case _: TimestampNTZType => (d: String) => - nullSafeDatum(d, name, nullable, options) { datum => -timestampNTZFormatter.parseWithoutTimeZone(datum, false) - } - -case _: DateType => (d: String) => +case _: TimestampType => (d: String) => nullSafeDatum(d, name, nullable, options) { datum => try { - dateFormatter.parse(datum) + timestampFormatter.parse(datum) } catch { case NonFatal(e) => // If fails to parse, then tries the way used in 2.0 and 1.x for backwards // compatibility. val str = DateTimeUtils.cleanLegacyTimestampStr(UTF8String.fromString(datum)) -DateTimeUtils.stringToDate(str).getOrElse(throw e) +DateTimeUtils.stringToTimestamp(str, options.zoneId).getOrElse { + // There may be date type entries in timestamp column due to schema inference + if (options.inferDate) { +daysToMicros(dateFormatter.parse(datum), options.zoneId) Review Comment: @bersprockets Thanks for the suggestion! Do you know what is the advantage of allowing Legacy Formatter? i.e. what is a date format that the legacy formatter can handle but the current formatter cannot? 
I'm wondering if there will be a sufficient population of users who want to infer dates in the schema and also use legacy date formats. cc: @Yaohua628
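One example of a string the legacy formatter accepts but the non-legacy one rejects: `java.text.SimpleDateFormat` (the legacy path) is lenient and accepts single-digit month/day fields against `yyyy-MM-dd`, while a strict `java.time` parse does not. A quick standalone check, illustrative only:

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

public class LenientVsStrict {
    // Legacy path: lenient numeric parsing, "2001-9-8" satisfies "yyyy-MM-dd".
    static boolean legacyParses(String s) {
        try {
            new SimpleDateFormat("yyyy-MM-dd").parse(s);
            return true;
        } catch (ParseException e) {
            return false;
        }
    }

    // Non-legacy path: MM and dd require two digits, so "2001-9-8" is rejected.
    static boolean strictParses(String s) {
        try {
            LocalDate.parse(s, DateTimeFormatter.ofPattern("yyyy-MM-dd"));
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(legacyParses("2001-9-8"));
        System.out.println(strictParses("2001-9-8"));
    }
}
```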
Jonathancui123 commented on code in PR #36871: URL: https://github.com/apache/spark/pull/36871#discussion_r905350901 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala: ## @@ -117,8 +123,19 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable { case LongType => tryParseLong(field) case _: DecimalType => tryParseDecimal(field) case DoubleType => tryParseDouble(field) -case TimestampNTZType => tryParseTimestampNTZ(field) -case TimestampType => tryParseTimestamp(field) +case DateType => tryParseDateTime(field) Review Comment: I think this change is necessary: consider a column with a TimestampType entry followed by a DateType entry. We would expect this column to be inferred as a TimestampType column. `typeSoFar` will be `Timestamp` when `inferField` is called on the second entry, which is `DateType`. We need logic in `inferField` to try and parse `DateType` even when `typeSoFar` is `Timestamp`.
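The widening rule being argued for reduces to a small type lattice: Date and Timestamp have Timestamp as their least common type, so seeing a date after a timestamp (or vice versa) must not degrade the column to String. A minimal sketch with hypothetical types, not Spark's actual `compatibleType`:

```java
public class TypeWidening {
    enum T { NULL, DATE, TIMESTAMP, STRING }

    // Reduce the type seen so far with the type of the current field.
    // Once a column has seen a Timestamp, a later Date entry still resolves
    // to Timestamp - which is why inferField must attempt date parsing even
    // when typeSoFar is already Timestamp.
    static T compatible(T soFar, T field) {
        if (soFar == field || soFar == T.NULL) return field;
        if (field == T.NULL) return soFar;
        if ((soFar == T.DATE && field == T.TIMESTAMP)
                || (soFar == T.TIMESTAMP && field == T.DATE)) {
            return T.TIMESTAMP;  // Date widens to Timestamp, not to String
        }
        return T.STRING;  // no common type: fall back to String
    }

    public static void main(String[] args) {
        System.out.println(compatible(T.TIMESTAMP, T.DATE));  // TIMESTAMP
        System.out.println(compatible(T.DATE, T.STRING));     // STRING
    }
}
```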
Jonathancui123 commented on code in PR #36871: URL: https://github.com/apache/spark/pull/36871#discussion_r904402104 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala: ## @@ -110,15 +116,43 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable { def inferField(typeSoFar: DataType, field: String): DataType = { if (field == null || field.isEmpty || field == options.nullValue) { typeSoFar +} else +if (options.inferDate) { + val typeElemInfer = typeSoFar match { +case NullType => tryParseInteger(field) +case IntegerType => tryParseInteger(field) +case LongType => tryParseLong(field) +case _: DecimalType => tryParseDecimal(field) +case DoubleType => tryParseDouble(field) +case DateType => tryParseDateTime(field) +case TimestampNTZType => tryParseDateTime(field) +case TimestampType => tryParseDateTime(field) +case BooleanType => tryParseBoolean(field) +case StringType => StringType +case other: DataType => + throw QueryExecutionErrors.dataTypeUnexpectedError(other) + } + compatibleType(typeSoFar, typeElemInfer).getOrElse(StringType) } else { val typeElemInfer = typeSoFar match { case NullType => tryParseInteger(field) case IntegerType => tryParseInteger(field) case LongType => tryParseLong(field) case _: DecimalType => tryParseDecimal(field) case DoubleType => tryParseDouble(field) -case TimestampNTZType => tryParseTimestampNTZ(field) -case TimestampType => tryParseTimestamp(field) +case DateType => tryParseDateTime(field) +case TimestampNTZType => + if (options.inferDate) { Review Comment: My bad, I meant to remove the first match expression but missed it during code cleanup
Jonathancui123 commented on code in PR #36871: URL: https://github.com/apache/spark/pull/36871#discussion_r903186613 ## sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala: ## @@ -2753,6 +2754,35 @@ abstract class CSVSuite } } } + + test("SPARK-39469: Infer schema for date type") { +val options = Map( + "header" -> "true", + "inferSchema" -> "true", + "timestampFormat" -> "yyyy-MM-dd'T'HH:mm", + "dateFormat" -> "yyyy-MM-dd", + "inferDate" -> "true") + +val results = spark.read + .format("csv") + .options(options) + .load(testFile(dateInferSchemaFile)) + +val expectedSchema = StructType(List(StructField("date", DateType), + StructField("timestamp-date", TimestampType), StructField("date-timestamp", TimestampType))) +assert(results.schema == expectedSchema) + +val expected = + Seq( +Seq(Date.valueOf("2001-9-8"), Timestamp.valueOf("2014-10-27 18:30:0.0"), + Timestamp.valueOf("1765-03-28 00:00:0.0")), +Seq(Date.valueOf("1941-1-2"), Timestamp.valueOf("2000-09-14 01:01:0.0"), + Timestamp.valueOf("1423-11-12 23:41:0.0")), +Seq(Date.valueOf("0293-11-7"), Timestamp.valueOf("1995-06-25 00:00:00.0"), + Timestamp.valueOf("2016-01-28 20:00:00.0")) + ) +assert(results.collect().toSeq.map(_.toSeq) == expected) + } Review Comment: > One test we might need would be timestampFormat" -> "dd/MM/yyyy HH:mm and dateFormat -> dd/MM/yyyy to make sure timestamps are not parsed as date types without conflicting. This test uses: ``` "timestampFormat" -> "yyyy-MM-dd'T'HH:mm", "dateFormat" -> "yyyy-MM-dd", ``` This e2e test ensures that our DateFormatter is using strict parsing. We will not infer Timestamp columns as Date columns if the `DateFormat` is a prefix of the `TimestampFormat`. Thank you for the review! @HyukjinKwon @bersprockets
Jonathancui123 commented on code in PR #36871: URL: https://github.com/apache/spark/pull/36871#discussion_r903189990 ## sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/csv/UnivocityParserSuite.scala: ## @@ -358,4 +358,19 @@ class UnivocityParserSuite extends SparkFunSuite with SQLHelper { Map("timestampFormat" -> "invalid", "dateFormat" -> "invalid"), false, "UTC") check(new UnivocityParser(StructType(Seq.empty), optionsWithPattern)) } + + test("dates should be parsed correctly in a timestamp column") { +def checkDate(dataType: DataType): Unit = { + val timestampsOptions = +new CSVOptions(Map("timestampFormat" -> "dd/MM/yyyy HH:mm", + "timestampNTZFormat" -> "dd-MM-yyyy HH:mm", "dateFormat" -> "dd_MM_yyyy"), Review Comment: I addressed inference mistakes in [the following code snippet and comment](https://github.com/apache/spark/pull/36871#discussion_r903186613)
Jonathancui123 commented on code in PR #36871: URL: https://github.com/apache/spark/pull/36871#discussion_r903187641 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala: ## @@ -197,34 +199,50 @@ class UnivocityParser( Decimal(decimalParser(datum), dt.precision, dt.scale) } -case _: TimestampType => (d: String) => +case _: DateType => (d: String) => nullSafeDatum(d, name, nullable, options) { datum => try { - timestampFormatter.parse(datum) + dateFormatter.parse(datum) } catch { case NonFatal(e) => // If fails to parse, then tries the way used in 2.0 and 1.x for backwards // compatibility. val str = DateTimeUtils.cleanLegacyTimestampStr(UTF8String.fromString(datum)) -DateTimeUtils.stringToTimestamp(str, options.zoneId).getOrElse(throw e) +DateTimeUtils.stringToDate(str).getOrElse(throw e) } } -case _: TimestampNTZType => (d: String) => - nullSafeDatum(d, name, nullable, options) { datum => -timestampNTZFormatter.parseWithoutTimeZone(datum, false) - } - -case _: DateType => (d: String) => +case _: TimestampType => (d: String) => nullSafeDatum(d, name, nullable, options) { datum => try { - dateFormatter.parse(datum) + timestampFormatter.parse(datum) } catch { case NonFatal(e) => // If fails to parse, then tries the way used in 2.0 and 1.x for backwards // compatibility. val str = DateTimeUtils.cleanLegacyTimestampStr(UTF8String.fromString(datum)) -DateTimeUtils.stringToDate(str).getOrElse(throw e) +DateTimeUtils.stringToTimestamp(str, options.zoneId).getOrElse { + // There may be date type entries in timestamp column due to schema inference + if (options.inferDate) { +daysToMicros(dateFormatter.parse(datum), options.zoneId) Review Comment: We do not use the legacy DateFormatter here to avoid parsing timestamps with invalid suffixes. We want to throw an error when invalid timestamps are given, e.g. the legacy DateFormatter will parse the following string without throwing an error: dateFormat: `yyyy-mm-dd`, string: `2001-09-08-randomtext`
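The trailing-characters behavior described above is easy to reproduce outside Spark: `java.text.SimpleDateFormat#parse(String)` stops at the first unmatched character, whereas a strict `java.time` parse must consume the whole input. A standalone illustration (not Spark's formatter classes):

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

public class TrailingText {
    // Legacy behavior: parse(String) matches a valid prefix and silently
    // ignores whatever follows it.
    static boolean legacyParses(String s) {
        try {
            new SimpleDateFormat("yyyy-MM-dd").parse(s);
            return true;
        } catch (ParseException e) {
            return false;
        }
    }

    // Strict behavior: the entire input must match the pattern.
    static boolean strictParses(String s) {
        try {
            LocalDate.parse(s, DateTimeFormatter.ofPattern("yyyy-MM-dd"));
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String datum = "2001-09-08-randomtext";
        System.out.println(legacyParses(datum));
        System.out.println(strictParses(datum));
    }
}
```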
Jonathancui123 commented on code in PR #36871: URL: https://github.com/apache/spark/pull/36871#discussion_r903184632 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala: ## @@ -206,28 +218,27 @@ class UnivocityParser( // If fails to parse, then tries the way used in 2.0 and 1.x for backwards // compatibility. val str = DateTimeUtils.cleanLegacyTimestampStr(UTF8String.fromString(datum)) -DateTimeUtils.stringToTimestamp(str, options.zoneId).getOrElse(throw e) +DateTimeUtils.stringToTimestamp(str, options.zoneId).getOrElse { + // There may be date type entries in timestamp column due to schema inference + convertToDate(datum) Review Comment: Consider a column with a `DateType` entry followed by a `TimestampType` entry. We would expect this column to be inferred as a `TimestampType` column. Thus, when parsing the column, the timestamp converter will fail on the Date entry, so we will need to try and convert it with the Date converter. If both converters fail, then we will throw an error.
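The converter fallback described above - try the timestamp format first, and only on failure re-try the entry as a date - can be sketched as follows (hypothetical helper, ignoring Spark's legacy-string fallback and time-zone handling):

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

public class TimestampColumnFallback {
    static final DateTimeFormatter TS = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm");
    static final DateTimeFormatter DATE = DateTimeFormatter.ofPattern("yyyy-MM-dd");

    // Converter for a column inferred as TimestampType: try the timestamp
    // format; on failure, re-try the entry as a date and promote it to
    // midnight. If both parses fail, the exception from the date attempt
    // propagates, analogous to the real converter throwing an error.
    static LocalDateTime parseTimestampOrDate(String datum) {
        try {
            return LocalDateTime.parse(datum, TS);
        } catch (DateTimeParseException e) {
            return LocalDate.parse(datum, DATE).atStartOfDay();
        }
    }

    public static void main(String[] args) {
        System.out.println(parseTimestampOrDate("2014-10-27T18:30"));  // timestamp entry
        System.out.println(parseTimestampOrDate("2001-09-08"));        // date entry in the same column
    }
}
```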
Jonathancui123 commented on code in PR #36871: URL: https://github.com/apache/spark/pull/36871#discussion_r898694871 ## sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/csv/UnivocityParserSuite.scala: ## @@ -358,4 +358,19 @@ class UnivocityParserSuite extends SparkFunSuite with SQLHelper { Map("timestampFormat" -> "invalid", "dateFormat" -> "invalid"), false, "UTC") check(new UnivocityParser(StructType(Seq.empty), optionsWithPattern)) } + + test("dates should be parsed correctly in a timestamp column") { Review Comment: Got it! I was under the impression that was only for bug fixes.