[GitHub] [spark] Jonathancui123 commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

2022-07-19 Thread GitBox


Jonathancui123 commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r924930337


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala:
##
@@ -148,7 +148,28 @@ class CSVOptions(
   // A language tag in IETF BCP 47 format
   val locale: Locale = 
parameters.get("locale").map(Locale.forLanguageTag).getOrElse(Locale.US)
 
-  val dateFormatInRead: Option[String] = parameters.get("dateFormat")
+  /**
+   * Infer columns with all valid date entries as date type (otherwise 
inferred as timestamp type).
+   * Disabled by default for backwards compatibility and performance. When 
enabled, date entries in
+   * timestamp columns will be cast to timestamp upon parsing. Not compatible 
with
+   * legacyTimeParserPolicy == LEGACY since legacy date parser will accept 
extra trailing characters
+   */
+  val inferDate = {
+val inferDateFlag = getBool("inferDate")
+if (SQLConf.get.legacyTimeParserPolicy == LegacyBehaviorPolicy.LEGACY && 
inferDateFlag) {

Review Comment:
   All tests have passed after the rollback - I think this is ready to merge



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Jonathancui123 commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

2022-07-18 Thread GitBox


Jonathancui123 commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r924026412


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala:
##
@@ -148,7 +148,28 @@ class CSVOptions(
   // A language tag in IETF BCP 47 format
   val locale: Locale = 
parameters.get("locale").map(Locale.forLanguageTag).getOrElse(Locale.US)
 
-  val dateFormatInRead: Option[String] = parameters.get("dateFormat")
+  /**
+   * Infer columns with all valid date entries as date type (otherwise 
inferred as timestamp type).
+   * Disabled by default for backwards compatibility and performance. When 
enabled, date entries in
+   * timestamp columns will be cast to timestamp upon parsing. Not compatible 
with
+   * legacyTimeParserPolicy == LEGACY since legacy date parser will accept 
extra trailing characters
+   */
+  val inferDate = {
+val inferDateFlag = getBool("inferDate")
+if (SQLConf.get.legacyTimeParserPolicy == LegacyBehaviorPolicy.LEGACY && 
inferDateFlag) {

Review Comment:
   @HyukjinKwon Do you have a suggestion for whether we can merge with this 
latest change or if we should roll it back?







[GitHub] [spark] Jonathancui123 commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

2022-07-12 Thread GitBox


Jonathancui123 commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r919291362


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala:
##
@@ -148,7 +148,28 @@ class CSVOptions(
   // A language tag in IETF BCP 47 format
   val locale: Locale = 
parameters.get("locale").map(Locale.forLanguageTag).getOrElse(Locale.US)
 
-  val dateFormatInRead: Option[String] = parameters.get("dateFormat")
+  /**
+   * Infer columns with all valid date entries as date type (otherwise 
inferred as timestamp type).
+   * Disabled by default for backwards compatibility and performance. When 
enabled, date entries in
+   * timestamp columns will be cast to timestamp upon parsing. Not compatible 
with
+   * legacyTimeParserPolicy == LEGACY since legacy date parser will accept 
extra trailing characters
+   */
+  val inferDate = {
+val inferDateFlag = getBool("inferDate")
+if (SQLConf.get.legacyTimeParserPolicy == LegacyBehaviorPolicy.LEGACY && 
inferDateFlag) {

Review Comment:
   In the [most recent commit](https://github.com/apache/spark/pull/36871/commits/e1170d0ee2027d810d2b23243602de147748838b), I implemented the suggestion from @cloud-fan to always use the non-legacy parser for inference, allowing `inferDate=true` with `legacyTimeParserPolicy = LEGACY`. What do you think?






[GitHub] [spark] Jonathancui123 commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

2022-07-11 Thread GitBox


Jonathancui123 commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r918153782


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala:
##
@@ -148,7 +148,28 @@ class CSVOptions(
   // A language tag in IETF BCP 47 format
   val locale: Locale = 
parameters.get("locale").map(Locale.forLanguageTag).getOrElse(Locale.US)
 
-  val dateFormatInRead: Option[String] = parameters.get("dateFormat")
+  /**
+   * Infer columns with all valid date entries as date type (otherwise 
inferred as timestamp type).
+   * Disabled by default for backwards compatibility and performance. When 
enabled, date entries in
+   * timestamp columns will be cast to timestamp upon parsing. Not compatible 
with
+   * legacyTimeParserPolicy == LEGACY since legacy date parser will accept 
extra trailing characters
+   */
+  val inferDate = {
+val inferDateFlag = getBool("inferDate")
+if (SQLConf.get.legacyTimeParserPolicy == LegacyBehaviorPolicy.LEGACY && 
inferDateFlag) {

Review Comment:
   Does the `Iso8601DateFormatter` properly parse legacy date formats? I'm inclined to believe it doesn't, since `getFormatter` returns a special `LegacyDateFormatter` when `legacyTimeParserPolicy == LEGACY`:
   
   
https://github.com/apache/spark/blob/9c5c21ccc9c7f41a204a738cd540c14793ffc8cb/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateFormatter.scala#L174-L188
   
   cc @bersprockets @HyukjinKwon 
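   The trailing-character leniency that makes the legacy parser incompatible with date inference is easy to reproduce with the underlying JDK parsers. This is only a behavioral sketch: Spark's `LegacyDateFormatter` and `Iso8601DateFormatter` wrap roughly these APIs, but with more machinery around patterns and locales.

```scala
import java.text.SimpleDateFormat
import java.time.LocalDate
import java.time.format.{DateTimeFormatter, DateTimeParseException}

object LenientVsStrict {
  def main(args: Array[String]): Unit = {
    // Legacy-style parsing (java.text): a valid prefix is accepted and the
    // trailing characters are silently ignored, so a timestamp string
    // "looks like" a date to the legacy parser.
    val legacy = new SimpleDateFormat("yyyy-MM-dd")
    println(legacy.parse("2001-09-08T12:34:56")) // parses despite the suffix

    // Non-legacy parsing (java.time): any unconsumed input fails the parse.
    val strict = DateTimeFormatter.ofPattern("yyyy-MM-dd")
    val accepted =
      try { LocalDate.parse("2001-09-08T12:34:56", strict); true }
      catch { case _: DateTimeParseException => false }
    println(s"strict parser accepted: $accepted") // false
  }
}
```

   This is why date inference can misclassify timestamp entries as dates under the lenient legacy parser, but not under the strict one.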






[GitHub] [spark] Jonathancui123 commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

2022-07-11 Thread GitBox


Jonathancui123 commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r918142012


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala:
##
@@ -117,7 +123,10 @@ class CSVInferSchema(val options: CSVOptions) extends 
Serializable {
 case LongType => tryParseLong(field)
 case _: DecimalType => tryParseDecimal(field)
 case DoubleType => tryParseDouble(field)
+case DateType => tryParseDateTime(field)
+case TimestampNTZType if options.inferDate => tryParseDateTime(field)

Review Comment:
   Our expected behavior is that in a column with `TimestampType` entries and 
then `DateType` entries, the column will be inferred as `TimestampType`. 
   
   Here, `tryParseTimestampNTZ` and `tryParseTimestamp` will not be able to parse the `DateType` entries that appear later in the column, so the column would be promoted to string type. We must therefore use `tryParseDateTime`, which gives those date entries a chance to be parsed.
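   The promotion behavior described here can be sketched outside Spark with a toy version of the inference step. The `Inferred` type, the hard-coded patterns, and `merge` below are illustrative stand-ins, not Spark's actual `CSVInferSchema` code; only the name `tryParseDateTime` follows the PR.

```scala
import java.time.{LocalDate, LocalDateTime}
import java.time.format.DateTimeFormatter
import scala.util.Try

sealed trait Inferred
case object DateT extends Inferred
case object TimestampT extends Inferred
case object StringT extends Inferred

object InferSketch {
  private val dateFmt = DateTimeFormatter.ofPattern("yyyy-MM-dd")
  private val tsFmt   = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm")

  // Like tryParseDateTime: give a field in a timestamp-typed column a chance
  // to parse as a date first, then as a timestamp, before giving up.
  def tryParseDateTime(field: String): Inferred =
    if (Try(LocalDate.parse(field, dateFmt)).isSuccess) DateT
    else if (Try(LocalDateTime.parse(field, tsFmt)).isSuccess) TimestampT
    else StringT

  // Column-level merge: mixing dates and timestamps widens to timestamp;
  // anything irreconcilable widens to string.
  def merge(a: Inferred, b: Inferred): Inferred = (a, b) match {
    case (x, y) if x == y => x
    case (DateT, TimestampT) | (TimestampT, DateT) => TimestampT
    case _ => StringT
  }
}
```

   With these definitions, a column containing `"2001-09-08 12:00"` followed by `"2001-09-08"` merges to `TimestampT`, whereas a timestamp-only parser would have demoted it to `StringT`.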






[GitHub] [spark] Jonathancui123 commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

2022-07-05 Thread GitBox


Jonathancui123 commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r909084899


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala:
##
@@ -197,34 +198,46 @@ class UnivocityParser(
 Decimal(decimalParser(datum), dt.precision, dt.scale)
   }
 
-case _: TimestampType => (d: String) =>
+case _: DateType => (d: String) =>
   nullSafeDatum(d, name, nullable, options) { datum =>
 try {
-  timestampFormatter.parse(datum)
+  dateFormatter.parse(datum)
 } catch {
   case NonFatal(e) =>
 // If fails to parse, then tries the way used in 2.0 and 1.x for 
backwards
 // compatibility.
 val str = 
DateTimeUtils.cleanLegacyTimestampStr(UTF8String.fromString(datum))
-DateTimeUtils.stringToTimestamp(str, 
options.zoneId).getOrElse(throw e)
+DateTimeUtils.stringToDate(str).getOrElse(throw e)
 }
   }
 
-case _: TimestampNTZType => (d: String) =>
-  nullSafeDatum(d, name, nullable, options) { datum =>
-timestampNTZFormatter.parseWithoutTimeZone(datum, false)
-  }
-
-case _: DateType => (d: String) =>
+case _: TimestampType => (d: String) =>
   nullSafeDatum(d, name, nullable, options) { datum =>
 try {
-  dateFormatter.parse(datum)
+  timestampFormatter.parse(datum)
 } catch {
   case NonFatal(e) =>
 // If fails to parse, then tries the way used in 2.0 and 1.x for 
backwards
 // compatibility.
 val str = 
DateTimeUtils.cleanLegacyTimestampStr(UTF8String.fromString(datum))
-DateTimeUtils.stringToDate(str).getOrElse(throw e)
+DateTimeUtils.stringToTimestamp(str, options.zoneId).getOrElse {
+  // There may be date type entries in timestamp column due to 
schema inference
+  if (options.inferDate) {
+daysToMicros(dateFormatter.parse(datum), options.zoneId)
+  } else {
+throw(e)
+  }
+}
+}
+  }
+
+case _: TimestampNTZType => (d: String) =>
+  nullSafeDatum(d, name, nullable, options) { datum =>
+try {
+  timestampNTZFormatter.parseWithoutTimeZone(datum, false)
+} catch {
+  case NonFatal(e) if (options.inferDate) =>
+daysToMicros(dateFormatter.parse(datum), options.zoneId)

Review Comment:
   Wow great catch! I hadn't fully considered the effect of user timeZone on 
TimestampNTZ parsing. 
   
   I've fixed the error and I've modified this test to have a PST user and 
check that the parsed date is converted to a timestamp in UTC. 
   
   
https://github.com/Jonathancui123/spark/blob/8df7ebbd6c1c0d5bb875ce554026f9b9aeae148e/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/csv/UnivocityParserSuite.scala#L368-L374
 







[GitHub] [spark] Jonathancui123 commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

2022-06-28 Thread GitBox


Jonathancui123 commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r909085510


##
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/csv/UnivocityParserSuite.scala:
##
@@ -358,4 +359,26 @@ class UnivocityParserSuite extends SparkFunSuite with 
SQLHelper {
   Map("timestampFormat" -> "invalid", "dateFormat" -> "invalid"), false, 
"UTC")
 check(new UnivocityParser(StructType(Seq.empty), optionsWithPattern))
   }
+
+  test("SPARK-39469: dates should be parsed correctly in a timestamp column 
when inferDate=true") {
+def checkDate(dataType: DataType): Unit = {
+  val timestampsOptions =
+new CSVOptions(Map("inferDate" -> "true", "timestampFormat" -> "dd/MM/yyyy HH:mm",
+  "timestampNTZFormat" -> "dd-MM-yyyy HH:mm", "dateFormat" -> "dd_MM_yyyy"),
+  false, DateTimeUtils.getZoneId("-08:00").toString)
+  // Use CSVOption ZoneId="-08:00" (PST) to test that Dates in 
TimestampNTZ column are always
+  // converted to their equivalent UTC timestamp
+  val dateString = "08_09_2001"
+  val expected = dataType match {
+case TimestampType => date(2001, 9, 8, 0, 0, 0, 0, 
ZoneOffset.of("-08:00"))
+case TimestampNTZType => date(2001, 9, 8, 0, 0, 0, 0, ZoneOffset.UTC)

Review Comment:
   > I think zoneId should probably be UTC for timestamp_ntz. Otherwise, you 
end up with oddities like this...
   
   @bersprockets I've modified this test to have the user in PST and check that 
the parsed date is converted to a timestamp in UTC. This checks for the error 
you caught in your [previous 
comment](https://github.com/apache/spark/pull/36871#discussion_r906513077). 
Thanks!
   
   I think the PR should be ready to merge. Please let me know if there's 
anything else we need to fix. I'll keep an eye on the Github Actions results.
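   The zone subtlety this test guards against can be reproduced with a small stand-in for `daysToMicros` (the real helper lives in Spark's `DateTimeUtils`; this version is a simplified sketch using only `java.time`):

```scala
import java.time.{Instant, LocalDate, ZoneId, ZoneOffset}
import java.time.temporal.ChronoUnit

object DaysToMicrosSketch {
  // Simplified stand-in for DateTimeUtils.daysToMicros: midnight of the
  // given date in the given zone, as microseconds since the Unix epoch.
  def daysToMicros(d: LocalDate, zone: ZoneId): Long =
    ChronoUnit.MICROS.between(Instant.EPOCH, d.atStartOfDay(zone).toInstant)

  def main(args: Array[String]): Unit = {
    val d = LocalDate.of(2001, 9, 8)
    val pst = daysToMicros(d, ZoneOffset.of("-08:00"))
    val utc = daysToMicros(d, ZoneOffset.UTC)
    // Midnight in -08:00 occurs 8 hours after midnight UTC, so converting a
    // date with the session zone instead of UTC shifts a TimestampNTZ value.
    println(pst - utc) // 28800000000 microseconds = 8 hours
  }
}
```

   Converting dates in a timestamp-with-zone column with the session zone is correct; for TimestampNTZ the conversion must use UTC or the wall-clock value shifts by the zone offset, which is exactly the oddity the test above checks for.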









[GitHub] [spark] Jonathancui123 commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

2022-06-28 Thread GitBox


Jonathancui123 commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r909035305


##
docs/sql-data-sources-csv.md:
##
@@ -108,6 +108,12 @@ Data source options of CSV can be set via:
 Infers the input schema automatically from data. It requires one extra 
pass over the data. CSV built-in functions ignore this option.
 read
   
+  
+inferDate 
+false
+Whether or not to infer columns that satisfy the 
dateFormat option as Date. Requires 
inferSchema to be true. When false, 
columns with dates will be inferred as String (or as 
Timestamp if it fits the timestampFormat) Legacy date 
formats in Timestamp columns cannot be parsed with this option.

Review Comment:
   This was in reference to the fact that only the non-legacy date parser is used for parsing in timestamp columns, as explained in [the first comment of this thread](https://github.com/apache/spark/pull/36871/files#r903187641).
   
   I'll remove it. This sentence no longer makes sense since we don't support 
date inference for legacy date formats. 







[GitHub] [spark] Jonathancui123 commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

2022-06-28 Thread GitBox


Jonathancui123 commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r908942583


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala:
##
@@ -197,34 +199,50 @@ class UnivocityParser(
 Decimal(decimalParser(datum), dt.precision, dt.scale)
   }
 
-case _: TimestampType => (d: String) =>
+case _: DateType => (d: String) =>
   nullSafeDatum(d, name, nullable, options) { datum =>
 try {
-  timestampFormatter.parse(datum)
+  dateFormatter.parse(datum)
 } catch {
   case NonFatal(e) =>
 // If fails to parse, then tries the way used in 2.0 and 1.x for 
backwards
 // compatibility.
 val str = 
DateTimeUtils.cleanLegacyTimestampStr(UTF8String.fromString(datum))
-DateTimeUtils.stringToTimestamp(str, 
options.zoneId).getOrElse(throw e)
+DateTimeUtils.stringToDate(str).getOrElse(throw e)
 }
   }
 
-case _: TimestampNTZType => (d: String) =>
-  nullSafeDatum(d, name, nullable, options) { datum =>
-timestampNTZFormatter.parseWithoutTimeZone(datum, false)
-  }
-
-case _: DateType => (d: String) =>
+case _: TimestampType => (d: String) =>
   nullSafeDatum(d, name, nullable, options) { datum =>
 try {
-  dateFormatter.parse(datum)
+  timestampFormatter.parse(datum)
 } catch {
   case NonFatal(e) =>
 // If fails to parse, then tries the way used in 2.0 and 1.x for 
backwards
 // compatibility.
 val str = 
DateTimeUtils.cleanLegacyTimestampStr(UTF8String.fromString(datum))
-DateTimeUtils.stringToDate(str).getOrElse(throw e)
+DateTimeUtils.stringToTimestamp(str, options.zoneId).getOrElse {
+  // There may be date type entries in timestamp column due to 
schema inference
+  if (options.inferDate) {
+daysToMicros(dateFormatter.parse(datum), options.zoneId)

Review Comment:
   Thanks Bruce! This is great context! This will definitely be necessary if we 
want to support inference along with legacy date formats. Users on legacy dates 
will be unaffected by this change - how about we open another ticket for 
date inference with legacy formats if the demand exists (and merge this PR 
without legacy date inference support)?
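For readers following along, the fallback being discussed can be sketched outside Spark. This is a hypothetical, simplified model (not Spark's `UnivocityParser` or its `daysToMicros` helper): when a value in a timestamp-typed column fails timestamp parsing, it is re-tried as a date and promoted to microseconds since the epoch. The `yyyy-MM-dd'T'HH:mm` and `yyyy-MM-dd` patterns are assumptions for illustration.

```scala
import java.time.{LocalDate, LocalDateTime, ZoneOffset}
import java.time.format.DateTimeFormatter
import scala.util.Try

object TimestampFallback {
  private val tsFmt   = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm")
  private val dateFmt = DateTimeFormatter.ofPattern("yyyy-MM-dd")

  private val MicrosPerDay = 24L * 60 * 60 * 1000 * 1000

  def parseTimestampMicros(datum: String): Long =
    // First try the timestamp pattern; on failure, fall back to the date
    // pattern and widen days-since-epoch to microseconds-since-epoch,
    // analogous to daysToMicros(dateFormatter.parse(datum), zoneId).
    Try(LocalDateTime.parse(datum, tsFmt).toEpochSecond(ZoneOffset.UTC) * 1000000L)
      .getOrElse(LocalDate.parse(datum, dateFmt).toEpochDay * MicrosPerDay)
}
```

If both parses fail, the `LocalDate.parse` call throws, which matches the comment's point that genuinely invalid entries should still raise an error.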






[GitHub] [spark] Jonathancui123 commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

2022-06-24 Thread GitBox


Jonathancui123 commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r906439733


##
core/src/main/resources/error/error-classes.json:
##
@@ -23,6 +23,12 @@
 ],
 "sqlState" : "22005"
   },
+  "CANNOT_INFER_DATE" : {
+"message" : [
+  "Cannot infer date in schema inference when LegacyTimeParserPolicy is 
\"LEGACY\". Legacy Date formatter does not support strict date format matching 
which is required to avoid inferring timestamps and other non-date entries to 
date."
+],
+"sqlState" : "22005"

Review Comment:
   Changed to 22007 for date format error






[GitHub] [spark] Jonathancui123 commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

2022-06-24 Thread GitBox


Jonathancui123 commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r906233209


##
core/src/main/resources/error/error-classes.json:
##
@@ -23,6 +23,12 @@
 ],
 "sqlState" : "22005"
   },
+  "CANNOT_INFER_DATE" : {
+"message" : [
+  "Cannot infer date in schema inference when LegacyTimeParserPolicy is 
\"LEGACY\". Legacy Date formatter does not support strict date format matching 
which is required to avoid inferring timestamps and other non-date entries to 
date."
+],
+"sqlState" : "22005"

Review Comment:
   Is this the right sqlState code? Tests would not pass until I added a 
sqlState code






[GitHub] [spark] Jonathancui123 commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

2022-06-24 Thread GitBox


Jonathancui123 commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r906232351


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala:
##
@@ -197,34 +199,50 @@ class UnivocityParser(
 Decimal(decimalParser(datum), dt.precision, dt.scale)
   }
 
-case _: TimestampType => (d: String) =>
+case _: DateType => (d: String) =>
   nullSafeDatum(d, name, nullable, options) { datum =>
 try {
-  timestampFormatter.parse(datum)
+  dateFormatter.parse(datum)
 } catch {
   case NonFatal(e) =>
 // If fails to parse, then tries the way used in 2.0 and 1.x for 
backwards
 // compatibility.
 val str = 
DateTimeUtils.cleanLegacyTimestampStr(UTF8String.fromString(datum))
-DateTimeUtils.stringToTimestamp(str, 
options.zoneId).getOrElse(throw e)
+DateTimeUtils.stringToDate(str).getOrElse(throw e)
 }
   }
 
-case _: TimestampNTZType => (d: String) =>
-  nullSafeDatum(d, name, nullable, options) { datum =>
-timestampNTZFormatter.parseWithoutTimeZone(datum, false)
-  }
-
-case _: DateType => (d: String) =>
+case _: TimestampType => (d: String) =>
   nullSafeDatum(d, name, nullable, options) { datum =>
 try {
-  dateFormatter.parse(datum)
+  timestampFormatter.parse(datum)
 } catch {
   case NonFatal(e) =>
 // If fails to parse, then tries the way used in 2.0 and 1.x for 
backwards
 // compatibility.
 val str = 
DateTimeUtils.cleanLegacyTimestampStr(UTF8String.fromString(datum))
-DateTimeUtils.stringToDate(str).getOrElse(throw e)
+DateTimeUtils.stringToTimestamp(str, options.zoneId).getOrElse {
+  // There may be date type entries in timestamp column due to 
schema inference
+  if (options.inferDate) {
+daysToMicros(dateFormatter.parse(datum), options.zoneId)

Review Comment:
   @bersprockets Thanks for the suggestion! Do you know what the advantage 
of allowing the legacy formatter is? i.e. what is a date format that the legacy 
formatter can handle but the current formatter cannot?
   
   I'm wondering if there will be a sufficient population of users who want to 
infer date in schema and also use legacy date formats
   
   cc: @Yaohua628 






[GitHub] [spark] Jonathancui123 commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

2022-06-24 Thread GitBox


Jonathancui123 commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r905350901


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala:
##
@@ -117,8 +123,19 @@ class CSVInferSchema(val options: CSVOptions) extends 
Serializable {
 case LongType => tryParseLong(field)
 case _: DecimalType => tryParseDecimal(field)
 case DoubleType => tryParseDouble(field)
-case TimestampNTZType => tryParseTimestampNTZ(field)
-case TimestampType => tryParseTimestamp(field)
+case DateType => tryParseDateTime(field)

Review Comment:
   I think this change is necessary:
   Consider the column of a TimestampType followed by a DateType entry. We 
would expect this column to be inferred as a TimestampType column.
   `typeSoFar` will be `Timestamp` when `inferField` is called on the second 
entry which is `DateType`. We need logic in `inferField` to try and parse 
`DateType` even when `typeSoFar` is `Timestamp`
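The merge behaviour described here can be sketched with a toy model. This is a hypothetical simplification (not Spark's `CSVInferSchema.inferField` or `compatibleType`): each field is classified independently, and Date widens to Timestamp when both appear in one column. The `yyyy-MM-dd` and `yyyy-MM-dd'T'HH:mm` patterns are assumptions for illustration.

```scala
import java.time.{LocalDate, LocalDateTime}
import java.time.format.DateTimeFormatter
import scala.util.Try

object InferSketch {
  sealed trait T
  case object DateT extends T
  case object TimestampT extends T
  case object StringT extends T

  private val dateFmt = DateTimeFormatter.ofPattern("yyyy-MM-dd")
  private val tsFmt   = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm")

  // Classify a single field; java.time.parse consumes the whole input,
  // so a timestamp string does not accidentally match the date pattern.
  private def parseField(field: String): T =
    if (Try(LocalDate.parse(field, dateFmt)).isSuccess) DateT
    else if (Try(LocalDateTime.parse(field, tsFmt)).isSuccess) TimestampT
    else StringT

  // Widening rule: Date and Timestamp are compatible and widen to
  // Timestamp; anything else falls back to String.
  private def merge(a: T, b: T): T = (a, b) match {
    case (x, y) if x == y => x
    case (DateT, TimestampT) | (TimestampT, DateT) => TimestampT
    case _ => StringT
  }

  def inferColumn(fields: Seq[String]): T = fields.map(parseField).reduce(merge)
}
```

With this model, a column containing a timestamp followed by a date (or vice versa) comes out as Timestamp, which is the behaviour the comment argues `inferField` needs.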
   
   
   






[GitHub] [spark] Jonathancui123 commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

2022-06-22 Thread GitBox


Jonathancui123 commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r904402104


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala:
##
@@ -110,15 +116,43 @@ class CSVInferSchema(val options: CSVOptions) extends 
Serializable {
   def inferField(typeSoFar: DataType, field: String): DataType = {
 if (field == null || field.isEmpty || field == options.nullValue) {
   typeSoFar
+} else
+if (options.inferDate) {
+  val typeElemInfer = typeSoFar match {
+case NullType => tryParseInteger(field)
+case IntegerType => tryParseInteger(field)
+case LongType => tryParseLong(field)
+case _: DecimalType => tryParseDecimal(field)
+case DoubleType => tryParseDouble(field)
+case DateType => tryParseDateTime(field)
+case TimestampNTZType => tryParseDateTime(field)
+case TimestampType => tryParseDateTime(field)
+case BooleanType => tryParseBoolean(field)
+case StringType => StringType
+case other: DataType =>
+  throw QueryExecutionErrors.dataTypeUnexpectedError(other)
+  }
+  compatibleType(typeSoFar, typeElemInfer).getOrElse(StringType)
 } else {
   val typeElemInfer = typeSoFar match {
 case NullType => tryParseInteger(field)
 case IntegerType => tryParseInteger(field)
 case LongType => tryParseLong(field)
 case _: DecimalType => tryParseDecimal(field)
 case DoubleType => tryParseDouble(field)
-case TimestampNTZType => tryParseTimestampNTZ(field)
-case TimestampType => tryParseTimestamp(field)
+case DateType => tryParseDateTime(field)
+case TimestampNTZType =>
+  if (options.inferDate) {

Review Comment:
   My bad, I meant to remove the first match expression but missed it during 
code cleanup






[GitHub] [spark] Jonathancui123 commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

2022-06-21 Thread GitBox


Jonathancui123 commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r903186613


##
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala:
##
@@ -2753,6 +2754,35 @@ abstract class CSVSuite
   }
 }
   }
+
+  test("SPARK-39469: Infer schema for date type") {
+val options = Map(
+  "header" -> "true",
+  "inferSchema" -> "true",
+  "timestampFormat" -> "yyyy-MM-dd'T'HH:mm",
+  "dateFormat" -> "yyyy-MM-dd",
+  "inferDate" -> "true")
+
+val results = spark.read
+  .format("csv")
+  .options(options)
+  .load(testFile(dateInferSchemaFile))
+
+val expectedSchema = StructType(List(StructField("date", DateType),
+  StructField("timestamp-date", TimestampType), 
StructField("date-timestamp", TimestampType)))
+assert(results.schema == expectedSchema)
+
+val expected =
+  Seq(
+Seq(Date.valueOf("2001-9-8"), Timestamp.valueOf("2014-10-27 
18:30:0.0"),
+  Timestamp.valueOf("1765-03-28 00:00:0.0")),
+Seq(Date.valueOf("1941-1-2"), Timestamp.valueOf("2000-09-14 
01:01:0.0"),
+  Timestamp.valueOf("1423-11-12 23:41:0.0")),
+Seq(Date.valueOf("0293-11-7"), Timestamp.valueOf("1995-06-25 
00:00:00.0"),
+  Timestamp.valueOf("2016-01-28 20:00:00.0"))
+  )
+assert(results.collect().toSeq.map(_.toSeq) == expected)
+  }

Review Comment:
   > One test we might need would be timestampFormat" -> "dd/MM/yyyy HH:mm and 
dateFormat -> dd/MM/yyyy to make sure timestamps are not parsed as date types 
without conflicting.
   
   This test uses: 
   ```
  "timestampFormat" -> "yyyy-MM-dd'T'HH:mm",
  "dateFormat" -> "yyyy-MM-dd",
   ```
   
   This e2e test ensures that our DateFormatter is using strict parsing. We 
will not infer Timestamp columns as Date columns if the `DateFormat` is a 
prefix of the `TimestampFormat`. 
   
   Thank you for the review! @HyukjinKwon @bersprockets 






[GitHub] [spark] Jonathancui123 commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

2022-06-21 Thread GitBox


Jonathancui123 commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r903189990


##
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/csv/UnivocityParserSuite.scala:
##
@@ -358,4 +358,19 @@ class UnivocityParserSuite extends SparkFunSuite with 
SQLHelper {
   Map("timestampFormat" -> "invalid", "dateFormat" -> "invalid"), false, 
"UTC")
 check(new UnivocityParser(StructType(Seq.empty), optionsWithPattern))
   }
+
+  test("dates should be parsed correctly in a timestamp column") {
+def checkDate(dataType: DataType): Unit = {
+  val timestampsOptions =
+new CSVOptions(Map("timestampFormat" -> "dd/MM/yyyy HH:mm",
+  "timestampNTZFormat" -> "dd-MM-yyyy HH:mm", "dateFormat" -> 
"dd_MM_yyyy"),

Review Comment:
   I addressed inference mistakes in [the following code snippet and 
comment](https://github.com/apache/spark/pull/36871#discussion_r903186613)






[GitHub] [spark] Jonathancui123 commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

2022-06-21 Thread GitBox


Jonathancui123 commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r903187641


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala:
##
@@ -197,34 +199,50 @@ class UnivocityParser(
 Decimal(decimalParser(datum), dt.precision, dt.scale)
   }
 
-case _: TimestampType => (d: String) =>
+case _: DateType => (d: String) =>
   nullSafeDatum(d, name, nullable, options) { datum =>
 try {
-  timestampFormatter.parse(datum)
+  dateFormatter.parse(datum)
 } catch {
   case NonFatal(e) =>
 // If fails to parse, then tries the way used in 2.0 and 1.x for 
backwards
 // compatibility.
 val str = 
DateTimeUtils.cleanLegacyTimestampStr(UTF8String.fromString(datum))
-DateTimeUtils.stringToTimestamp(str, 
options.zoneId).getOrElse(throw e)
+DateTimeUtils.stringToDate(str).getOrElse(throw e)
 }
   }
 
-case _: TimestampNTZType => (d: String) =>
-  nullSafeDatum(d, name, nullable, options) { datum =>
-timestampNTZFormatter.parseWithoutTimeZone(datum, false)
-  }
-
-case _: DateType => (d: String) =>
+case _: TimestampType => (d: String) =>
   nullSafeDatum(d, name, nullable, options) { datum =>
 try {
-  dateFormatter.parse(datum)
+  timestampFormatter.parse(datum)
 } catch {
   case NonFatal(e) =>
 // If fails to parse, then tries the way used in 2.0 and 1.x for 
backwards
 // compatibility.
 val str = 
DateTimeUtils.cleanLegacyTimestampStr(UTF8String.fromString(datum))
-DateTimeUtils.stringToDate(str).getOrElse(throw e)
+DateTimeUtils.stringToTimestamp(str, options.zoneId).getOrElse {
+  // There may be date type entries in timestamp column due to 
schema inference
+  if (options.inferDate) {
+daysToMicros(dateFormatter.parse(datum), options.zoneId)

Review Comment:
   We do not use the legacy DateFormatter here to avoid parsing timestamps with 
invalid suffixes. We want to throw an error when invalid timestamps are given. 
   
   e.g. The legacy DateFormatter will parse the following string without 
throwing an error:
   dateFormat: `yyyy-mm-dd`
   string: `2001-09-08-randomtext`
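The leniency being avoided here is easy to demonstrate outside Spark. Below is a minimal sketch (hypothetical helpers, not Spark's formatter classes) contrasting the legacy `java.text.SimpleDateFormat` behaviour, which parses a valid prefix and silently ignores trailing characters, with `java.time` parsing, which requires the whole input to be consumed:

```scala
import java.text.SimpleDateFormat
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import scala.util.Try

object ParseStrictness {
  // Legacy path: SimpleDateFormat.parse(String) stops after the valid
  // prefix "2001-09-08" and ignores the rest, so no error is raised.
  def legacyAccepts(s: String): Boolean =
    Try(new SimpleDateFormat("yyyy-MM-dd").parse(s)).isSuccess

  // Non-legacy path: LocalDate.parse fails on unconsumed trailing text.
  def strictAccepts(s: String): Boolean =
    Try(LocalDate.parse(s, DateTimeFormatter.ofPattern("yyyy-MM-dd"))).isSuccess
}
```

Here `legacyAccepts("2001-09-08-randomtext")` succeeds while `strictAccepts` rejects it, which is why strict parsing is needed to keep non-date entries out of inferred date columns.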






[GitHub] [spark] Jonathancui123 commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

2022-06-21 Thread GitBox


Jonathancui123 commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r903184632


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala:
##
@@ -206,28 +218,27 @@ class UnivocityParser(
 // If fails to parse, then tries the way used in 2.0 and 1.x for 
backwards
 // compatibility.
 val str = 
DateTimeUtils.cleanLegacyTimestampStr(UTF8String.fromString(datum))
-DateTimeUtils.stringToTimestamp(str, 
options.zoneId).getOrElse(throw e)
+DateTimeUtils.stringToTimestamp(str, options.zoneId).getOrElse {
+  // There may be date type entries in timestamp column due to 
schema inference
+  convertToDate(datum)

Review Comment:
   Consider the column of a `DateType` followed by a `TimestampType`. We would 
expect this column to be inferred as a `TimestampType` column. 
   
   Thus, when parsing the column, the timestamp converter will fail on the Date 
entry so we will need to try and convert it with the Date converter. If both 
converters fail, then we will throw an error.






[GitHub] [spark] Jonathancui123 commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

2022-06-15 Thread GitBox


Jonathancui123 commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r898694871


##
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/csv/UnivocityParserSuite.scala:
##
@@ -358,4 +358,19 @@ class UnivocityParserSuite extends SparkFunSuite with 
SQLHelper {
   Map("timestampFormat" -> "invalid", "dateFormat" -> "invalid"), false, 
"UTC")
 check(new UnivocityParser(StructType(Seq.empty), optionsWithPattern))
   }
+
+  test("dates should be parsed correctly in a timestamp column") {

Review Comment:
   Got it! I was under the impression that was only for bug fixes. 


