[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Correct CSV schema inference and data parsing behavior on columns with mixed dates and timestamps

GitBox Wed, 21 Sep 2022 17:22:52 -0700


xiaonanyang-db commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r977079883



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala:
##########
@@ -233,7 +238,39 @@ class CSVInferSchema(val options: CSVOptions) extends 
Serializable {
    * is compatible with both input data types.
    */
   private def compatibleType(t1: DataType, t2: DataType): Option[DataType] = {
-    TypeCoercion.findTightestCommonType(t1, 
t2).orElse(findCompatibleTypeForCSV(t1, t2))
+    (t1, t2) match {
+      case (DateType, TimestampType) | (DateType, TimestampNTZType) |
+           (TimestampNTZType, DateType) | (TimestampType, DateType) =>
+        // For a column containing a mixture of dates and timestamps
+        // infer it as timestamp type if its dates can be inferred as 
timestamp type
+        // otherwise infer it as StringType

Review Comment:
   We want to have consistent behavior when timestamp format is not specified.
   
   When `prefersDate=false`, a column with mixed date and timestamp could be 
inferred as timestamp if possible.
   Thus, we added the additional handling here for a similar behavior as above 
when `prefersDate=true`.
   
   Does this make sense?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Correct CSV schema inference and data parsing behavior on columns with mixed dates and timestamps

Reply via email to