[GitHub] [spark] cloud-fan commented on pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

GitBox Wed, 21 Sep 2022 00:15:44 -0700


cloud-fan commented on PR #37933:
URL: https://github.com/apache/spark/pull/37933#issuecomment-1253301260


   There are many cases to consider here: 1) the CSV data is pure date, pure 
timestamp, or a mixture. 2) the user specifies a datetime pattern or not.
   
   1. pure date + no datetime pattern: infer as date type
   2. pure timestamp + no datetime pattern: infer as timestamp type
   3. mixture + no datetime pattern: infer as timestamp type
   4. pure date + datetime pattern: if pattern matches, infer as date type, 
otherwise string type
   5. pure timestamp + datetime pattern: if pattern matches, infer as timestamp 
type, otherwise string type
   6. mixture + datetime pattern: I think this is where the problem occurs. We 
first try to parse the data as date and, if that fails, try to parse it as 
timestamp. This is very slow because we invoke the formatter twice. I think we 
shouldn't support a mixture of date and timestamp in this case. If `prefersDate` 
is true, only try to infer as date type; otherwise only try to infer as 
timestamp type.
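
   To make the six cases concrete, here is a rough sketch of the decision 
table in plain Python. It is not Spark's inference code: `infer_type`, 
`matches`, and the two default formats are hypothetical names invented for 
illustration, a single `pattern` argument stands in for the user's datetime 
pattern, and patterns use Python `strptime` syntax rather than Spark's 
datetime pattern syntax.

```python
from datetime import datetime

# Assumed defaults for this sketch only; they are not Spark's actual defaults.
DEFAULT_DATE = "%Y-%m-%d"
DEFAULT_TIMESTAMP = "%Y-%m-%d %H:%M:%S"


def matches(value, fmt):
    """Return True if `value` parses completely with the strptime format `fmt`."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False


def infer_type(values, pattern=None, prefers_date=True):
    """Hypothetical column-level inference following cases 1-6 of the comment."""
    if pattern is None:
        # Cases 1-3: no user pattern, so a mixture is allowed and widens to timestamp.
        if all(matches(v, DEFAULT_DATE) for v in values):
            return "date"
        if all(matches(v, DEFAULT_DATE) or matches(v, DEFAULT_TIMESTAMP)
               for v in values):
            return "timestamp"
        return "string"
    # Cases 4-6: a user pattern is given, so each value is parsed only once.
    # prefers_date picks the single target type; there is no date-then-timestamp retry.
    if all(matches(v, pattern) for v in values):
        return "date" if prefers_date else "timestamp"
    return "string"
```

   With a pattern set, a column that mixes dates and timestamps falls through 
to string because only one parse is attempted per value, which is the 
behaviour proposed for case 6.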

