cloud-fan commented on PR #37933: URL: https://github.com/apache/spark/pull/37933#issuecomment-1253301260
There are many cases to consider here, along two dimensions:
1) whether the CSV data is pure date, pure timestamp, or a mixture;
2) whether the user specifies a datetime pattern or not.

1. pure date + no datetime pattern: infer as date type
2. pure timestamp + no datetime pattern: infer as timestamp type
3. mixture + no datetime pattern: infer as timestamp type
4. pure date + datetime pattern: if the pattern matches, infer as date type, otherwise string type
5. pure timestamp + datetime pattern: if the pattern matches, infer as timestamp type, otherwise string type
6. mixture + datetime pattern: I think this is where the problem occurs. We first try to parse the data as date and, if that fails, try to parse it as timestamp. This is very slow because we invoke the formatter twice.

I think we shouldn't support a mixture of date and timestamp in this case. If `prefersDate` is true, only try to infer as date type; otherwise only try to infer as timestamp type.
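
To make the proposal for the pattern-specified cases (4 to 6) concrete, here is a rough Python sketch. It is not Spark code. `infer_with_pattern`, `matches`, and the format strings are hypothetical names for illustration, and Python `strptime` codes stand in for Spark's datetime patterns. It also assumes that a column whose values do not all match the single chosen pattern falls back to string type, which the comment above does not spell out for the mixture case.

```python
from datetime import datetime


def matches(value, fmt):
    """Return True if `value` parses completely with the strptime format `fmt`."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False


def infer_with_pattern(values, date_fmt, timestamp_fmt, prefers_date):
    """Infer a column type with at most one parse attempt per value.

    Hypothetical helper: when prefers_date is true only the date pattern is
    tried, otherwise only the timestamp pattern is tried. There is no second
    attempt with the other pattern.
    """
    if prefers_date:
        fmt, inferred = date_fmt, "date"
    else:
        fmt, inferred = timestamp_fmt, "timestamp"
    # Assumption: any value that does not match makes the column a string column.
    return inferred if all(matches(v, fmt) for v in values) else "string"


DATE_FMT = "%Y-%m-%d"
TS_FMT = "%Y-%m-%d %H:%M:%S"

dates = ["2022-09-21", "2022-09-22"]
timestamps = ["2022-09-21 10:00:00", "2022-09-22 11:30:00"]
mixture = ["2022-09-21", "2022-09-22 11:30:00"]

print(infer_with_pattern(dates, DATE_FMT, TS_FMT, prefers_date=True))
print(infer_with_pattern(timestamps, DATE_FMT, TS_FMT, prefers_date=False))
print(infer_with_pattern(mixture, DATE_FMT, TS_FMT, prefers_date=True))
```

Under these assumptions, each value goes through the formatter once, and a mixed column is no longer inferred as timestamp when a pattern is given.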