xiaonanyang-db commented on PR #37933:
URL: https://github.com/apache/spark/pull/37933#issuecomment-1253319740

   > There are many cases to consider here: 1) the CSV data is pure date, pure timestamp, or a mixture. 2) the user specifies datetime pattern or not.
   > 
   > 1. pure date + no datetime pattern: infer as date type
   > 2. pure timestamp + no datetime pattern: infer as timestamp type
   > 3. mixture + no datetime pattern: infer as timestamp type
   > 4. pure date + datetime pattern: if pattern matches, infer as date type, otherwise string type
   > 5. pure timestamp + datetime pattern: if pattern matches, infer as timestamp type, otherwise string type
   > 6. mixture + datetime pattern: I think this is where the problem occurs. We will first parse the data as date, if can't, try parse as timestamp. This is very slow as we invoke the formatter twice. I think we shouldn't support mixture of date and timestamp in this case. If `prefersDate` is true, only try to infer as date type, otherwise only try to infer as timestamp.
   
   Thanks @cloud-fan
   Cases 1, 2, 4, and 5 are already supported. Case 3 is also already supported, but this PR adjusts its implementation.
   For case 6, the behavior after this PR is that we will not always "first parse the data as date, if can't, try parse as timestamp":
   - When typeSoFar is `DateType` or some other tighter type, we will "first parse the data as date, if can't, try parse as timestamp".
   - However, when typeSoFar is already `TimestampType` and we encounter a date value, we will parse it directly as timestamp, which fails, and the column is then finalized as `StringType`.
   In practice, we would not invoke the formatter twice in most cases.
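
   To make the case 6 behavior concrete, here is a rough Python model of it. This is only a sketch and not Spark's actual Scala code: the type names are plain strings, `matches`, `infer_field`, and `infer_column` are hypothetical helpers, and the `strptime` patterns stand in for the user-specified datetime patterns. The `calls` list records which parser was invoked, so the number of formatter invocations is visible.

```python
from datetime import datetime

# Hypothetical stand-ins for the inferred types; these are not Spark's API.
NULL, DATE, TIMESTAMP, STRING = "NullType", "DateType", "TimestampType", "StringType"


def matches(value, pattern):
    """Return True if `value` parses completely under the strptime `pattern`."""
    try:
        datetime.strptime(value, pattern)
        return True
    except ValueError:
        return False


def infer_field(type_so_far, value, date_pattern, ts_pattern, calls):
    """Return the new type after seeing `value`; `calls` logs parser invocations."""
    if type_so_far == STRING:
        return STRING  # already the widest type, no parsing needed
    if type_so_far in (NULL, DATE):
        # Tighter type so far: try date first, then fall through to timestamp.
        calls.append("date")
        if matches(value, date_pattern):
            return DATE
    # Reached directly when type_so_far is TIMESTAMP: the date parser is skipped.
    calls.append("timestamp")
    if matches(value, ts_pattern):
        return TIMESTAMP
    return STRING


def infer_column(values, date_pattern="%Y-%m-%d", ts_pattern="%Y-%m-%dT%H:%M:%S"):
    """Fold infer_field over a column and return (final type, parser call log)."""
    calls = []
    type_so_far = NULL
    for value in values:
        type_so_far = infer_field(type_so_far, value, date_pattern, ts_pattern, calls)
    return type_so_far, calls


if __name__ == "__main__":
    print(infer_column(["2022-01-01", "2022-01-02"]))
    print(infer_column(["2022-01-01T10:00:00", "2022-01-02T11:30:00"]))
    print(infer_column(["2022-01-01T10:00:00", "2022-01-01"]))
```

   In this model a pure-timestamp column pays for the extra date attempt only on its first value; once typeSoFar is `TimestampType`, each later value triggers a single timestamp parse. A date value that follows a timestamp is parsed only as timestamp, fails, and the column ends up as `StringType`.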


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

