[ https://issues.apache.org/jira/browse/ARROW-15805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17560679#comment-17560679 ]
Jonathan Keane edited comment on ARROW-15805 at 6/29/22 11:31 PM:
------------------------------------------------------------------

This is alluded to in the PR comments, but taking a step back and thinking about the behavior:

{code}
dates_dash_first <- c("2022-01-01", "2022/02/02", "2022/02/02", "2022/02/02", "2022-01-01", "2022-01-01")
dates_slash_first <- c("2022/02/02", "2022-01-01", "2022/02/02", "2022/02/02", "2022-01-01", "2022-01-01")

as.Date(dates_dash_first, tryFormats = c("%Y-%m-%d", "%Y/%m/%d"))
#> [1] "2022-01-01" NA           NA           NA           "2022-01-01"
#> [6] "2022-01-01"

as.Date(dates_slash_first, tryFormats = c("%Y-%m-%d", "%Y/%m/%d"))
#> [1] "2022-02-02" NA           "2022-02-02" "2022-02-02" NA
#> [6] NA
{code}

Which format is chosen depends on the underlying data and, critically, on the order that data is in: a single format is applied to the whole vector, and every value that does not match it becomes NA. Given that we can't always guarantee the order of the data we are processing[1], we should not attempt to implement this behavior right now.

Instead, we should raise an error if someone tries to specify {{tryFormats}}. The message should suggest that they either use {{lubridate::as_date()}} if they want to specify multiple formats (and can accept that values in formats other than the first one that matches are parsed instead of becoming NAs), or pick the one format they want and use that.
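
To make the order dependence concrete, here is a minimal sketch of how the single format appears to get picked, assuming it is decided by probing only the first non-NA element (which is consistent with the output above). {{pick_format()}} is a hypothetical helper written for illustration; it is not part of base R or of the arrow bindings.

```r
# Hypothetical helper (illustration only): probe the first non-NA element
# and return the first candidate format that parses it.
pick_format <- function(x, try_formats = c("%Y-%m-%d", "%Y/%m/%d")) {
  first <- x[!is.na(x)][1]
  for (fmt in try_formats) {
    if (!is.na(strptime(first, fmt, tz = "GMT"))) {
      return(fmt)
    }
  }
  NA_character_  # no candidate matched (or the input was all NA)
}

dates_dash_first  <- c("2022-01-01", "2022/02/02")
dates_slash_first <- c("2022/02/02", "2022-01-01")

# Same two values, different order, different format selected:
pick_format(dates_dash_first)   # "%Y-%m-%d"
pick_format(dates_slash_first)  # "%Y/%m/%d"

# Applying the selected format to the whole vector turns the other value into NA.
as.Date(dates_dash_first,  format = pick_format(dates_dash_first))
as.Date(dates_slash_first, format = pick_format(dates_slash_first))
```

Because the choice hinges on which element comes first, any engine that processes data in chunks or in an unspecified order could select a different format from one run to the next.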
[1] and even if we could, it would take some tricky expression writing to pick the right format

> [R] Update the as.Date() binding
> --------------------------------
>
>                 Key: ARROW-15805
>                 URL: https://issues.apache.org/jira/browse/ARROW-15805
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>            Reporter: Dragoș Moldovan-Grünfeld
>            Priority: Major
>             Fix For: 9.0.0
>
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)