[jira] [Comment Edited] (ARROW-14471) [R] Implement lubridate's date/time parsing functions
[ https://issues.apache.org/jira/browse/ARROW-14471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488164#comment-17488164 ]

Dragoș Moldovan-Grünfeld edited comment on ARROW-14471 at 2/8/22, 3:54 PM:
---

{{lubridate}} uses {{guess_formats()}} to identify the likely candidates. We could try something similar: keep a list of supported formats (something like [this|https://github.com/dragosmg/arrow/blob/cfba9e1dfbedd5dfdf652c805e93692808dd092e/r/R/dplyr-funcs-datetime.R#L152-L196]), narrow it down to the most likely ones, and only then use something like {{coalesce()}}.
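
To make the narrowing step concrete, here is a minimal base R sketch; it is not arrow code and it does not reproduce how lubridate's {{guess_formats()}} works internally. The {{supported_formats}} table, its regular expressions, and the {{likely_formats()}} helper are all hypothetical names invented for this illustration.

```r
# Hypothetical table of supported ymd-style formats (an assumption for this
# sketch). Names are strptime formats; values are regular expressions that
# describe the shape of the strings each format should be allowed to parse.
supported_formats <- c(
  "%Y-%m-%d" = "^\\d{4}-\\d{2}-\\d{2}$",
  "%y-%m-%d" = "^\\d{2}-\\d{2}-\\d{2}$",
  "%Y%m%d"   = "^\\d{8}$",
  "%y%m%d"   = "^\\d{6}$"
)

# Hypothetical helper: keep only the formats whose pattern matches at least
# one non-missing input string.
likely_formats <- function(x, candidates = supported_formats) {
  x <- x[!is.na(x)]
  hits <- vapply(
    candidates,
    function(pattern) any(grepl(pattern, x, perl = TRUE)),
    logical(1)
  )
  names(candidates)[hits]
}

# Only the two-digit-year format survives for the data discussed in this issue.
likely_formats(c("09-01-01", "09-01-02", "09-01-03"))
```

Only the surviving formats would then be passed on to the parsing and {{coalesce()}} step, so a four-digit-year format is never tried against two-digit-year data.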
[jira] [Comment Edited] (ARROW-14471) [R] Implement lubridate's date/time parsing functions
[ https://issues.apache.org/jira/browse/ARROW-14471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488059#comment-17488059 ]

Dragoș Moldovan-Grünfeld edited comment on ARROW-14471 at 2/8/22, 3:54 PM:
---

Another alternative would be for {{strptime}} to error when the selected format does not match the data. For example, attempting to parse {{"09-12-31"}} with {{"%Y-%m-%d"}} should error, because the year has two digits where the format expects four. We could then rely on this behaviour with {{coalesce}}.

was (Author: dragosmg):
Another alternative would be for {{strptime}} to error when the selected format does not match the data (for example, attempting to parse {{"09-12-31"}} with {{"%Y-%m-%d"}} should error due to a mismatch in the length of the year). Then we could rely on this behaviour with {{coalesce}}. Should I create a ticket for this?
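
As a rough illustration of why a strict parser would make {{coalesce}} usable, the base R sketch below emulates strictness with a round-trip check: a value is kept only if formatting the parsed date with the same format reproduces the input. This is a stand-in technique, not arrow's {{strptime}} and not a proposal for how the C++ kernel should work; {{strict_date()}} and {{coalesce_dates()}} are hypothetical helpers, and mismatches become {{NA}} rather than errors.

```r
# Hypothetical helper: parse strictly by round-tripping through format().
# Inputs that do not reproduce themselves (for example a two-digit year read
# with %Y) are turned into NA. Assumes zero-padded input such as "09-01-02".
strict_date <- function(x, format) {
  parsed <- as.Date(x, format = format)
  ok <- !is.na(parsed) & format(parsed, format) == x
  parsed[!ok] <- NA
  parsed
}

# Hypothetical helper: element-wise first non-missing value, in the spirit
# of coalesce().
coalesce_dates <- function(...) {
  Reduce(
    function(a, b) {
      a[is.na(a)] <- b[is.na(a)]
      a
    },
    list(...)
  )
}

x <- c("09-12-31", "2009-12-31")

# With strict parsing, the order in which the formats are tried no longer
# matters: each string is only accepted by the format that really fits it.
res <- coalesce_dates(
  strict_date(x, "%Y-%m-%d"),
  strict_date(x, "%y-%m-%d")
)
format(res)
```

The round-trip check is deliberately conservative: it would also reject non-padded strings such as {{"2009-1-2"}}, which a real implementation might want to accept.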
[jira] [Comment Edited] (ARROW-14471) [R] Implement lubridate's date/time parsing functions
[ https://issues.apache.org/jira/browse/ARROW-14471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488048#comment-17488048 ]

Dragoș Moldovan-Grünfeld edited comment on ARROW-14471 at 2/7/22, 3:08 PM:
---

[~paleolimbot] I don't think we can rely on {{coalesce()}} to iterate through the various formats supported for {{ymd()}}. That approach relies on the assumption that the passed {{format}} either matches the data or the parse fails. Sadly, arrow parses successfully even with the wrong format, producing incorrect timestamps:
{code:r}
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(arrow))
suppressPackageStartupMessages(library(lubridate))

df <- tibble(x = c("09-01-01", "09-01-02", "09-01-03"))
df
#> # A tibble: 3 × 1
#>   x
#>   <chr>
#> 1 09-01-01
#> 2 09-01-02
#> 3 09-01-03

# lubridate::ymd()
df %>%
  mutate(y = ymd(x))
#> # A tibble: 3 × 2
#>   x        y
#>   <chr>    <date>
#> 1 09-01-01 2009-01-01
#> 2 09-01-02 2009-01-02
#> 3 09-01-03 2009-01-03

# %y = short (two-digit) year: parsed correctly
df %>%
  record_batch() %>%
  mutate(y = strptime(x, format = "%y-%m-%d", unit = "us")) %>%
  collect()
#> # A tibble: 3 × 2
#>   x        y
#>   <chr>    <dttm>
#> 1 09-01-01 2009-01-01 00:00:00
#> 2 09-01-02 2009-01-02 00:00:00
#> 3 09-01-03 2009-01-03 00:00:00

# %Y = long (four-digit) year: this should fail for us to be able to rely on coalesce()
df %>%
  record_batch() %>%
  mutate(y = strptime(x, format = "%Y-%m-%d", unit = "us")) %>%
  collect()
#> # A tibble: 3 × 2
#>   x        y
#>   <chr>    <dttm>
#> 1 09-01-01 0008-12-31 23:58:45
#> 2 09-01-02 0009-01-01 23:58:45
#> 3 09-01-03 0009-01-02 23:58:45
{code}
Therefore, my early (and somewhat naive) conclusion would be that we cannot implement the {{arrow::ymd()}} binding as {{coalesce(strptime(x, format1), strptime(x, format2), ...)}}. What do you think?

> [R] Implement lubridate's date/time parsing functions
> -----------------------------------------------------
>
> Key: ARROW-14471
> URL: https://issues.apache.org/jira/browse/ARROW-14471
> Project: Apache Arrow
> Issue Type: Sub-task
> Components: R
> Reporter: Nicola Crane
> Assignee: Dragoș Moldovan-Grünfeld
> Priority: Major
> Labels: pull-request-available
> Fix For: 8.0.0
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
> Parse dates with year, month, and day components:
> ymd() ydm() mdy() myd() dmy() dym() yq() ym() my()
>
> Parse date-times with year, month, and day, hour, minute, and second components:
> ymd_hms() ymd_hm() ymd_h() dmy_hms() dmy_hm() dmy_h() mdy_hms() mdy_hm()
> mdy_h() ydm_hms() ydm_hm() ydm_h()
> Parse periods with hour, minute, and second components:
> ms() hm() hms()

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
[jira] [Comment Edited] (ARROW-14471) [R] Implement lubridate's date/time parsing functions
[ https://issues.apache.org/jira/browse/ARROW-14471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488050#comment-17488050 ]

Dragoș Moldovan-Grünfeld edited comment on ARROW-14471 at 2/7/22, 11:41 AM:

We could do some processing to figure out how many characters the string to be parsed has between the separators (or how many characters it has in total, when there is no separator) and only try the suitable formats. That is, in the example above we would not try to parse with {{%Y}}, only with {{%y}}.
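
A minimal base R sketch of that character-counting idea, under simplifying assumptions: the input is year-month-day, components are zero-padded, and the separator (if any) is known in advance. {{pick_ymd_format()}} is a hypothetical helper written for this illustration, not part of arrow or lubridate.

```r
# Hypothetical helper: pick a strptime format per element by counting
# characters. With separators, the width of the first component decides
# between %y and %Y; without separators, the total width does.
pick_ymd_format <- function(x, sep = "-") {
  pieces <- strsplit(x, sep, fixed = TRUE)
  n_pieces <- lengths(pieces)
  # Width of the leading component: the year when separators are present,
  # the whole string otherwise.
  lead_width <- nchar(vapply(pieces, `[`, character(1), 1L))

  fmt <- rep(NA_character_, length(x))
  fmt[which(n_pieces == 3L & lead_width == 2L)] <- paste("%y", "%m", "%d", sep = sep)
  fmt[which(n_pieces == 3L & lead_width == 4L)] <- paste("%Y", "%m", "%d", sep = sep)
  fmt[which(n_pieces == 1L & lead_width == 6L)] <- "%y%m%d"
  fmt[which(n_pieces == 1L & lead_width == 8L)] <- "%Y%m%d"
  fmt
}

x <- c("09-01-01", "2009-01-02", "090103")
fmt <- pick_ymd_format(x)

# strptime formats are vectorised, so each string is parsed with the format
# chosen for it; strings with no suitable format get NA as their format.
as.Date(x, format = fmt)
```

Because the format is chosen before parsing, {{%Y}} is never applied to a two-digit year, which avoids the incorrect timestamps shown in the example above.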