[jira] [Comment Edited] (ARROW-14471) [R] Implement lubridate's date/time parsing functions

2022-02-08 Thread Jira


[ 
https://issues.apache.org/jira/browse/ARROW-14471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488164#comment-17488164
 ] 

Dragoș Moldovan-Grünfeld edited comment on ARROW-14471 at 2/8/22, 3:54 PM:
---

{{lubridate}} uses {{guess_formats()}} to identify the likely candidates. We 
could try something similar, where we have a list of supported formats 
(something similar to 
[this|https://github.com/dragosmg/arrow/blob/cfba9e1dfbedd5dfdf652c805e93692808dd092e/r/R/dplyr-funcs-datetime.R#L152-L196]),
 which we then narrow down to the most likely ones. Only then would we use 
something like {{coalesce()}}.
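
As a rough illustration of the narrowing idea, here is a minimal base R sketch. It is not the arrow implementation and does not call {{guess_formats()}}. The names {{supported_formats}}, {{likely_formats()}} and {{parse_with_likely_formats()}} are hypothetical, and the regular expressions are assumptions about what each format should accept.

```r
# Hypothetical table of supported formats. Each strptime format is paired
# with a regular expression for the strings it is meant to parse.
supported_formats <- c(
  "%Y-%m-%d" = "^[0-9]{4}-[0-9]{2}-[0-9]{2}$",
  "%y-%m-%d" = "^[0-9]{2}-[0-9]{2}-[0-9]{2}$",
  "%Y%m%d"   = "^[0-9]{8}$",
  "%y%m%d"   = "^[0-9]{6}$"
)

# Narrow the list down to formats whose pattern matches at least one value.
likely_formats <- function(x, candidates = supported_formats) {
  x <- x[!is.na(x)]
  matched <- vapply(
    candidates,
    function(pattern) any(grepl(pattern, x)),
    logical(1)
  )
  names(candidates)[matched]
}

# Coalesce-style parsing: for each value, keep the first successful parse
# among the likely formats, and only try a format on values it matches.
parse_with_likely_formats <- function(x, candidates = supported_formats) {
  result <- rep(as.Date(NA), length(x))
  for (fmt in likely_formats(x, candidates)) {
    todo <- is.na(result) & grepl(candidates[[fmt]], x)
    result[todo] <- as.Date(x[todo], format = fmt)
  }
  result
}

x <- c("09-01-01", "2009-01-02", NA)
likely_formats(x)
parse_with_likely_formats(x)
```

Because each format is only applied to the values its pattern matches, a two-digit year is never handed to {{%Y}} in this sketch.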


was (Author: dragosmg):
{{lubridate}} has {{guess_formats()}} to identify the likely candidates. We 
could try something similar, where we have a list of supported formats 
(something similar to 
[this|https://github.com/dragosmg/arrow/blob/cfba9e1dfbedd5dfdf652c805e93692808dd092e/r/R/dplyr-funcs-datetime.R#L152-L196]),
 which we then narrow down to the most likely ones. Only then use something 
like {{coalesce()}}.

> [R] Implement lubridate's date/time parsing functions
> -
>
> Key: ARROW-14471
> URL: https://issues.apache.org/jira/browse/ARROW-14471
> Project: Apache Arrow
> Issue Type: Sub-task
> Components: R
> Reporter: Nicola Crane
> Assignee: Dragoș Moldovan-Grünfeld
> Priority: Major
> Labels: pull-request-available
> Fix For: 8.0.0
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
> Parse dates with year, month, and day components:
> ymd() ydm() mdy() myd() dmy() dym() yq() ym() my()
>   
> Parse date-times with year, month, and day, hour, minute, and second 
> components:
> ymd_hms() ymd_hm() ymd_h() dmy_hms() dmy_hm() dmy_h() mdy_hms() mdy_hm() 
> mdy_h() ydm_hms() ydm_hm() ydm_h()
> Parse periods with hour, minute, and second components:
> ms() hm() hms()
>   



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Comment Edited] (ARROW-14471) [R] Implement lubridate's date/time parsing functions

2022-02-08 Thread Jira


[ 
https://issues.apache.org/jira/browse/ARROW-14471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488059#comment-17488059
 ] 

Dragoș Moldovan-Grünfeld edited comment on ARROW-14471 at 2/8/22, 3:54 PM:
---

Another alternative would be for {{strptime}} to error when the selected format 
does not match the data. For example, attempting to parse {{"09-12-31"}} with 
{{"%Y-%m-%d"}} should error because the year has two digits rather than four. 
We could then rely on this behaviour with {{coalesce}}.
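
To make the idea concrete, here is a minimal base R sketch of a parser that errors on a mismatch, plus a loop that relies on that error to pick the first format that fits. It uses base R's {{as.Date()}} rather than arrow's {{strptime}} kernel. The names {{strict_parse()}} and {{parse_first_matching()}} are hypothetical, and the round-trip check is only one assumed definition of "does not match" (it also rejects unpadded input such as {{"2009-1-1"}}).

```r
# Parse `x` with `fmt` and error unless formatting the result with the same
# format reproduces the input exactly (a round-trip check).
strict_parse <- function(x, fmt) {
  parsed <- as.Date(x, format = fmt)
  round_trip <- format(parsed, fmt)
  mismatch <- !is.na(x) & (is.na(parsed) | round_trip != x)
  if (any(mismatch)) {
    stop("format '", fmt, "' does not match the data")
  }
  parsed
}

# Try each candidate format in order and return the first strict parse that
# succeeds. Error if none of the candidates matches.
parse_first_matching <- function(x, formats) {
  for (fmt in formats) {
    result <- tryCatch(strict_parse(x, fmt), error = function(e) NULL)
    if (!is.null(result)) {
      return(result)
    }
  }
  stop("none of the candidate formats matches the data")
}

# "%Y-%m-%d" is rejected for two-digit years, so "%y-%m-%d" is used instead.
parse_first_matching(c("09-12-31", "09-01-02"), c("%Y-%m-%d", "%y-%m-%d"))
```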


was (Author: dragosmg):
Another alternative would be for {{strptime}} to error when the selected format 
does not match the data (for example, attempting to parse {{"09-12-31"}} with 
{{"%Y-%m-%d"}} should error due to a mismatch in the length of the year). 
Then we could rely on this behaviour with {{coalesce}}. Should I create a 
ticket for this?



[jira] [Comment Edited] (ARROW-14471) [R] Implement lubridate's date/time parsing functions

2022-02-07 Thread Jira


[ 
https://issues.apache.org/jira/browse/ARROW-14471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488048#comment-17488048
 ] 

Dragoș Moldovan-Grünfeld edited comment on ARROW-14471 at 2/7/22, 3:08 PM:
---

[~paleolimbot] I don't think we can rely on {{coalesce()}} to iterate through 
the various formats supported for {{ymd()}}. That approach relies on the 
assumption that parsing fails whenever the passed {{format}} does not match the 
data. Unfortunately, arrow parses without error even when the format is wrong, 
producing incorrect timestamps:
{code:r}
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(arrow))
suppressPackageStartupMessages(library(lubridate))

df <- tibble(x = c("09-01-01", "09-01-02", "09-01-03"))
df
#> # A tibble: 3 × 1
#>   x
#>   <chr>
#> 1 09-01-01
#> 2 09-01-02
#> 3 09-01-03

# lubridate::ymd() handles the two-digit year
df %>% 
  mutate(y = ymd(x))
#> # A tibble: 3 × 2
#>   x        y
#>   <chr>    <date>
#> 1 09-01-01 2009-01-01
#> 2 09-01-02 2009-01-02
#> 3 09-01-03 2009-01-03

# %y (two-digit year): parsed correctly
df %>% 
  record_batch() %>% 
  mutate(y = strptime(x, format = "%y-%m-%d", unit = "us")) %>% 
  collect()
#> # A tibble: 3 × 2
#>   x        y
#>   <chr>    <dttm>
#> 1 09-01-01 2009-01-01 00:00:00
#> 2 09-01-02 2009-01-02 00:00:00
#> 3 09-01-03 2009-01-03 00:00:00

# %Y (four-digit year): this should fail for us to be able to rely on coalesce()
df %>% 
  record_batch() %>% 
  mutate(y = strptime(x, format = "%Y-%m-%d", unit = "us")) %>% 
  collect()
#> # A tibble: 3 × 2
#>   x        y
#>   <chr>    <dttm>
#> 1 09-01-01 0008-12-31 23:58:45
#> 2 09-01-02 0009-01-01 23:58:45
#> 3 09-01-03 0009-01-02 23:58:45
{code}
Therefore, my early (and somewhat naive) conclusion would be that we cannot 
implement the {{arrow::ymd()}} binding as {{coalesce(strptime(x, format1), 
strptime(x, format2), ...)}}. What do you think?


was (Author: dragosmg):
[~paleolimbot] I don't think we can rely on {{coalesce()}} to iterate through 
the various formats supported for {{ymd()}}. It would need to rely on the 
assumption that the passed {{format}} matches the data or otherwise fail. 
Sadly, arrow works with a wrong format resulting in weird timestamps:

{code:r}
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(arrow))
suppressPackageStartupMessages(library(lubridate))

df <- tibble(x = c("09-01-01", "09-01-02", "09-01-03"))
df
#> # A tibble: 3 × 1
#>   x
#>   <chr>
#> 1 09-01-01
#> 2 09-01-02
#> 3 09-01-03

# lubridate::ymd()
df %>% 
  mutate(y = ymd(x))
#> # A tibble: 3 × 2
#>   x        y
#>   <chr>    <date>
#> 1 09-01-01 2009-01-01
#> 2 09-01-02 2009-01-02
#> 3 09-01-03 2009-01-03

# y = short year correct
df %>% 
  record_batch() %>% 
  mutate(y = strptime(x, format = "%y-%m-%d", unit = "us")) %>% 
  collect()
#> # A tibble: 3 × 2
#>   x        y
#>   <chr>    <dttm>
#> 1 09-01-01 2009-01-01 00:00:00
#> 2 09-01-02 2009-01-02 00:00:00
#> 3 09-01-03 2009-01-03 00:00:00

# Y = long year this should fail in order for us to rely on coalesce
df %>% 
  record_batch() %>% 
  mutate(y = strptime(x, format = "%Y-%m-%d", unit = "us")) %>% 
  collect()
#> # A tibble: 3 × 2
#>   x        y
#>   <chr>    <dttm>
#> 1 09-01-01 0008-12-31 23:58:45
#> 2 09-01-02 0009-01-01 23:58:45
#> 3 09-01-03 0009-01-02 23:58:45
{code}

Therefore, my conclusion would be that we cannot implement {{arrow::ymd()}} 
binding as {{coalesce(strptime(x, format1), strptime(x, format2), ...)}}. What 
do you think?



[jira] [Comment Edited] (ARROW-14471) [R] Implement lubridate's date/time parsing functions

2022-02-07 Thread Jira


[ 
https://issues.apache.org/jira/browse/ARROW-14471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488050#comment-17488050
 ] 

Dragoș Moldovan-Grünfeld edited comment on ARROW-14471 at 2/7/22, 11:41 AM:


We could do some processing to figure out how many characters the string to be 
parsed has between the separators (or how many characters it has in total, 
when there is no separator) and only try the suitable formats. In the example 
above, that means not trying to parse with {{%Y}}, only with {{%y}}.
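
A minimal base R sketch of this width-based check, under simplifying assumptions: the year comes first, all values share the layout of the first non-missing value, and any separator is a single non-digit character. The name {{guess_ymd_format()}} is hypothetical.

```r
# Choose between %y and %Y from the width of the year field.
guess_ymd_format <- function(x) {
  x <- x[!is.na(x)]
  stopifnot(length(x) > 0)
  first <- x[[1]]

  if (grepl("^[0-9]+$", first)) {
    # No separator: the total number of characters decides.
    year_token <- switch(
      as.character(nchar(first)),
      "6" = "%y",
      "8" = "%Y",
      stop("unsupported width: ", nchar(first))
    )
    return(paste0(year_token, "%m%d"))
  }

  # With a separator: count the characters in the first (year) field.
  sep <- substr(gsub("[0-9]", "", first), 1, 1)
  fields <- strsplit(first, sep, fixed = TRUE)[[1]]
  year_token <- switch(
    as.character(nchar(fields[[1]])),
    "2" = "%y",
    "4" = "%Y",
    stop("unsupported year width: ", nchar(fields[[1]]))
  )
  paste(year_token, "%m", "%d", sep = sep)
}

guess_ymd_format(c("09-01-01", "09-01-02", "09-01-03"))
guess_ymd_format("20090101")
```

The returned format string could then be passed to {{strptime}} directly, so the mismatched {{%Y}} case is never attempted.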


was (Author: dragosmg):
We could do some processing to figure out how many characters we have in 
between the separators (or how many characters we have in total, in the cases 
where we have no separator) and only try with the suitable formats. i.e. in the 
example above not try to parse with {{%Y}}, only {{%y}}.
