[ 
https://issues.apache.org/jira/browse/ARROW-14471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488048#comment-17488048
 ] 

Dragoș Moldovan-Grünfeld edited comment on ARROW-14471 at 2/7/22, 3:08 PM:
---------------------------------------------------------------------------

[~paleolimbot] I don't think we can rely on {{coalesce()}} to iterate through 
the various formats supported by {{ymd()}}. That approach assumes that 
{{strptime()}} fails whenever the passed {{format}} does not match the data. 
Unfortunately, arrow accepts a mismatched format without failing and returns 
incorrect timestamps:
{code:r}
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(arrow))
suppressPackageStartupMessages(library(lubridate))

df <- tibble(x = c("09-01-01", "09-01-02", "09-01-03"))
df
#> # A tibble: 3 × 1
#>   x       
#>   <chr>   
#> 1 09-01-01
#> 2 09-01-02
#> 3 09-01-03

# lubridate::ymd()
df %>% 
  mutate(y = ymd(x))
#> # A tibble: 3 × 2
#>   x        y         
#>   <chr>    <date>    
#> 1 09-01-01 2009-01-01
#> 2 09-01-02 2009-01-02
#> 3 09-01-03 2009-01-03

# %y (two-digit year): parsed correctly
df %>% 
  record_batch() %>% 
  mutate(y = strptime(x, format = "%y-%m-%d", unit = "us")) %>% 
  collect()
#> # A tibble: 3 × 2
#>   x        y                  
#>   <chr>    <dttm>             
#> 1 09-01-01 2009-01-01 00:00:00
#> 2 09-01-02 2009-01-02 00:00:00
#> 3 09-01-03 2009-01-03 00:00:00

# %Y (four-digit year): this would need to fail for us to rely on coalesce()
df %>% 
  record_batch() %>% 
  mutate(y = strptime(x, format = "%Y-%m-%d", unit = "us")) %>% 
  collect()
#> # A tibble: 3 × 2
#>   x        y                  
#>   <chr>    <dttm>             
#> 1 09-01-01 0008-12-31 23:58:45
#> 2 09-01-02 0009-01-01 23:58:45
#> 3 09-01-03 0009-01-02 23:58:45
{code}
Therefore, my early (and somewhat naive) conclusion would be that we cannot 
implement the {{arrow::ymd()}} binding as {{coalesce(strptime(x, format1), 
strptime(x, format2), ...)}}. What do you think?
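
As a rough illustration of the problem, here is a minimal base R sketch. It 
uses {{as.Date()}} rather than arrow's {{strptime}} kernel, and 
{{parse_first_match()}} is a hypothetical helper (not an arrow or lubridate 
function). It tries each format in turn and keeps the first non-NA result, 
which is what the {{coalesce()}} chain would do:

```r
# Hypothetical helper: emulate coalesce(parse(x, fmt1), parse(x, fmt2), ...)
# by keeping, for each element, the first parse that is not NA.
parse_first_match <- function(x, formats) {
  out <- rep(as.Date(NA), length(x))
  for (fmt in formats) {
    missing <- is.na(out)
    # Only elements that are still NA are retried with the next format.
    out[missing] <- as.Date(x[missing], format = fmt)
  }
  out
}

x <- c("09-01-01", "09-01-02", "09-01-03")

# A mismatched format that yields NA lets the fallback format run.
good <- parse_first_match(x, c("%d/%m/%Y", "%y-%m-%d"))

# "%Y" accepts "09" as the year 9, so no NA is produced and
# the correct "%y" format is never tried.
bad <- parse_first_match(x, c("%Y-%m-%d", "%y-%m-%d"))

as.POSIXlt(good)$year + 1900
as.POSIXlt(bad)$year + 1900
```

The second call never reaches {{%y}} because base R's {{%Y}} also accepts a 
two-digit year, so the fallback only helps when a mismatched format actually 
produces NA.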



> [R] Implement lubridate's date/time parsing functions
> -----------------------------------------------------
>
>                 Key: ARROW-14471
>                 URL: https://issues.apache.org/jira/browse/ARROW-14471
>             Project: Apache Arrow
>          Issue Type: Sub-task
>          Components: R
>            Reporter: Nicola Crane
>            Assignee: Dragoș Moldovan-Grünfeld
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 8.0.0
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> Parse dates with year, month, and day components:
> ymd() ydm() mdy() myd() dmy() dym() yq() ym() my()
>       
> Parse date-times with year, month, and day, hour, minute, and second 
> components:
> ymd_hms() ymd_hm() ymd_h() dmy_hms() dmy_hm() dmy_h() mdy_hms() mdy_hm() 
> mdy_h() ydm_hms() ydm_hm() ydm_h()
> Parse periods with hour, minute, and second components:
> ms() hm() hms()
>       



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
