[GitHub] [arrow] paleolimbot edited a comment on pull request #11690: ARROW-13371 [R] binding for make_struct -> StructArray$create()

GitBox Wed, 24 Nov 2021 05:41:33 -0800


paleolimbot edited a comment on pull request #11690:
URL: https://github.com/apache/arrow/pull/11690#issuecomment-977888123



   Ok...summary of the changes:
   
   - This now uses `arrow_not_supported()` for `check.rows` and `row.names` in the `data.frame()` translation
   - Added tests for using literals and existing data frames
   
   <details>
   
   ``` r
   library(arrow, warn.conflicts = FALSE)
   library(dplyr, warn.conflicts = FALSE)
   
   df <- record_batch(a = 1:2)
   
   # "normal"
   df %>% mutate(df_col = tibble(a2 = a)) %>% collect()
   #> # A tibble: 2 × 2
   #>       a df_col$a2
   #>   <int>     <int>
   #> 1     1         1
   #> 2     2         2
   df %>% mutate(df_col = data.frame(a2 = a)) %>% collect()
   #> # A tibble: 2 × 2
   #>       a df_col$a2
   #>   <int>     <int>
   #> 1     1         1
   #> 2     2         2
   
   # scalars and existing data frames
   df %>% mutate(df_col = tibble(a2 = "nested value")) %>% collect()
   #> # A tibble: 2 × 2
   #>       a df_col$a2   
   #>   <int> <chr>       
   #> 1     1 nested value
   #> 2     2 nested value
   one_row_df <- tibble(a2 = "nested value")
   df %>% mutate(df_col = one_row_df) %>% collect()
   #> # A tibble: 2 × 2
   #>       a df_col$a2   
   #>   <int> <chr>       
   #> 1     1 nested value
   #> 2     2 nested value
   
   # this is surprising behaviour (to me) of Scalar$create(c("nested value", "nested value2"))
   df %>% mutate(df_col = tibble(a2 = c("nested value", "nested value2"))) %>% collect()
   #> # A tibble: 2 × 2
   #>       a         df_col$a2
   #>   <int> <list<character>>
   #> 1     1               [2]
   #> 2     2               [2]
   
   # opened https://issues.apache.org/jira/browse/ARROW-14828 to fix this
   two_row_df <- tibble(a2 = c("nested value", "nested value2"))
   df %>% mutate(df_col = two_row_df) %>% collect()
   #> # A tibble: 2 × 2
   #>       a df_col$a2   
   #>   <int> <chr>       
   #> 1     1 nested value
   #> 2     2 nested value
   
   # duplicated cols
   df %>% mutate(df_col = tibble(a, a)) %>% collect()
   #> Warning: Expression tibble(a, a) not supported in Arrow; pulling data into R
   #> Error: Problem with `mutate()` column `df_col`.
   #> ℹ `df_col = tibble(a, a)`.
   #> x Column name `a` must not be duplicated.
   #> Use .name_repair to specify repair.
   df %>% mutate(df_col = data.frame(a, a)) %>% collect()
   #> # A tibble: 2 × 2
   #>       a df_col$a  $a.1
   #>   <int>    <int> <int>
   #> 1     1        1     1
   #> 2     2        2     2
   df %>% mutate(df_col = data.frame(a, a, check.names = FALSE)) %>% collect()
   #> # A tibble: 2 × 2
   #>       a df_col$a    $a
   #>   <int>    <int> <int>
   #> 1     1        1     1
   #> 2     2        2     2
   
   # empty names
   df %>% 
     mutate(df_col = data.frame(a, check.names = TRUE, fix.empty.names = TRUE)) %>%
     collect()
   #> # A tibble: 2 × 2
   #>       a df_col$a
   #>   <int>    <int>
   #> 1     1        1
   #> 2     2        2
   df %>% 
     mutate(df_col = data.frame(a, check.names = TRUE, fix.empty.names = FALSE)) %>%
     collect()
   #> # A tibble: 2 × 2
   #>       a df_col$``
   #>   <int>     <int>
   #> 1     1         1
   #> 2     2         2
   df %>% 
     mutate(df_col = data.frame(a, check.names = FALSE, fix.empty.names = TRUE)) %>%
     collect()
   #> # A tibble: 2 × 2
   #>       a df_col$a
   #>   <int>    <int>
   #> 1     1        1
   #> 2     2        2
   df %>% 
     mutate(df_col = data.frame(a, check.names = FALSE, fix.empty.names = FALSE)) %>%
     collect()
   #> # A tibble: 2 × 2
   #>       a df_col$``
   #>   <int>     <int>
   #> 1     1         1
   #> 2     2         2
   
   # arrow_not_supported
   df %>% mutate(df_col = tibble(a, .rows = 1L)) %>% collect()
   #> Warning: In tibble(a, .rows = 1L), .rows not supported in Arrow; pulling data
   #> into R
   #> # A tibble: 2 × 2
   #>       a df_col$a
   #>   <int>    <int>
   #> 1     1        1
   #> 2     2        2
   df %>% mutate(df_col = tibble(a, .name_repair = "universal")) %>% collect()
   #> Warning: In tibble(a, .name_repair = "universal"), .name_repair not supported in
   #> Arrow; pulling data into R
   #> # A tibble: 2 × 2
   #>       a df_col$a
   #>   <int>    <int>
   #> 1     1        1
   #> 2     2        2
   df %>% mutate(df_col = data.frame(a, check.rows = TRUE)) %>% collect()
   #> Warning: In data.frame(a, check.rows = TRUE), check.rows not supported in Arrow;
   #> pulling data into R
   #> # A tibble: 2 × 2
   #>       a df_col$a
   #>   <int>    <int>
   #> 1     1        1
   #> 2     2        2
   df %>% mutate(df_col = data.frame(a, row.names = TRUE)) %>% collect()
   #> Warning: In data.frame(a, row.names = TRUE), row.names not supported in Arrow;
   #> pulling data into R
   #> # A tibble: 2 × 2
   #>       a df_col      
   #>   <int> <named list>
   #> 1     1 <NULL>      
   #> 2     2 <NULL>
   ```
   
   <sup>Created on 2021-11-24 by the [reprex package](https://reprex.tidyverse.org) (v2.0.1)</sup>
   
   </details>
   
   I *didn't* add a test for `mutate(df_col = tibble(a2 = c("nested value", "nested value2")))` or `mutate(df_col = two_row_df)` because both give results that surprise me and don't align with what dplyr would give you. I think they should be fixed and tested at the `Scalar$create()` level, not here, but I'm happy to add more tests here given some guidance on the desired behaviour. I opened ARROW-14828 for `Scalar$create()` on a two-row data.frame.
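
For comparison, here is a minimal sketch of the same two calls on a plain in-memory tibble, with no Arrow involved. The names `df_r`, `inline`, and `existing` are only illustrative and do not appear in the PR. As far as I understand dplyr, each row of the nested data frame column keeps its own value here, rather than being turned into a list column or recycled from the first row:

```r
library(dplyr, warn.conflicts = FALSE)

# Plain tibble standing in for the record batch `df` in the reprex above
df_r <- tibble(a = 1:2)
two_row_df <- tibble(a2 = c("nested value", "nested value2"))

# Inline tibble whose column has one value per row
inline <- df_r %>%
  mutate(df_col = tibble(a2 = c("nested value", "nested value2")))

# Existing two-row data frame looked up from the calling environment
existing <- df_r %>% mutate(df_col = two_row_df)

# In both cases the nested column should hold one distinct value per row
inline$df_col$a2
existing$df_col$a2
```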

