thisisnic commented on PR #46431:
URL: https://github.com/apache/arrow/pull/46431#issuecomment-2906640189

   @amoeba I tried the approach you suggested here but because we use 
`as_arrow_table()` internally in a lot more functions, we end up breaking 
roundtripping with Feather etc. 
   
   I think if we work only in R, we would want to remove the labels and then 
restore them later, but I'm still trying to find an uncomplicated way of doing 
this.
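   
   For the R-only case, the remove-then-restore idea could look roughly like 
the sketch below. `strip_labels()` and `restore_labels()` are hypothetical 
helpers, not arrow or haven functions, and the sketch assumes the columns keep 
their names and underlying type between the two calls.
   
   ```r
   library(haven)
   library(tibble)
   
   # Hypothetical helper: save each labelled column's attributes, then
   # replace the column with its bare underlying vector.
   strip_labels <- function(df) {
     saved <- list()
     for (nm in names(df)) {
       if (is.labelled(df[[nm]])) {
         saved[[nm]] <- attributes(df[[nm]])
         bare <- df[[nm]]
         attributes(bare) <- NULL
         df[[nm]] <- bare
       }
     }
     list(data = df, saved = saved)
   }
   
   # Hypothetical helper: reattach saved attributes to columns that still exist.
   restore_labels <- function(df, saved) {
     for (nm in intersect(names(saved), names(df))) {
       attributes(df[[nm]]) <- saved[[nm]]
     }
     df
   }
   
   d <- tibble(
     a = labelled(1:5, labels = c(low = 1L, high = 5L), label = "Score"),
     b = labelled(11:15)
   )
   
   stripped <- strip_labels(d)
   # Work on plain integer columns, then put the labels back.
   filtered <- stripped$data[stripped$data$a > 3, ]
   restored <- restore_labels(filtered, stripped$saved)
   ```
   
   This would not be safe if a step in between changes a column's type, 
since the saved attributes would then be reattached to a different kind of 
vector.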
   
   I think we definitely want to stop the segfault regardless and error instead.
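   
   As an illustration of "error instead", a guard along these lines could 
fail early with a regular R error. `assert_no_labelled()` is a hypothetical 
name, and where such a check would actually belong in arrow is an open 
question, so treat this as a sketch only.
   
   ```r
   library(haven)
   library(tibble)
   
   # Hypothetical guard, not part of arrow: raise a regular R error when any
   # column is haven_labelled, before the data reaches a compute function.
   assert_no_labelled <- function(df) {
     is_lbl <- vapply(df, is.labelled, logical(1))
     if (any(is_lbl)) {
       stop(
         "Columns with class haven_labelled are not supported here: ",
         paste(names(df)[is_lbl], collapse = ", "),
         call. = FALSE
       )
     }
     invisible(df)
   }
   
   d <- tibble(a = labelled(1:5), b = 11:15)
   
   # Capture the error message instead of aborting the script.
   res <- tryCatch(assert_no_labelled(d), error = function(e) conditionMessage(e))
   ```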
   
   Users technically can use `mutate()` to change the type to something we can 
work with, *but* there'll be resource costs with doing this on a dataset.  See 
my reprex below.
   
   ``` r
   library(haven)
   library(arrow)
   library(tibble)
   library(dplyr)
   
   d <- tibble(
     a = labelled(x = 1:5),
     b = labelled(x = 11:15)
   )
   
   tf <- tempfile()
   write_parquet(d, tf)
   
   # still fails
   read_parquet(tf, as_data_frame = FALSE) %>%
     filter(a > 3) %>%
     collect()
   #> Error in `compute.arrow_dplyr_query()`:
   #> ! NotImplemented: Function 'greater' has no kernel matching input types (<labelled<integer>[0]>, <labelled<integer>[0]>)
   ```
   
   ``` r
   
   tf <- tempfile()
   write_parquet(d, tf)
   
   # works
   read_parquet(tf, as_data_frame = FALSE) %>%
     mutate(a = as.integer(a)) %>%
     filter(a > 3) %>%
     collect()
   #> # A tibble: 2 × 2
   #>       a b        
   #>   <int> <int+lbl>
   #> 1     4 14       
   #> 2     5 15
   ```
   
   ``` r
   
   # fails
   open_dataset(tf) %>%
     mutate(a = as.integer(a)) %>%
     filter(a > 3) %>%
     collect()
   #> Error in `compute.arrow_dplyr_query()`:
   #> ! NotImplemented: Function 'greater_equal' has no kernel matching input types (<labelled<integer>[0]>, <labelled<integer>[0]>)
   ```
   
   ``` r
   
   # works, but compute() evaluates the query into memory before filtering,
   # so resource usage is potentially higher
   open_dataset(tf) %>%
     mutate(a = as.integer(a)) %>%
     compute() %>%
     filter(a > 3) %>%
     collect()
   #> # A tibble: 2 × 2
   #>       a b        
   #>   <int> <int+lbl>
   #> 1     4 14       
   #> 2     5 15
   ```
   


