[ 
https://issues.apache.org/jira/browse/ARROW-16670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17542659#comment-17542659
 ] 

Weston Pace edited comment on ARROW-16670 at 5/26/22 8:33 PM:
--------------------------------------------------------------

{quote}I wonder if ignoring the R metadata for query engine output would be a 
better strategy. If it's not the default, it would be nice to provide an escape 
hatch for users or developers that find themselves in this position with no 
workaround.{quote}


This would be my assumption.  The query engine has no idea what the metadata 
means and does not really make any attempt to preserve it.

Sometimes users are doing something like rewriting a file with a different 
chunk size or repartitioning a dataset.  In such cases it can make sense to 
preserve the original metadata.  However, I think the best solution for that 
is to reattach the metadata after the data has gone through the query engine.  
The write/sink nodes should have options to attach custom metadata.  We can 
expand on these as needed.
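
As a rough illustration (not tested; it assumes the arrow and dplyr packages 
and reuses the toy data frame from the description below), the 
reattach-after-the-query workaround could look like:

{code:R}
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

df <- data.frame(int_col = 1:5)
attr(df$int_col, "some_attr") <- "some_value"

tf <- tempfile()
write_dataset(df, tf)

# Run the query without relying on automatic metadata restoration,
# then reattach the attribute by hand on the collected result.
result <- open_dataset(tf) %>%
  filter(int_col > 2) %>%
  collect()
attr(result$int_col, "some_attr") <- attr(df$int_col, "some_attr")
{code}

If the write/sink nodes grow an option to attach custom metadata, the manual 
reattachment step could move into the write itself.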



> [R] Behaviour of R-specific key/value metadata in the query engine
> ------------------------------------------------------------------
>
>                 Key: ARROW-16670
>                 URL: https://issues.apache.org/jira/browse/ARROW-16670
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>            Reporter: Dewey Dunnington
>            Priority: Major
>
> In ARROW-16607 there are some changes to metadata handling in the 
> {{arrow_dplyr_query}}. With extension type support, more column types (like 
> sf::sfc) can be supported, and with growing support for column types comes a 
> greater chance that our current restore-metadata-by-default policy will 
> cause difficult-to-work-around errors. The latest one I have run across is 
> this one:
> {code:R}
> library(arrow, warn.conflicts = FALSE)
> library(dplyr, warn.conflicts = FALSE)
> # required for write_dataset(nc) to work
> # remotes::install_github("paleolimbot/geoarrow")
> library(geoarrow)
> library(sf)
> #> Linking to GEOS 3.9.1, GDAL 3.4.2, PROJ 8.2.1; sf_use_s2() is TRUE
> nc <- read_sf(system.file("shape/nc.shp", package = "sf"))
> tf <- tempfile()
> write_dataset(nc, tf)
> open_dataset(tf) %>% 
>   select(NAME, FIPS) %>% 
>   collect()
> #> Error in st_geometry.sf(x): attr(obj, "sf_column") does not point to a geometry column.
> #> Did you rename it, without setting st_geometry(obj) <- "newname"?
> {code}
> This causes an error because the restored class has assumptions about the 
> contents of the data frame that we can't necessarily know about (or would 
> have to hard code for every data frame subclass).
> I can see why {{arrow::write_parquet()}} and {{arrow::read_parquet()}} (and 
> feather, ipc_stream) might want to do this to faithfully roundtrip a data 
> frame, and because the write/read roundtrip (usually) involves the same 
> columns and the same rows, it's probably safe to restore metadata by default.
>  The query engine does a lot of transformations that can break assumptions 
> like the one I've shown above (where sf expects a certain column to exist and 
> errors otherwise in a way that the user can't work around). Rather than 
> hard-code the assumptions of every data.frame and vector subclass, I wonder 
> if ignoring the R metadata for query engine output would be a better 
> strategy. If it's not the default, it would be nice to provide an escape 
> hatch for users or developers that find themselves in this position with no 
> workaround.
> With the addition of the vctrs extension type, there is a route to preserve 
> attributes through the query engine (although it's a bit verbose). We could 
> make it easier to do (e.g., by interpreting {{I()}} or {{rlang::box()}} in 
> some way).
> {code:R}
> library(arrow, warn.conflicts = FALSE)
> library(dplyr, warn.conflicts = FALSE)
> df <- data.frame(int_col = 1:5)
> attr(df$int_col, "some_attr") <- "some_value"
> tf <- tempfile()
> # attributes dropped when column is renamed
> write_dataset(df, tf)
> open_dataset(tf) %>% 
>   select(other_int_col = int_col) %>% 
>   collect() %>% 
>   pull()
> #> [1] 1 2 3 4 5
> # attributes preserved when column is renamed
> table <- arrow_table(int_col = vctrs_extension_array(df$int_col))
> write_dataset(table, tf)
> open_dataset(tf) %>% 
>   select(other_int_col = int_col) %>% 
>   collect() %>% 
>   pull()
> #> [1] 1 2 3 4 5
> #> attr(,"some_attr")
> #> [1] "some_value"
> {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
