[ https://issues.apache.org/jira/browse/ARROW-10386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17331668#comment-17331668 ]
Jared Lander commented on ARROW-10386:
--------------------------------------

OK, I actually have recreated a similar issue.

In the following code, I create an sf object and write it as a dataset to parquet files. I then call open_dataset() on the files. If I collect() the dataset I get back an sf object, no problem. But if I first filter() the dataset then collect() I get an error.

{code:java}
library(sf)
library(arrow)
library(dplyr)

n <- 10000
fake <- tibble(
    ID=seq(n),
    Date=sample(seq(as.Date('2019-01-01'), as.Date('2021-04-01'), by=1), size=n, replace=TRUE),
    x=runif(n=n, min=-170, max=170),
    y=runif(n=n, min=-60, max=70),
    text1=sample(x=state.name, size=n, replace=TRUE),
    text2=sample(x=state.name, size=n, replace=TRUE),
    text3=sample(x=state.division, size=n, replace=TRUE),
    text4=sample(x=state.region, size=n, replace=TRUE),
    text5=sample(x=state.abb, size=n, replace=TRUE),
    num1=sample(x=state.center$x, size=n, replace=TRUE),
    num2=sample(x=state.center$y, size=n, replace=TRUE),
    num3=sample(x=state.area, size=n, replace=TRUE),
    Rand1=rnorm(n=n),
    Rand2=rnorm(n=n, mean=100, sd=3),
    Rand3=rbinom(n=n, size=10, prob=0.4)
)

# make it into an sf object
spat <- fake %>%
    st_as_sf(coords=c('x', 'y'), remove=FALSE, crs = 4326)

class(spat)
class(spat$geometry)

# create new columns for partitioning and write to disk
spat %>%
    mutate(Year=lubridate::year(Date), Month=lubridate::month(Date)) %>%
    group_by(Year, Month) %>%
    write_dataset('data/splits/', format='parquet')

spat_in <- open_dataset('data/splits/')
class(spat_in)

# it's an sf as expected
spat_in %>% collect() %>% class()
spat_in %>% collect() %>% pull(geometry) %>% class()

# it even plots
leaflet::leaflet() %>%
    leaflet::addTiles() %>%
    leafgl::addGlPoints(data=spat_in %>% collect())

# but if we filter first
spat_in %>%
    filter(Year == 2020 & Month == 2) %>%
    collect()

# we get this error
Error in st_geometry.sf(x) :
  attr(obj, "sf_column") does not point to a geometry column.
Did you rename it, without setting st_geometry(obj) <- "newname"?
In addition: Warning message:
Invalid metadata$r
{code}

> [R] List column class attributes not preserved in roundtrip
> -----------------------------------------------------------
>
> Key: ARROW-10386
> URL: https://issues.apache.org/jira/browse/ARROW-10386
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Affects Versions: 2.0.0
> Environment: Mac OS 10.15.7
> R 4.0.2
> arrow 2.0
> sf 0.9-6
> Reporter: Petr Bouchal
> Assignee: Romain Francois
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.0.0
>
> Time Spent: 4h 20m
> Remaining Estimate: 0h
>
> Hi all - thanks for the improvement addressed in ARROW-9271.
> In arrow 2.0 spatial data (class sf) now retains metadata at column level,
> but still does not roundtrip correctly as metadata (attributes) are lost at
> the level of individual elements of the list-columns (at least I think that
> is the problem, as that is where I can see changes in the metadata). Is this
> something that is addressable?
> See reprex below on what happens + what attributes exist at the element level.
> FWIW a workaround with spatial data using sf would be to convert to WKT
> before writing it out (sf::st_as_text()). It might be useful to note this
> somewhere in the docs.
> This is using arrow 2.0 and sf 0.9-6.
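
For reference, the WKT workaround mentioned in the quoted description could look roughly like the sketch below. It is not taken from the issue: the sample points, the column name wkt, the temporary file path, and digits = 15 are all illustrative assumptions. The geometry list-column is swapped for a plain character column before writing, so no per-element class attributes need to survive the round trip.

```r
library(sf)
library(arrow)

# Small illustrative sf object (made-up coordinates, assumed CRS 4326)
pts <- st_as_sf(
    data.frame(id = 1:3, x = c(-73.9, 2.35, 139.7), y = c(40.7, 48.85, 35.7)),
    coords = c("x", "y"),
    crs = 4326
)

# Replace the sfc list-column with a character column of WKT strings.
# digits = 15 is an assumption about the precision needed; WKT text is
# not guaranteed to reproduce coordinates bit for bit.
plain <- st_drop_geometry(pts)
plain$wkt <- st_as_text(st_geometry(pts), digits = 15)

path <- tempfile(fileext = ".parquet")
write_parquet(plain, path)

# Read back and rebuild the geometry from the WKT column.
# WKT does not carry the CRS, so it has to be supplied again.
back <- as.data.frame(read_parquet(path))
restored <- st_as_sf(back, wkt = "wkt", crs = 4326)

class(restored)
class(st_geometry(restored)[[1]])
```

After this the geometry column keeps the name wkt; it can be renamed with st_geometry(restored) <- "geometry" if the original name matters.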
> Reproducible example:
> {code:R}
> library(arrow)
> #>
> #> Attaching package: 'arrow'
> #> The following object is masked from 'package:utils':
> #>
> #> timestamp
> library(sf)
> #> Linking to GEOS 3.8.1, GDAL 3.1.1, PROJ 6.3.1
> fname <- system.file("shape/nc.shp", package="sf")
> df_spatial <- st_read(fname)
> #> Reading layer `nc' from data source `/Users/petr/Library/R/4.0/library/sf/shape/nc.shp' using driver `ESRI Shapefile'
> #> Simple feature collection with 100 features and 14 fields
> #> geometry type: MULTIPOLYGON
> #> dimension: XY
> #> bbox: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
> #> geographic CRS: NAD27
> write_parquet(df_spatial, "spatial.parquet")
> roundtripped <- read_parquet("spatial.parquet")
> roundtripped
> #> Simple feature collection with 100 features and 14 fields
> #> geometry type: MULTIPOLYGON
> #> dimension: arrow_list
> #> bbox: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
> #> geographic CRS: NAD27
> #> First 10 features:
> #> Error in vapply(lst, class, rep(NA_character_, 3)): values must be length 3,
> #> but FUN(X[[1]]) result is length 1
> attributes(roundtripped$geometry[[1]])
> #> $class
> #> [1] "arrow_list" "vctrs_list_of" "vctrs_vctr" "list"
> #>
> #> $ptype
> #> <list<double>[0]>
> attributes(df_spatial$geometry[[1]])
> #> $class
> #> [1] "XY" "MULTIPOLYGON" "sfg"
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)