Jonathan Keane created ARROW-12542:
--------------------------------------

             Summary: [R] SF columns in datasets with filters
                 Key: ARROW-12542
                 URL: https://issues.apache.org/jira/browse/ARROW-12542
             Project: Apache Arrow
          Issue Type: Bug
          Components: R
            Reporter: Jonathan Keane
            Assignee: Jonathan Keane

First reported at https://issues.apache.org/jira/browse/ARROW-10386?focusedCommentId=17331668&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17331668

OK, I actually have recreated a similar issue. In the following code, I create an sf object and write it as a dataset to parquet files. I then call open_dataset() on the files. If I collect() the dataset I get back an sf object, no problem. But if I first filter() the dataset then collect() I get an error.

{code:r}
library(sf)
library(arrow)
library(dplyr)

n <- 10000

fake <- tibble(
    ID=seq(n),
    Date=sample(seq(as.Date('2019-01-01'), as.Date('2021-04-01'), by=1), size=n, replace=TRUE),
    x=runif(n=n, min=-170, max=170),
    y=runif(n=n, min=-60, max=70),
    text1=sample(x=state.name, size=n, replace=TRUE),
    text2=sample(x=state.name, size=n, replace=TRUE),
    text3=sample(x=state.division, size=n, replace=TRUE),
    text4=sample(x=state.region, size=n, replace=TRUE),
    text5=sample(x=state.abb, size=n, replace=TRUE),
    num1=sample(x=state.center$x, size=n, replace=TRUE),
    num2=sample(x=state.center$y, size=n, replace=TRUE),
    num3=sample(x=state.area, size=n, replace=TRUE),
    Rand1=rnorm(n=n),
    Rand2=rnorm(n=n, mean=100, sd=3),
    Rand3=rbinom(n=n, size=10, prob=0.4)
)

# make it into an sf object
spat <- fake %>%
    st_as_sf(coords=c('x', 'y'), remove=FALSE, crs = 4326)

class(spat)
class(spat$geometry)

# create new columns for partitioning and write to disk
spat %>%
    mutate(Year=lubridate::year(Date), Month=lubridate::month(Date)) %>%
    group_by(Year, Month) %>%
    write_dataset('data/splits/', format='parquet')

spat_in <- open_dataset('data/splits/')

class(spat_in)

# it's an sf as expected
spat_in %>% collect() %>% class()
spat_in %>% collect() %>% pull(geometry) %>% class()

# it even plots
leaflet::leaflet() %>%
    leaflet::addTiles() %>%
    leafgl::addGlPoints(data=spat_in %>% collect())

# but if we filter first
spat_in %>%
    filter(Year == 2020 & Month == 2) %>%
    collect()

# we get this error
Error in st_geometry.sf(x) :
  attr(obj, "sf_column") does not point to a geometry column.
Did you rename it, without setting st_geometry(obj) <- "newname"?
In addition: Warning message:
Invalid metadata$r
{code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)