[ https://issues.apache.org/jira/browse/ARROW-12542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Keane closed ARROW-12542.
----------------------------------
    Resolution: Duplicate

> [R] SF columns in datasets with filters
> ---------------------------------------
>
>                 Key: ARROW-12542
>                 URL: https://issues.apache.org/jira/browse/ARROW-12542
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 4.0.0
>            Reporter: Jonathan Keane
>            Assignee: Jonathan Keane
>            Priority: Critical
>              Labels: pull-request-available
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> First reported at
> https://issues.apache.org/jira/browse/ARROW-10386?focusedCommentId=17331668&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17331668
>
> OK, I actually have recreated a similar issue. In the following code, I
> create an sf object and write it as a dataset to parquet files. I then call
> open_dataset() on the files.
>
> If I collect() the dataset I get back an sf object, no problem.
>
> But if I first filter() the dataset then collect() I get an error.
>
> {code:r}
> library(sf)
> library(arrow)
> library(dplyr)
>
> n <- 10000
> fake <- tibble(
>     ID=seq(n),
>     Date=sample(seq(as.Date('2019-01-01'), as.Date('2021-04-01'), by=1),
>                 size=n, replace=TRUE),
>     x=runif(n=n, min=-170, max=170),
>     y=runif(n=n, min=-60, max=70),
>     text1=sample(x=state.name, size=n, replace=TRUE),
>     text2=sample(x=state.name, size=n, replace=TRUE),
>     text3=sample(x=state.division, size=n, replace=TRUE),
>     text4=sample(x=state.region, size=n, replace=TRUE),
>     text5=sample(x=state.abb, size=n, replace=TRUE),
>     num1=sample(x=state.center$x, size=n, replace=TRUE),
>     num2=sample(x=state.center$y, size=n, replace=TRUE),
>     num3=sample(x=state.area, size=n, replace=TRUE),
>     Rand1=rnorm(n=n),
>     Rand2=rnorm(n=n, mean=100, sd=3),
>     Rand3=rbinom(n=n, size=10, prob=0.4)
> )
>
> # make it into an sf object
> spat <- fake %>%
>     st_as_sf(coords=c('x', 'y'), remove=FALSE, crs = 4326)
> class(spat)
> class(spat$geometry)
>
> # create new columns for partitioning and write to disk
> spat %>%
>     mutate(Year=lubridate::year(Date), Month=lubridate::month(Date)) %>%
>     group_by(Year, Month) %>%
>     write_dataset('data/splits/', format='parquet')
>
> spat_in <- open_dataset('data/splits/')
> class(spat_in)
>
> # it's an sf as expected
> spat_in %>% collect() %>% class()
> spat_in %>% collect() %>% pull(geometry) %>% class()
>
> # it even plots
> leaflet::leaflet() %>%
>     leaflet::addTiles() %>%
>     leafgl::addGlPoints(data=spat_in %>% collect())
>
> # but if we filter first
> spat_in %>%
>     filter(Year == 2020 & Month == 2) %>%
>     collect()
>
> # we get this error
> Error in st_geometry.sf(x) :
>   attr(obj, "sf_column") does not point to a geometry column.
> Did you rename it, without setting st_geometry(obj) <- "newname"?
> In addition: Warning message:
> Invalid metadata$r
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)