[ 
https://issues.apache.org/jira/browse/ARROW-10386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17331668#comment-17331668
 ] 

Jared Lander commented on ARROW-10386:
--------------------------------------

OK, I have actually reproduced a similar issue. In the following code, I create 
an sf object and write it as a partitioned dataset of parquet files. I then call 
open_dataset() on those files.

If I collect() the dataset, I get back an sf object, no problem.

But if I first filter() the dataset and then collect(), I get an error.
{code:R}
library(sf)
library(arrow)
library(dplyr)

n <- 10000

fake <- tibble(
    ID=seq(n),
    Date=sample(seq(as.Date('2019-01-01'), as.Date('2021-04-01'), by=1), 
                size=n, replace=TRUE),
    x=runif(n=n, min=-170, max=170),
    y=runif(n=n, min=-60, max=70),
    text1=sample(x=state.name, size=n, replace=TRUE),
    text2=sample(x=state.name, size=n, replace=TRUE),
    text3=sample(x=state.division, size=n, replace=TRUE),
    text4=sample(x=state.region, size=n, replace=TRUE),
    text5=sample(x=state.abb, size=n, replace=TRUE),
    num1=sample(x=state.center$x, size=n, replace=TRUE),
    num2=sample(x=state.center$y, size=n, replace=TRUE),
    num3=sample(x=state.area, size=n, replace=TRUE),
    Rand1=rnorm(n=n),
    Rand2=rnorm(n=n, mean=100, sd=3),
    Rand3=rbinom(n=n, size=10, prob=0.4)
)

# make it into an sf object
spat <- fake %>% 
    st_as_sf(coords=c('x', 'y'), remove=FALSE, crs = 4326)

class(spat)
class(spat$geometry)

# create new columns for partitioning and write to disk
spat %>% 
    mutate(Year=lubridate::year(Date), Month=lubridate::month(Date)) %>% 
    group_by(Year, Month) %>% 
    write_dataset('data/splits/', format='parquet')

spat_in <- open_dataset('data/splits/')

class(spat_in)

# it's an sf as expected
spat_in %>% collect() %>% class()
spat_in %>% collect() %>% pull(geometry) %>% class()

# it even plots
leaflet::leaflet() %>% 
    leaflet::addTiles() %>% 
    leafgl::addGlPoints(data=spat_in %>% collect())

# but if we filter first
spat_in %>% 
    filter(Year == 2020 & Month == 2) %>% 
    collect()

# we get this error
#> Error in st_geometry.sf(x) : 
#>   attr(obj, "sf_column") does not point to a geometry column.
#> Did you rename it, without setting st_geometry(obj) <- "newname"?
#> In addition: Warning message:
#> Invalid metadata$r 
{code}

> [R] List column class attributes not preserved in roundtrip
> -----------------------------------------------------------
>
>                 Key: ARROW-10386
>                 URL: https://issues.apache.org/jira/browse/ARROW-10386
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 2.0.0
>         Environment: Mac OS 10.15.7
> R 4.0.2
> arrow 2.0
> sf 0.9-6
>            Reporter: Petr Bouchal
>            Assignee: Romain Francois
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.0.0
>
>          Time Spent: 4h 20m
>  Remaining Estimate: 0h
>
> Hi all - thanks for the improvement addressed in ARROW-9271.
> In arrow 2.0, spatial data (class sf) now retains metadata at the column level 
> but still does not roundtrip correctly, because metadata (attributes) are lost 
> at the level of individual elements of the list-columns. (At least I think 
> that is the problem, as that is where I can see changes in the metadata.) Is 
> this something that is addressable?
> See reprex below on what happens + what attributes exist at the element level.
> FWIW a workaround with spatial data using sf would be to convert to WKT 
> before writing it out (sf::st_as_text()). It might be useful to note this 
> somewhere in the docs.
> This is using arrow 2.0 and sf 0.9-6.
> Reproducible example:
> {code:R}
>  library(arrow)
>  #> 
>  #> Attaching package: 'arrow'
>  #> The following object is masked from 'package:utils':
>  #> 
>  #> timestamp
>  library(sf)
>  #> Linking to GEOS 3.8.1, GDAL 3.1.1, PROJ 6.3.1
> fname <- system.file("shape/nc.shp", package="sf")
>  df_spatial <- st_read(fname)
>  #> Reading layer `nc' from data source `/Users/petr/Library/R/4.0/library/sf/shape/nc.shp' using driver `ESRI Shapefile'
>  #> Simple feature collection with 100 features and 14 fields
>  #> geometry type: MULTIPOLYGON
>  #> dimension: XY
>  #> bbox: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
>  #> geographic CRS: NAD27
> write_parquet(df_spatial, "spatial.parquet")
>  roundtripped <- read_parquet("spatial.parquet")
>  roundtripped
>  #> Simple feature collection with 100 features and 14 fields
>  #> geometry type: MULTIPOLYGON
>  #> dimension: arrow_list
>  #> bbox: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
>  #> geographic CRS: NAD27
>  #> First 10 features:
>  #> Error in vapply(lst, class, rep(NA_character_, 3)): values must be length 3,
>  #> but FUN(X[[1]]) result is length 1
> attributes(roundtripped$geometry[[1]])
>  #> $class
>  #> [1] "arrow_list" "vctrs_list_of" "vctrs_vctr" "list" 
>  #> 
>  #> $ptype
>  #> <list<double>[0]>
> attributes(df_spatial$geometry[[1]])
>  #> $class
>  #> [1] "XY" "MULTIPOLYGON" "sfg"
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
