Re: [Discuss] If and how we should integrate geospatial data (specs) in Arrow

Max Burke Fri, 25 Jun 2021 14:27:39 -0700

We've been using binary field types in Parquet and Arrow for WKB-formatted
data and we've been finding that it works very well. Having a geospatial
type in Arrow that allowed an optional SRID to be passed along would be
nice but would be more useful if it came with a corresponding Parquet
logical type annotation too.
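
To make the (WKB, SRID) pairing concrete, here is a minimal sketch using only the Python standard library. It is not an Arrow or Parquet API: the names `Geometry`, `point_to_wkb`, and `wkb_to_point` are hypothetical, only 2D points are handled, and the SRID travels next to the WKB bytes rather than inside them.

```python
import struct
from typing import NamedTuple, Optional, Tuple

WKB_POINT = 1  # geometry type code for a 2D Point in the WKB encoding


class Geometry(NamedTuple):
    """Hypothetical (WKB, SRID) pair; an illustration, not an Arrow type."""
    wkb: bytes
    srid: Optional[int] = None  # optional, as suggested in the thread


def point_to_wkb(x: float, y: float) -> bytes:
    # Layout: 1-byte byte-order flag (1 = little-endian),
    # uint32 geometry type, then two float64 coordinates.
    return struct.pack("<BIdd", 1, WKB_POINT, x, y)


def wkb_to_point(wkb: bytes) -> Tuple[float, float]:
    # The first byte says how the remaining fields are encoded.
    fmt = "<" if wkb[0] == 1 else ">"
    geom_type, x, y = struct.unpack(fmt + "Idd", wkb[1:21])
    if geom_type != WKB_POINT:
        raise ValueError(f"expected a Point, got geometry type {geom_type}")
    return x, y


# A point tagged with SRID 4326 (WGS 84 longitude/latitude).
geom = Geometry(wkb=point_to_wkb(-122.4, 37.8), srid=4326)
print(len(geom.wkb), geom.srid, wkb_to_point(geom.wkb))
```

In a binary column the `wkb` bytes would be the stored value; where the SRID lives (per value, per column, or in schema metadata) is exactly the open question in this thread.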


On Fri, Jun 25, 2021 at 12:15 PM M. Edward (Ed) Borasky <zn...@znmeb.net>
wrote:

> I don't know about GeoPandas but in R there are two main in-memory GIS
> data types: the old-ish "sp" format and the new "sf" (simple features)
> format. As an R GIS developer, I would expect any Arrow GIS capability
> to efficiently facilitate "sf" / "tidyverse" operations. See
> https://geocompr.robinlovelace.net/ for the details.
>
> On Fri, Jun 25, 2021 at 11:51 AM Julian Hyde <jhyde.apa...@gmail.com>
> wrote:
> >
> > Cc += geospatial@.
> >
> > I think allowing WKB and WKT is sufficient.
> >
> > Perhaps Geometry could be a composite type (WKT, SRID) or (WKB, SRID).
> > SRID (spatial reference identifier) is almost always needed to qualify a
> > geometry value. It is analogous to how TimeZone is needed (implicitly or
> > explicitly) to qualify a DateTime value.
> >
> > For geospatial queries to perform well, some kind of indexing (and/or
> > clever data organization) is required. Geospatial indexing is very
> > complex, and there is no “one size fits all” approach. So I recommend
> > that Arrow stays out of the indexing business, and leaves indexing to
> > the engine.
> >
> > Julian
> >
> >
> > > On Jun 25, 2021, at 10:17 AM, Mauricio Vargas <mavarga...@uc.cl.INVALID> wrote:
> > >
> > > Dear Jon
> > >
> > > Thanks for sending this. Based on previous projects, WKB works well
> > > with SQLite, DuckDB and others, at the expense of producing larger
> > > columns than PostGIS does.
> > >
> > > For experimentation, the CENSO 2017 shapefiles could be interesting:
> > > https://github.com/ropensci/censo2017-cartografias;
> > > https://github.com/ropensci/censo2017-cartografias/releases/download/v0.4/cartografias-censo2017.zip
> > > These include rivers, streets, etc.
> > >
> > > Provided that Arrow is installed in a very straightforward way (for
> > > Windows, at least), creating something based on PostGIS is probably
> > > not a bad idea, but WKB works OK, and it integrates with the sf
> > > package without problems. I clearly see a great compression advantage
> > > here if we decide to use WKB, as LZ4 should make it very lightweight
> > > compared to, say, a CSV.
> > >
> > > Best,
> > >
> > >
> > > On Fri, Jun 25, 2021 at 1:05 PM Jonathan Keane <jke...@gmail.com> wrote:
> > >
> > >> Hello,
> > >>
> > >> There is an emerging spec[1] for how to store geospatial data in Arrow
> > >> + pass through parquet files in the geopandas world. There is even a
> > >> new R package that implements a wrapper to do the same in R[2]. These
> > >> both define a serialization[3] for storing geospatial data as an Arrow
> > >> table (and thus also when saving to parquet with Arrow).
> > >>
> > >> I could see a number of ways that we might interact with standards
> > >> like these, and for any of these that we pursue it would be good to
> > >> clarify that in our docs:
> > >>
> > >> 1. Point to the standard — we could mention that this standard exists
> > >> and that if someone is building a geospatial data aware application,
> > >> they _could_ refer to this standard if they want to.
> > >> 2. Adopt a/this standard — this could range from stating that we've
> > >> adopted it as the way that spatial data _ought_ to be stored to asking
> > >> the creators if maintaining it within the Arrow project itself would
> > >> be better (either by adopting it or creating a fork — of course
> > >> communication with the folks working on it now would be critical!)
> > >> 3. Create extension type(s) for geospatial data — this would require
> > >> adopting a standard like the one linked, but on top of that providing
> > >> an extension type within Arrow itself that the various clients could
> > >> implement as they saw fit.
> > >> 4. Create new, fully separate type(s) for geospatial data — again,
> > >> this would require adopting a standard of some sort, but we would
> > >> implement it as a specific type and presumably support it in all of
> > >> the clients as we could.
> > >>
> > >> There are of course pros and cons to all of these. This type of data
> > >> *is* somewhat specialized and I don't think we want to have a huge
> > >> profusion of types for all of the possible specialized data types out
> > >> there. But, at a minimum we should acknowledge (or adopt) a standard
> > >> if it exists and encourage implementations that use Arrow to follow
> > >> that standard (like sfarrow does to be compatible with geopandas) so
> > >> that some level of interoperability is there + people aren't needing
> > >> to reinvent the wheel each time they store spatial data.
> > >>
> > >> Thoughts? Are there other projects out there that already do something
> > >> like this with Arrow that we should consider?
> > >>
> > >> [1] https://github.com/geopandas/geo-arrow-spec/pull/2
> > >> [2] https://github.com/wcjochem/sfarrow
> > >> [3] for now they create a binary WKB column + attach a bit of metadata
> > >> to the schema that that's what happened, though there are other ways
> > >> one could encode this and the spec might include other way(s) to store
> > >> this data in the future.
> > >>
> > >> -Jon
> > >>
> > >
> > >
> > > --
> > > —
> > > *Mauricio 'Pachá' Vargas Sepúlveda*
> > > Site: pacha.dev
> > > Blog: pacha.dev/blog
> >
>
>
> --
> Borasky Research Journal https://www.znmeb.mobi
>
> Markovs of the world, unite! You have nothing to lose but your chains!
>


-- 
-Max
