Re: [Discuss] If and how we should integrate geospatial data (specs) in Arrow

Jim Hughes Mon, 28 Jun 2021 09:07:31 -0700

Hi all,

I'd add two points:

First, while indexing is complex, does Arrow maintain/create stats on columns of data? If so, capturing an MBR (minimum bounding rectangle) of the geometries would be awesome to help with very rough filtering/pruning.
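
To make the MBR idea concrete, here is a minimal sketch in plain Python. It is not Arrow's API: the `mbr` and `intersects` helpers and the sample chunks are hypothetical. It assumes each chunk of a geometry column is summarized by one bounding box that a reader can test against a query window before decoding any geometries:

```python
from typing import Iterable, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def mbr(points: Iterable[Point]) -> Box:
    """Return the minimum bounding rectangle of a non-empty set of points."""
    xs, ys = zip(*points)
    return (min(xs), min(ys), max(xs), max(ys))


def intersects(a: Box, b: Box) -> bool:
    """True when two boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


# Hypothetical chunks of a geometry column, each flattened to its vertices.
chunks = {
    "chunk-0": [(0.0, 0.0), (2.0, 1.0), (1.0, 3.0)],
    "chunk-1": [(10.0, 10.0), (12.0, 11.0)],
}
stats = {name: mbr(pts) for name, pts in chunks.items()}

query: Box = (0.5, 0.5, 1.5, 1.5)
# Only chunks whose MBR intersects the query window need to be read.
candidates = [name for name, box in stats.items() if intersects(box, query)]
```

This is only rough pruning: a chunk that passes the box test may still hold no matching geometry, so an exact test would still be needed afterwards.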

Second, I'm very keen to understand the connections between Arrow and Parquet that the spec PR discusses. I work on GeoMesa, where we've added spatial support to Arrow, Parquet, Orc, and Spark. Spatial dataframes in Spark can be saved to Parquet and Orc. I mention that to point out that the "round trip" consideration can be extended to at least these four Apache projects.

Recently, I've seen another developer starting from scratch on adding Parquet support to Sedona and another group interested in sorting out a "GeoParquet" format. It seems like we ought to sort out a place to discuss this.

Would the geospat...@apache.org mailing list be a good place to discuss things?


Cheers,

Jim

On 6/25/2021 2:50 PM, Julian Hyde wrote:

Cc += geospatial@.

I think allowing WKB and WKT is sufficient.

Perhaps Geometry could be a composite type (WKT, SRID) or (WKB, SRID). SRID 
(spatial reference identifier) is almost always needed to qualify a geometry 
value. It is analogous to how TimeZone is needed (implicitly or explicitly) to 
qualify a DateTime value.
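
A minimal sketch of the (WKB, SRID) pairing, using only the Python standard library. The `Geometry` tuple and both helper functions are hypothetical names for illustration, not part of Arrow or of any spec, and only 2D points are handled:

```python
import struct
from typing import NamedTuple, Tuple


class Geometry(NamedTuple):
    """Hypothetical composite value: a WKB payload plus its SRID."""
    wkb: bytes
    srid: int


def point_wkb(x: float, y: float) -> bytes:
    """Encode a 2D point as little-endian WKB.

    Layout: 1-byte order flag (1 = little-endian), uint32 geometry
    type (1 = Point), then x and y as float64.
    """
    return struct.pack("<BIdd", 1, 1, x, y)


def decode_point(geom: Geometry) -> Tuple[float, float, int]:
    """Decode a little-endian WKB point back to (x, y, srid)."""
    order, gtype, x, y = struct.unpack("<BIdd", geom.wkb)
    if order != 1 or gtype != 1:
        raise ValueError("only little-endian WKB points are handled here")
    return (x, y, geom.srid)


# SRID 4326 is WGS 84; coordinates are given as (longitude, latitude) here.
santiago = Geometry(wkb=point_wkb(-70.65, -33.45), srid=4326)
```

The WKB bytes alone carry no reference system, which is why the SRID travels alongside them in the tuple.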

For geospatial queries to perform well, some kind of indexing (and/or
clever data organization) is required. Geospatial indexing is very complex,
and there is no “one size fits all” approach. So I recommend that Arrow stay
out of the indexing business and leave indexing to the engine.

Julian

On Jun 25, 2021, at 10:17 AM, Mauricio Vargas <mavarga...@uc.cl.INVALID> wrote:

Dear Jon

Thanks for sending this. Based on previous projects, WKB works well with
SQLite, DuckDB and others, at the expense of creating larger columns
compared to PostGIS.

For experimentation, it could be interesting to use the CENSO 2017
shapefiles: https://github.com/ropensci/censo2017-cartografias;
https://github.com/ropensci/censo2017-cartografias/releases/download/v0.4/cartografias-censo2017.zip
These include rivers, streets, etc.

Given that Arrow installs in a very straightforward way (on
Windows, at least), creating something based on PostGIS is probably not a
bad idea, but WKB works OK, and it integrates with no problems with the SF
package. I clearly see a great compression advantage here if we decide to
use WKB, as LZ4 should make it very lightweight compared to, say, a CSV.

Best,

On Fri, Jun 25, 2021 at 1:05 PM Jonathan Keane <jke...@gmail.com> wrote:

Hello,

There is an emerging spec[1] for how to store geospatial data in Arrow
+ pass through parquet files in the geopandas world. There is even a
new R package that implements a wrapper to do the same in R[2]. These
both define a serialization[3] for storing geospatial data as an Arrow
table (and thus also when saving to parquet with Arrow).

I could see a number of ways that we might interact with standards
like these, and for any of these that we pursue it would be good to
clarify that in our docs:

1. Point to the standard — we could mention that this standard exists
and that if someone is building a geospatial data aware application,
they _could_ refer to this standard if they want to.
2. Adopt a/this standard — this could range from stating that we've
adopted it as the way that spatial data _ought_ to be stored to asking
the creators if maintaining it within the Arrow project itself would
be better (either by adopting it or creating a fork — of course
communication with the folks working on it now would be critical!)
3. Create extension type(s) for geospatial data — this would require
adopting a standard like the one linked, but on top of that providing
an extension type within Arrow itself that the various clients could
implement as they saw fit.
4. Create new, fully separate type(s) for geospatial data — again,
this would require adopting a standard of some sort, but we would
implement it as a specific type and presumably support it in all of
the clients as we could.

There are of course pros and cons to all of these. This type of data
*is* somewhat specialized and I don't think we want to have a huge
profusion of types for all of the possible specialized data types out
there. But, at a minimum we should acknowledge (or adopt) a standard
if it exists and encourage implementations that use Arrow to follow
that standard (like sfarrow does to be compatible with geopandas) so
that some level of interoperability is there + people aren't needing
to reinvent the wheel each time they store spatial data.

Thoughts? Are there other projects out there that already do something
like this with Arrow that we should consider?

[1] https://github.com/geopandas/geo-arrow-spec/pull/2
[2] https://github.com/wcjochem/sfarrow
[3] for now they create a binary WKB column + attach a bit of metadata
to the schema that that's what happened, though there are other ways
one could encode this and the spec might include other way(s) to store
this data in the future.

-Jon


--
—
*Mauricio 'Pachá' Vargas Sepúlveda*
Site: pacha.dev
Blog: pacha.dev/blog
