Hi Dmytro

Thank you for looking through the proposal; I am excited to hear from you
all!  I am not a 'geo expert' and I will definitely need to pull in Jia Yu
for some of these points.

Although most calculations are done in the query engine, the Iceberg reference
implementations (i.e., Java, Python) do have to support a few calculations
to handle filter push down:

   1. push down of the proposed Geospatial predicates ST_COVERS,
   ST_COVERED_BY, and ST_INTERSECTS
   2. evaluation of the proposed Geospatial partition transform XZ2.  As you
   may have seen, this was chosen as it is the only standard one today that
   solves the 'boundary object' problem while still preserving a 1-to-1
   mapping of row => partition value.
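
To make the first point a bit more concrete, here is a minimal sketch of how a reference implementation might prune data files for these three functions. It assumes each file carries a bounding box of its geometries and that edges are planar. The names `BBox` and `may_match` are hypothetical and are not part of Iceberg or the proposal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BBox:
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def intersects(self, other: "BBox") -> bool:
        # True when the two boxes share at least one point.
        return (self.xmin <= other.xmax and other.xmin <= self.xmax
                and self.ymin <= other.ymax and other.ymin <= self.ymax)

    def contains(self, other: "BBox") -> bool:
        # True when `other` lies entirely inside this box.
        return (self.xmin <= other.xmin and other.xmax <= self.xmax
                and self.ymin <= other.ymin and other.ymax <= self.ymax)


def may_match(predicate: str, file_bbox: BBox, query_bbox: BBox) -> bool:
    """Return False only when no row in the file can satisfy
    predicate(geom, query); True means the file must still be read."""
    if predicate == "ST_COVERS":
        # A geometry that covers the query has a bbox containing the query's
        # bbox, and the file bbox contains every row's bbox.
        return file_bbox.contains(query_bbox)
    if predicate in ("ST_INTERSECTS", "ST_COVERED_BY"):
        # Any matching row must at least overlap the query's bbox.
        return file_bbox.intersects(query_bbox)
    return True  # unknown predicate: never skip the file
```

The check is deliberately conservative: it can only rule files out, and the query engine still evaluates the exact predicate on the rows that remain.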

This is the primary rationale for choosing the values, as these calculations
were implemented in the GeoLake and Havasu projects (Iceberg forks that sparked
the proposal) based on the Geometry type (edge=planar,
crs=OGC:CRS84/SRID=4326).

> 2. As you mentioned [2] in the proposal there are difficulties with
> supporting the full PROJJSON specification of the SRS. From our experience
> most of the use-cases do not require the full definition of the SRS, in
> fact that definition is only needed when converting between coordinate
> systems. On the other hand, it’s often needed to check whether two geometry
> columns have the same coordinate system, for example when joining two
> columns from different data providers.
>
> To address this we would like to propose including the option to specify
> the SRS with only a SRID in phase 1. The query engine may choose to treat
> it as an opaque identifier or look it up in the EPSG database, if
> supported.
>

The way to specify the CRS definition is actually taken from GeoParquet [1].  I
think we are not bound to follow it if there are better options.  I feel we
might need to at least list the supported configurations in the spec,
though.  There is some conversation about this in the doc [2].
Basically:

   1. XZ2 assumes planar edges.  This is a feature of the algorithm, based
   on the original paper.  A possible solution for spherical edges is proposed
   by Michael Entin here: [3]; please feel free to evaluate it.
   2. XZ2 needs to know the coordinate range.  According to Jia's comments,
   this requires parsing the CRS.  Can it be done with the SRID alone?
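
To show where the coordinate range enters, here is a simplified sketch of an XZ2-style index, loosely following the XZ-ordering idea of enlarged quadtree cells. It is not the reference code from the proposal, GeoLake, or Havasu. The `BOUNDS_BY_SRID` table, the function name `xz2_index`, and the resolution `g=12` are all assumptions for illustration; the table stands in for whatever answers the "SRID alone" question.

```python
import math

# Hypothetical SRID -> (xmin, ymin, xmax, ymax) lookup; an assumption standing
# in for parsing the full CRS definition. Only 4326 is listed here.
BOUNDS_BY_SRID = {4326: (-180.0, -90.0, 180.0, 90.0)}


def xz2_index(xmin, ymin, xmax, ymax, srid=4326, g=12):
    """Map one bounding box to one sequence code (planar edges assumed).

    The box is assumed to lie within the coordinate range of the SRID.
    """
    bxmin, bymin, bxmax, bymax = BOUNDS_BY_SRID[srid]
    # Normalize to the unit square; this is the step that needs the range.
    nxmin = (xmin - bxmin) / (bxmax - bxmin)
    nxmax = (xmax - bxmin) / (bxmax - bxmin)
    nymin = (ymin - bymin) / (bymax - bymin)
    nymax = (ymax - bymin) / (bymax - bymin)

    # Pick the finest level whose enlarged (doubled) cell still holds the box.
    max_dim = max(nxmax - nxmin, nymax - nymin)
    if max_dim <= 0.0:
        level = g  # points go to the finest level
    else:
        l1 = math.floor(math.log(max_dim) / math.log(0.5))
        if l1 >= g:
            level = g
        else:
            w2 = 0.5 ** (l1 + 1)  # cell width one level finer than l1

            def fits(lo, hi):
                # Does [lo, hi] fit in the doubled cell that starts at lo's cell?
                return hi <= math.floor(lo / w2) * w2 + 2 * w2

            level = l1 + 1 if fits(nxmin, nxmax) and fits(nymin, nymax) else l1

    # Walk the quadtree using the lower-left corner of the box.
    x0, y0, x1, y1 = 0.0, 0.0, 1.0, 1.0
    code = 0
    for i in range(level):
        xc, yc = (x0 + x1) / 2.0, (y0 + y1) / 2.0
        quadrant = (0 if nxmin < xc else 1) + (0 if nymin < yc else 2)
        code += 1 + quadrant * (4 ** (g - i) - 1) // 3
        x0, x1 = (x0, xc) if nxmin < xc else (xc, x1)
        y0, y1 = (y0, yc) if nymin < yc else (yc, y1)
    return code
```

Every geometry, whatever its size, lands in exactly one code, which is the 1-to-1 row => partition value property. Large objects simply stop at a coarser level instead of being duplicated across cells.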


> 1. The first version of the specification (Phase 1) is described as
> focused on the planar geometry model with the CRS fixed to
> 4326. In this model, Snowflake would not be able to map our Geography type
> since it is based on the spherical Geography model. Given that Snowflake
> supports both edge types, we would like to better understand how to map
> them to the proposed Geometry type and its metadata.
>
>    - How is the edge type supposed to be interpreted by the query engine?
>    Is it necessary for the system to adhere to the edge model for geospatial
>    functions, or can it use the model that it supports or let the customer
>    choose it? Will it affect the bounding box or other row group metadata?
>    - Is there any reason why the flexible model has to be postponed to
>    further iterations? Would it be more extensible to support a mutable edge
>    type from Phase 1, but allow systems to ignore it if they do not
>    support the spherical computation model?
>
>
This may be answered by the previous paragraph regarding XZ2.

   1. If we get XZ2 to work with a more variable CRS without requiring the full
   PROJJSON specification, that seems to be a path to supporting the Snowflake
   Geometry type?
   2. If we get another one-to-one partition function on spherical edges,
   like the one proposed by Michael, that seems to be a path to supporting the
   Snowflake Geography type?

Does that sound correct?  As for why certain things are marked as Phase 1,
they were chosen just so we can all agree on an initial design and iterate
faster; they are not set in stone.  Maybe path 1 is possible to do quickly,
for example.

Also, I am not sure about handling evaluation of ST_COVERS, ST_COVERED_BY,
and ST_INTERSECTS (i.e., how easy it is to handle different CRSs and spherical
edges).  I will leave that to Jia.

Thanks!
Szehon

[1]:
https://github.com/opengeospatial/geoparquet/blob/main/format-specs/geoparquet.md#column-metadata
[2]:
https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI/edit?disco=AAABL-z6xXk
[3]:
https://docs.google.com/document/d/1tG13UpdNH3i0bVkjFLsE2kXEXCuw1XRpAC2L2qCUox0/edit


On Wed, May 29, 2024 at 8:30 AM Dmytro Koval
<dmytro.ko...@snowflake.com.invalid> wrote:

> Dear Szehon and Iceberg Community,
>
>
> This is Dmytro, Peter, Aihua, and Tyler from Snowflake. As part of our
> desire to be more active in the Iceberg community, we’ve been looking over
> this geospatial proposal. We’re excited geospatial is getting traction, as
> we see a lot of geo usage within Snowflake, and expect that usage to carry
> over to our Iceberg offerings soon. After reviewing the proposal, we have
> some questions we’d like to pose given our experience with geospatial
> support in Snowflake.
>
> We would like to clarify two aspects of the proposal: handling of the
> spherical model and definition of the spatial reference system, both of
> which have a big impact on interoperability with Snowflake and other
> query engines and geo processing systems.
>
>
> Let us first share some context about geospatial types at Snowflake; geo
> experts will certainly be familiar with this context already, but for the
> sake of others we want to err on the side of being explicit and clear.
> Snowflake supports two Geospatial types [1]:
> - Geography – uses a spherical approximation of the earth for all the
> computations. It does not perfectly represent the earth, but allows getting
> accurate results on WGS84 coordinates, used by GPS without any need to
> perform coordinate system reprojections. It is also quite fast for
> end-to-end computations. In general, it has less distortion compared to
> the 2D planar model.
> - Geometry – uses planar Euclidean geometry model. Geometric computations
> are simpler, but require transforming the data between coordinate systems
> to minimize the distortion. The Geometry data type allows setting a spatial
> reference system for each row using the SRID. The binary geospatial
> functions are only allowed on the geometries with the same SRID. The only
> function that interprets the SRID is ST_TRANSFORM, which allows conversion between
> different SRSs.
>
> [Figures: Geography; Geometry]
>
>
> Given the choice of two types and a set of operations on top of them, the
> majority of Snowflake users select the Geography type to represent their
> geospatial data.
>
> From our perspective, Iceberg users would benefit most from being given
> the flexibility to store and process data using the model that better fits
> their needs and specific use cases.
>
> Therefore, we would like to ask some design clarifying questions,
> important for interoperability:
>
>
> 1. The first version of the specification (Phase 1) is described as
> focused on the planar geometry model with the CRS fixed to
> 4326. In this model, Snowflake would not be able to map our Geography type
> since it is based on the spherical Geography model. Given that Snowflake
> supports both edge types, we would like to better understand how to map
> them to the proposed Geometry type and its metadata.
>
>    - How is the edge type supposed to be interpreted by the query engine?
>    Is it necessary for the system to adhere to the edge model for geospatial
>    functions, or can it use the model that it supports or let the customer
>    choose it? Will it affect the bounding box or other row group metadata?
>    - Is there any reason why the flexible model has to be postponed to
>    further iterations? Would it be more extensible to support a mutable edge
>    type from Phase 1, but allow systems to ignore it if they do not
>    support the spherical computation model?
>
>
>
> 2. As you mentioned [2] in the proposal there are difficulties with
> supporting the full PROJJSON specification of the SRS. From our experience
> most of the use-cases do not require the full definition of the SRS, in
> fact that definition is only needed when converting between coordinate
> systems. On the other hand, it’s often needed to check whether two geometry
> columns have the same coordinate system, for example when joining two
> columns from different data providers.
>
> To address this we would like to propose including the option to specify
> the SRS with only a SRID in phase 1. The query engine may choose to treat
> it as an opaque identifier or look it up in the EPSG database, if
> supported.
>
> Thank you again for driving this effort forward. We look forward to
> hearing your thoughts.
>
> [1]
> https://docs.snowflake.com/en/sql-reference/data-types-geospatial#understanding-the-differences-between-geography-and-geometry
>
> [2]
> https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI/edit#heading=h.oruaqt3nxcaf
>
>
> On 2024/05/02 00:41:52 Szehon Ho wrote:
> > Hi everyone,
> >
> > We have created a formal proposal for adding Geospatial support to
> > Iceberg.
> >
> > Please read the following for details.
> >
> >    - Github Proposal : https://github.com/apache/iceberg/issues/10260
> >    - Proposal Doc:
> >    https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI
> >
> >
> > Note that this proposal is built on existing extensive research and POC
> > implementations (Geolake, Havasu).  Special thanks to Jia Yu and Kristin
> > Cowalcijk from Wherobots/Geolake for extensive consultation and help in
> > writing this proposal, as well as support from Yuanyuan Zhang from
> > Geolake.
> >
> > We would love to get more feedback for this proposal from the wider
> > community and eventually discuss this in a community sync.
> >
> > Thanks
> > Szehon
> >
>
