Jia and I will sync with the Snowflake folks to see if we can have a
solution, or a roadmap to a solution, in the proposal.

Thanks JB for the interest!  By the way, I want to schedule a meeting to go
over the proposal.  It seems there's good feedback from folks on the geo side
(and even the Parquet community), but not many eyes/feedback from other
folks/PMC in the Iceberg community.  This might be due to lack of familiarity
or time to read through it all.  In fact, a lot of the advanced discussions
like this one are for Phase 2 items, and Phase 1 items are relatively
straightforward, so I wanted to explain that.  As I know it's summer vacation
for some folks, we can do this in a week or in early July; hope that sounds
good to everyone.

Thanks,
Szehon

On Tue, Jun 18, 2024 at 1:54 AM Jean-Baptiste Onofré <j...@nanthrax.net>
wrote:

> Hi Jia
>
> Thanks for the update. I'm gonna re-read the whole thread and document to
> have a better understanding.
>
> Thanks !
> Regards
> JB
>
> On Mon, Jun 17, 2024 at 7:44 PM Jia Yu <ji...@apache.org> wrote:
>
>> Hi Snowflake folks,
>>
>> Please let me know if you have other questions regarding the proposal. If
>> any, Szehon and I can set up a zoom call with you guys to clarify some
>> details. We are in the Pacific time zone. If you are in Europe, maybe early
>> morning Pacific Time works best for you?
>>
>> Thanks,
>> Jia
>>
>> On Wed, Jun 5, 2024 at 6:28 PM Gang Wu <ust...@gmail.com> wrote:
>>
>>> > The min/max stats are discussed in the doc (Phase 2), depending on the
>>> > non-trivial encoding.
>>>
>>> Just want to add that min/max stats filtering could be supported by file
>>> format natively. Adding geometry type to parquet spec
>>> is under discussion: https://github.com/apache/parquet-format/pull/240
>>>
>>> Best,
>>> Gang
>>>
>>> On Thu, Jun 6, 2024 at 5:53 AM Szehon Ho <szehon.apa...@gmail.com>
>>> wrote:
>>>
>>>> Hi Peter
>>>>
>>>> Yes, the document only concerns predicate pushdown on the geometry
>>>> column.  Predicate pushdown takes two forms: 1) partition filter and 2)
>>>> min/max stats.  The min/max stats are discussed in the doc (Phase 2),
>>>> depending on the non-trivial encoding.
>>>>
>>>> The evaluators are always AND'ed together, so I don't see why
>>>> partitioning by another key would not work on a table with a geo column.
>>>>
>>>> On another note, Jia and I thought that we could have a call to discuss
>>>> Snowflake geo types and drill down on some details.  What time zone are
>>>> you folks in, and what time works better?  I think Jia and I are both in
>>>> the Pacific time zone.
>>>>
>>>> Thanks
>>>> Szehon
>>>>
>>>> On Wed, Jun 5, 2024 at 1:02 AM Peter Popov <peter.po...@snowflake.com>
>>>> wrote:
>>>>
>>>>> Hi Szehon, hi Jia,
>>>>>
>>>>> Thank you for your replies. We now better understand the connection
>>>>> between the metadata and partitioning in this proposal. Supporting
>>>>> Mapping 1 is a great starting point, and we would like to work more
>>>>> closely with you on bringing support for spherical edges and other
>>>>> coordinate systems into Iceberg geometry.
>>>>>
>>>>> We have some follow-up questions regarding the partitioning (let us
>>>>> know if it’s better to comment directly in the document): Does this
>>>>> proposal imply that XZ2 partitioning is always required? In the
>>>>> current proposal, do you see a possibility of predicate pushdown to
>>>>> rely on x/y min/max column metadata instead of a partition key? We see
>>>>> use-cases where a table with a geo column can be partitioned by a
>>>>> different key (e.g. date) or combination of keys. It would be great to
>>>>> support such use cases from the very beginning.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Peter
>>>>>
>>>>> On Thu, May 30, 2024 at 8:07 AM Jia Yu <ji...@apache.org> wrote:
>>>>>
>>>>>> Hi Dmytro,
>>>>>>
>>>>>> Thanks for your email. To add to Szehon's answer,
>>>>>>
>>>>>> 1. How to represent Snowflake Geometry and Geography type in Iceberg,
>>>>>> given the Geo Iceberg Phase 1 design:
>>>>>>
>>>>>> Answer:
>>>>>> Mapping 1 (possible): Snowflake Geometry + SRID: 4326 -> Iceberg
>>>>>> Geometry + CRS84 + edges: Planar
>>>>>> Mapping 2 (impossible): Snowflake Geography -> Iceberg Geometry +
>>>>>> CRS84 + edges: Spherical
>>>>>> Mapping 3 (impossible): Snowflake Geometry + SRID:ABCDE -> Iceberg
>>>>>> Geometry + SRID:ABCDE + edges: Planar
>>>>>>
>>>>>> As Szehon mentioned, only Mapping 1 is possible because we need to
>>>>>> support spatial query push down in Iceberg. This function relies on the
>>>>>> Iceberg partition transform, which requires a 1:1 mapping between a value
>>>>>> (point/polygon/linestring) and a partition key. That is: given any
>>>>>> precision level, a polygon must produce a single ID; and the covering
>>>>>> indicated by this single ID must fully cover the extent of the polygon.
>>>>>> Currently, only xz2 can satisfy this requirement. If the theory from
>>>>>> Michael Entin can be proven to be correct, then we can support Mapping 2
>>>>>> in Phase 2 of Geo Iceberg.
>>>>>>
>>>>>> Regarding Mapping 3, this requires Iceberg to be able to understand
>>>>>> SRID / PROJJSON such that we will know min max X Y of the CRS (@Szehon,
>>>>>> maybe Iceberg can ask the engine to provide this information?). See my
>>>>>> answer 2.
>>>>>>
>>>>>> 2. Why choose projjson instead of SRID?
>>>>>>
>>>>>> The projjson idea was borrowed from GeoParquet because we'd like to
>>>>>> enable possible conversion between Geo Iceberg and GeoParquet. However, I
>>>>>> do understand that this is not a good idea for Iceberg since not many
>>>>>> libs can parse projjson.
>>>>>>
>>>>>> @Szehon Is there a way that we can support both SRID and PROJJSON in
>>>>>> Geo Iceberg?
>>>>>>
>>>>>> It is also worth noting that, although there are many libs that can
>>>>>> parse SRID and perform look-up in the EPSG database, the license of the
>>>>>> EPSG database is NOT compatible with the Apache Software Foundation. That
>>>>>> means: Iceberg still cannot parse / understand SRID.
>>>>>>
>>>>>> Thanks,
>>>>>> Jia
>>>>>>
>>>>>> On Wed, May 29, 2024 at 11:08 AM Szehon Ho <szehon.apa...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Dmytro
>>>>>>>
>>>>>>> Thank you for looking through the proposal; I'm excited to hear
>>>>>>> from you guys!  I am not a 'geo expert' and I will definitely need to
>>>>>>> pull in Jia Yu for some of these points.
>>>>>>>
>>>>>>> Although most calculations are done in the query engine, the Iceberg
>>>>>>> reference implementations (i.e., Java, Python) do have to support a few
>>>>>>> calculations to handle filter push down:
>>>>>>>
>>>>>>>    1. push down of the proposed Geospatial transforms ST_COVERS,
>>>>>>>    ST_COVERED_BY, and ST_INTERSECTS
>>>>>>>    2. evaluation of the proposed Geospatial partition transform XZ2.
>>>>>>>    As you may have seen, this was chosen as it's the only standard one
>>>>>>>    today that solves the 'boundary object' problem while still preserving
>>>>>>>    a 1-to-1 mapping of row => partition value.
>>>>>>>
>>>>>>> This is the primary rationale for choosing the values, as these were
>>>>>>> implemented in the GeoLake and Havasu projects (Iceberg forks that
>>>>>>> sparked the proposal) based on Geometry type (edge=planar,
>>>>>>> crs=OGC:CRS84/SRID=4326).
>>>>>>>
>>>>>>>> 2. As you mentioned [2] in the proposal, there are difficulties with
>>>>>>>> supporting the full PROJJSON specification of the SRS. From our
>>>>>>>> experience, most of the use-cases do not require the full definition of
>>>>>>>> the SRS; in fact, that definition is only needed when converting between
>>>>>>>> coordinate systems. On the other hand, it’s often necessary to check
>>>>>>>> whether two geometry columns have the same coordinate system, for
>>>>>>>> example when joining two columns from different data providers.
>>>>>>>>
>>>>>>>> To address this, we would like to propose including the option to
>>>>>>>> specify the SRS with only a SRID in Phase 1. The query engine may
>>>>>>>> choose to treat it as an opaque identifier or make a look-up in the EPSG
>>>>>>>> database if supported.
>>>>>>>>
>>>>>>>
>>>>>>> The way to specify the CRS definition is actually taken from GeoParquet
>>>>>>> [1]; I think we are not bound to follow it if there are better options.
>>>>>>> I feel we might need to at least list out supported configurations in
>>>>>>> the spec, though.  There is some conversation on the doc here about this
>>>>>>> [2].  Basically:
>>>>>>>
>>>>>>>    1. XZ2 assumes planar edges.  This is a feature of the
>>>>>>>    algorithm, based on the original paper.  A possible solution for
>>>>>>>    spherical edges is proposed by Michael Entin here: [3], please feel
>>>>>>>    free to evaluate.
>>>>>>>    2. XZ2 needs to know the coordinate range.  According to Jia's
>>>>>>>    comments, this needs parsing of the CRS.  Can it be done with SRID
>>>>>>>    alone?
>>>>>>>
>>>>>>>
>>>>>>>> 1. Phase 1, the first version of the specification, is described as
>>>>>>>> focused on the planar geometry model with the CRS fixed to 4326. In
>>>>>>>> this model, Snowflake would not be able to map our Geography type,
>>>>>>>> since it is based on the spherical Geography model. Given that
>>>>>>>> Snowflake supports both edge types, we would like to better understand
>>>>>>>> how to map them to the proposed Geometry type and its metadata.
>>>>>>>>
>>>>>>>>    - How is the edge type supposed to be interpreted by the query
>>>>>>>>    engine? Is it necessary for the system to adhere to the edge model
>>>>>>>>    for geospatial functions, or can it use the model that it supports
>>>>>>>>    or let the customer choose it? Will it affect the bounding box or
>>>>>>>>    other row group metadata?
>>>>>>>>    - Is there any reason why the flexible model has to be postponed
>>>>>>>>    to further iterations? Would it be more extensible to support a
>>>>>>>>    mutable edge type from Phase 1, but allow systems to ignore it if
>>>>>>>>    they do not support the spherical computation model?
>>>>>>>>
>>>>>>>>
>>>>>>> It may be answered by the previous paragraph in regards to XZ2.
>>>>>>>
>>>>>>>    1. If we get XZ2 to work with a more variable CRS without
>>>>>>>    requiring the full PROJJSON specification, it seems that is a path to
>>>>>>>    supporting the Snowflake Geometry type?
>>>>>>>    2. If we get another one-to-one partition function on spherical
>>>>>>>    edges, like the one proposed by Michael, it seems that is a path to
>>>>>>>    supporting the Snowflake Geography type?
>>>>>>>
>>>>>>> Does that sound correct?  As for why certain things are marked as
>>>>>>> Phase 1, they were just chosen so we can all agree on an initial design
>>>>>>> and iterate faster; they are not set in stone.  Maybe path 1 is possible
>>>>>>> to do quickly, for example.
>>>>>>>
>>>>>>> Also, I am not sure about handling evaluation of ST_COVERS,
>>>>>>> ST_COVERED_BY, and ST_INTERSECTS (how easy it is to handle different CRS
>>>>>>> + spherical edges).  I will leave it to Jia.
>>>>>>>
>>>>>>> Thanks!
>>>>>>> Szehon
>>>>>>>
>>>>>>> [1]:
>>>>>>> https://github.com/opengeospatial/geoparquet/blob/main/format-specs/geoparquet.md#column-metadata
>>>>>>> [2]:
>>>>>>> https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI/edit?disco=AAABL-z6xXk
>>>>>>> [3]:
>>>>>>> https://docs.google.com/document/d/1tG13UpdNH3i0bVkjFLsE2kXEXCuw1XRpAC2L2qCUox0/edit
>>>>>>>
>>>>>>>
>>>>>>> On Wed, May 29, 2024 at 8:30 AM Dmytro Koval
>>>>>>> <dmytro.ko...@snowflake.com.invalid> wrote:
>>>>>>>
>>>>>>>> Dear Szehon and Iceberg Community,
>>>>>>>>
>>>>>>>>
>>>>>>>> This is Dmytro, Peter, Aihua, and Tyler from Snowflake. As part of
>>>>>>>> our desire to be more active in the Iceberg community, we’ve been
>>>>>>>> looking over this geospatial proposal. We’re excited geospatial is
>>>>>>>> getting traction, as we see a lot of geo usage within Snowflake, and
>>>>>>>> expect that usage to carry over to our Iceberg offerings soon. After
>>>>>>>> reviewing the proposal, we have some questions we’d like to pose given
>>>>>>>> our experience with geospatial support in Snowflake.
>>>>>>>>
>>>>>>>> We would like to clarify two aspects of the proposal: handling of
>>>>>>>> the spherical model and definition of the spatial reference system,
>>>>>>>> both of which have a big impact on interoperability with Snowflake and
>>>>>>>> other query engines and geo processing systems.
>>>>>>>>
>>>>>>>>
>>>>>>>> Let us first share some context about geospatial types at
>>>>>>>> Snowflake; geo experts will certainly be familiar with this context
>>>>>>>> already, but for the sake of others we want to err on the side of being
>>>>>>>> explicit and clear. Snowflake supports two Geospatial types [1]:
>>>>>>>> - Geography – uses a spherical approximation of the earth for all
>>>>>>>> the computations. It does not perfectly represent the earth, but allows
>>>>>>>> getting accurate results on WGS84 coordinates, used by GPS, without any
>>>>>>>> need to perform coordinate system reprojections. It is also quite fast
>>>>>>>> for end-to-end computations. In general, it has less distortion
>>>>>>>> compared to the 2D planar model.
>>>>>>>> - Geometry – uses a planar Euclidean geometry model. Geometric
>>>>>>>> computations are simpler, but require transforming the data between
>>>>>>>> coordinate systems to minimize the distortion. The Geometry data type
>>>>>>>> allows setting a spatial reference system for each row using the SRID.
>>>>>>>> Binary geospatial functions are only allowed on geometries with the
>>>>>>>> same SRID. The only function that interprets the SRID is ST_TRANSFORM,
>>>>>>>> which allows conversion between different SRSs.
>>>>>>>>
>>>>>>>> [Figures: Geography, Geometry]
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Given the choice of two types and a set of operations on top of
>>>>>>>> them, the majority of Snowflake users select the Geography type to
>>>>>>>> represent their geospatial data.
>>>>>>>>
>>>>>>>> From our perspective, Iceberg users would benefit most from being
>>>>>>>> given the flexibility to store and process data using the model that
>>>>>>>> better fits their needs and specific use cases.
>>>>>>>>
>>>>>>>> Therefore, we would like to ask some design clarifying questions,
>>>>>>>> important for interoperability:
>>>>>>>>
>>>>>>>>
>>>>>>>> 1. Phase 1, the first version of the specification, is described as
>>>>>>>> focused on the planar geometry model with the CRS fixed to 4326. In
>>>>>>>> this model, Snowflake would not be able to map our Geography type,
>>>>>>>> since it is based on the spherical Geography model. Given that
>>>>>>>> Snowflake supports both edge types, we would like to better understand
>>>>>>>> how to map them to the proposed Geometry type and its metadata.
>>>>>>>>
>>>>>>>>    - How is the edge type supposed to be interpreted by the query
>>>>>>>>    engine? Is it necessary for the system to adhere to the edge model
>>>>>>>>    for geospatial functions, or can it use the model that it supports
>>>>>>>>    or let the customer choose it? Will it affect the bounding box or
>>>>>>>>    other row group metadata?
>>>>>>>>    - Is there any reason why the flexible model has to be postponed
>>>>>>>>    to further iterations? Would it be more extensible to support a
>>>>>>>>    mutable edge type from Phase 1, but allow systems to ignore it if
>>>>>>>>    they do not support the spherical computation model?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> 2. As you mentioned [2] in the proposal, there are difficulties with
>>>>>>>> supporting the full PROJJSON specification of the SRS. From our
>>>>>>>> experience, most of the use-cases do not require the full definition of
>>>>>>>> the SRS; in fact, that definition is only needed when converting between
>>>>>>>> coordinate systems. On the other hand, it’s often necessary to check
>>>>>>>> whether two geometry columns have the same coordinate system, for
>>>>>>>> example when joining two columns from different data providers.
>>>>>>>>
>>>>>>>> To address this, we would like to propose including the option to
>>>>>>>> specify the SRS with only a SRID in Phase 1. The query engine may
>>>>>>>> choose to treat it as an opaque identifier or make a look-up in the EPSG
>>>>>>>> database if supported.
>>>>>>>>
>>>>>>>> Thank you again for driving this effort forward. We look forward to
>>>>>>>> hearing your thoughts.
>>>>>>>>
>>>>>>>> [1]
>>>>>>>> https://docs.snowflake.com/en/sql-reference/data-types-geospatial#understanding-the-differences-between-geography-and-geometry
>>>>>>>>
>>>>>>>> [2]
>>>>>>>> https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI/edit#heading=h.oruaqt3nxcaf
>>>>>>>>
>>>>>>>>
>>>>>>>> On 2024/05/02 00:41:52 Szehon Ho wrote:
>>>>>>>> > Hi everyone,
>>>>>>>> >
>>>>>>>> > We have created a formal proposal for adding Geospatial support
>>>>>>>> > to Iceberg.
>>>>>>>> >
>>>>>>>> > Please read the following for details.
>>>>>>>> >
>>>>>>>> >    - GitHub Proposal:
>>>>>>>> >    https://github.com/apache/iceberg/issues/10260
>>>>>>>> >    - Proposal Doc:
>>>>>>>> >    https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > Note that this proposal is built on existing extensive research
>>>>>>>> > and POC implementations (Geolake, Havasu).  Special thanks to Jia Yu
>>>>>>>> > and Kristin Cowalcijk from Wherobots/Geolake for extensive
>>>>>>>> > consultation and help in writing this proposal, as well as support
>>>>>>>> > from Yuanyuan Zhang from Geolake.
>>>>>>>> >
>>>>>>>> > We would love to get more feedback for this proposal from the
>>>>>>>> > wider community and eventually discuss this in a community sync.
>>>>>>>> >
>>>>>>>> > Thanks
>>>>>>>> > Szehon
>>>>>>>> >
>>>>>>>>
>>>>>>>
