Jia and I will sync with the Snowflake folks to see if we can have a solution, or a roadmap to a solution, in the proposal.
Thanks JB for the interest!

By the way, I want to schedule a meeting to go over the proposal. There has been good feedback from folks on the geo side (and even the Parquet community), but not many eyes or much feedback from other folks/PMC in the Iceberg community. This might be due to lack of familiarity or time to read through it all. In fact, a lot of the advanced discussions like this one are about Phase 2 items, while the Phase 1 items are relatively straightforward, so I wanted to explain that.

As I know it's summer vacation for some folks, we can do this in a week or in early July; I hope that works for everyone.

Thanks,
Szehon

On Tue, Jun 18, 2024 at 1:54 AM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:

> Hi Jia
>
> Thanks for the update. I'm gonna re-read the whole thread and document to have a better understanding.
>
> Thanks !
> Regards
> JB
>
> On Mon, Jun 17, 2024 at 7:44 PM Jia Yu <ji...@apache.org> wrote:
>
>> Hi Snowflake folks,
>>
>> Please let me know if you have other questions regarding the proposal. If any, Szehon and I can set up a zoom call with you guys to clarify some details. We are in the Pacific time zone. If you are in Europe, maybe early morning Pacific Time works best for you?
>>
>> Thanks,
>> Jia
>>
>> On Wed, Jun 5, 2024 at 6:28 PM Gang Wu <ust...@gmail.com> wrote:
>>
>>> > The min/max stats are discussed in the doc (Phase 2), depending on the non-trivial encoding.
>>>
>>> Just want to add that min/max stats filtering could be supported by the file format natively. Adding a geometry type to the Parquet spec is under discussion: https://github.com/apache/parquet-format/pull/240
>>>
>>> Best,
>>> Gang
>>>
>>> On Thu, Jun 6, 2024 at 5:53 AM Szehon Ho <szehon.apa...@gmail.com> wrote:
>>>
>>>> Hi Peter
>>>>
>>>> Yes, the document only concerns the predicate pushdown of the geometric column. Predicate pushdown takes two forms: 1) partition filter and 2) min/max stats.
>>>> The min/max stats are discussed in the doc (Phase 2), depending on the non-trivial encoding.
>>>>
>>>> The evaluators are always AND'ed together, so I don't see any reason why partitioning by another key would not work on a table with a geo column.
>>>>
>>>> On another note, Jia and I thought that we may have a discussion about Snowflake geo types in a call to drill down on some details? What time zone are you folks in, and what time works better? I think Jia and I are both in the Pacific time zone.
>>>>
>>>> Thanks
>>>> Szehon
>>>>
>>>> On Wed, Jun 5, 2024 at 1:02 AM Peter Popov <peter.po...@snowflake.com> wrote:
>>>>
>>>>> Hi Szehon, hi Jia,
>>>>>
>>>>> Thank you for your replies. We now better understand the connection between the metadata and partitioning in this proposal. Supporting Mapping 1 is a great starting point, and we would like to work more closely with you on bringing support for spherical edges and other coordinate systems into Iceberg geometry.
>>>>>
>>>>> We have some follow-up questions regarding the partitioning (let us know if it’s better to comment directly in the document): Does this proposal imply that XZ2 partitioning is always required? In the current proposal, do you see a possibility of predicate pushdown relying on x/y min/max column metadata instead of a partition key? We see use cases where a table with a geo column can be partitioned by a different key (e.g. date) or a combination of keys. It would be great to support such use cases from the very beginning.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Peter
>>>>>
>>>>> On Thu, May 30, 2024 at 8:07 AM Jia Yu <ji...@apache.org> wrote:
>>>>>
>>>>>> Hi Dmytro,
>>>>>>
>>>>>> Thanks for your email. To add to Szehon's answer,
>>>>>>
>>>>>> 1.
>>>>>> How to represent Snowflake Geometry and Geography type in Iceberg, given the Geo Iceberg Phase 1 design:
>>>>>>
>>>>>> Answer:
>>>>>> Mapping 1 (possible): Snowflake Geometry + SRID: 4326 -> Iceberg Geometry + CRS84 + edges: Planar
>>>>>> Mapping 2 (impossible): Snowflake Geography -> Iceberg Geometry + CRS84 + edges: Spherical
>>>>>> Mapping 3 (impossible): Snowflake Geometry + SRID:ABCDE -> Iceberg Geometry + SRID:ABCDE + edges: Planar
>>>>>>
>>>>>> As Szehon mentioned, only Mapping 1 is possible because we need to support spatial query push down in Iceberg. This function relies on the Iceberg partition transform, which requires a 1:1 mapping between a value (point/polygon/linestring) and a partition key. That is: given any precision level, a polygon must produce a single ID, and the covering indicated by this single ID must fully cover the extent of the polygon. Currently, only XZ2 can satisfy this requirement. If the theory from Michael Entin can be proven to be correct, then we can support Mapping 2 in Phase 2 of Geo Iceberg.
>>>>>>
>>>>>> Regarding Mapping 3, this requires Iceberg to be able to understand SRID / PROJJSON so that we know the min/max X and Y of the CRS (@Szehon, maybe Iceberg can ask the engine to provide this information?). See my answer 2.
>>>>>>
>>>>>> 2. Why choose PROJJSON instead of SRID?
>>>>>>
>>>>>> The PROJJSON idea was borrowed from GeoParquet because we'd like to enable possible conversion between Geo Iceberg and GeoParquet. However, I do understand that this is not a good idea for Iceberg since not many libs can parse PROJJSON.
>>>>>>
>>>>>> @Szehon Is there a way that we can support both SRID and PROJJSON in Geo Iceberg?
>>>>>>
>>>>>> It is also worth noting that, although there are many libs that can parse SRID and perform look-up in the EPSG database, the license of the EPSG database is NOT compatible with the Apache Software Foundation. That means: Iceberg still cannot parse / understand SRID.
>>>>>>
>>>>>> Thanks,
>>>>>> Jia
>>>>>>
>>>>>> On Wed, May 29, 2024 at 11:08 AM Szehon Ho <szehon.apa...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Dmytro
>>>>>>>
>>>>>>> Thank you for looking through the proposal; I'm excited to hear from you guys! I am not a 'geo expert' and I will definitely need to pull in Jia Yu for some of these points.
>>>>>>>
>>>>>>> Although most calculations are done in the query engine, the Iceberg reference implementations (i.e., Java, Python) do have to support a few calculations to handle filter push down:
>>>>>>>
>>>>>>> 1. push down of the proposed geospatial predicates ST_COVERS, ST_COVERED_BY, and ST_INTERSECTS
>>>>>>> 2. evaluation of the proposed geospatial partition transform XZ2. As you may have seen, this was chosen as it's the only standard one today that solves the 'boundary object' problem while still preserving a 1-to-1 mapping of row => partition value.
>>>>>>>
>>>>>>> This is the primary rationale for choosing the values, as these were implemented in the GeoLake and Havasu projects (Iceberg forks that sparked the proposal) based on the Geometry type (edge=planar, crs=OGC:CRS84 / SRID=4326).
>>>>>>>
>>>>>>>> 2. As you mentioned [2] in the proposal, there are difficulties with supporting the full PROJJSON specification of the SRS. From our experience, most of the use cases do not require the full definition of the SRS; in fact, that definition is only needed when converting between coordinate systems.
>>>>>>>> On the other hand, it’s often needed to check whether two geometry columns have the same coordinate system, for example when joining two columns from different data providers.
>>>>>>>>
>>>>>>>> To address this we would like to propose including the option to specify the SRS with only a SRID in Phase 1. The query engine may choose to treat it as an opaque identifier or make a look-up in the EPSG database if supported.
>>>>>>>>
>>>>>>>
>>>>>>> The way to specify the CRS definition is actually taken from GeoParquet [1]; I think we are not bound to follow it if there are better options. I feel we might need to at least list out supported configurations in the spec, though. There is some conversation on the doc here about this [2]. Basically:
>>>>>>>
>>>>>>> 1. XZ2 assumes planar edges. This is a feature of the algorithm, based on the original paper. A possible solution for spherical edges is proposed by Michael Entin here: [3], please feel free to evaluate.
>>>>>>> 2. XZ2 needs to know the coordinate range. According to Jia's comments, this needs parsing of the CRS. Can it be done with SRID alone?
>>>>>>>
>>>>>>>> 1. Phase 1, the first version of the specification, is described as focused on the planar geometry model with the CRS fixed to 4326. In this model, Snowflake would not be able to map our Geography type since it is based on the spherical Geography model. Given that Snowflake supports both edge types, we would like to better understand how to map them to the proposed Geometry type and its metadata.
>>>>>>>>
>>>>>>>> - How is the edge type supposed to be interpreted by the query engine?
>>>>>>>>   Is it necessary for the system to adhere to the edge model for geospatial functions, or can it use the model that it supports or let the customer choose it? Will it affect the bounding box or other row group metadata?
>>>>>>>> - Is there any reason why the flexible model has to be postponed to further iterations? Would it be more extensible to support a mutable edge type from Phase 1, but allow systems to ignore it if they do not support the spherical computation model?
>>>>>>>>
>>>>>>>
>>>>>>> It may be answered by the previous paragraph in regards to XZ2.
>>>>>>>
>>>>>>> 1. If we get XZ2 to work with a more variable CRS without requiring the full PROJJSON specification, it seems it is a path to support the Snowflake Geometry type?
>>>>>>> 2. If we get another one-to-one partition function on spherical edges, like the one proposed by Michael, it seems a path to support the Snowflake Geography type?
>>>>>>>
>>>>>>> Does that sound correct? As for why certain things are marked as Phase 1, they were chosen so we can all agree on an initial design and iterate faster; they are not set in stone. Maybe path 1 is possible to do quickly, for example.
>>>>>>>
>>>>>>> Also, I am not sure about handling evaluation of ST_COVERS, ST_COVERED_BY, and ST_INTERSECTS (how easy it is to handle different CRS + spherical edges). I will leave it to Jia.
>>>>>>>
>>>>>>> Thanks!
>>>>>>> Szehon
>>>>>>>
>>>>>>> [1]: https://github.com/opengeospatial/geoparquet/blob/main/format-specs/geoparquet.md#column-metadata
>>>>>>> [2]: https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI/edit?disco=AAABL-z6xXk
>>>>>>> [3]: https://docs.google.com/document/d/1tG13UpdNH3i0bVkjFLsE2kXEXCuw1XRpAC2L2qCUox0/edit
>>>>>>>
>>>>>>> On Wed, May 29, 2024 at 8:30 AM Dmytro Koval <dmytro.ko...@snowflake.com.invalid> wrote:
>>>>>>>
>>>>>>>> Dear Szehon and Iceberg Community,
>>>>>>>>
>>>>>>>> This is Dmytro, Peter, Aihua, and Tyler from Snowflake. As part of our desire to be more active in the Iceberg community, we’ve been looking over this geospatial proposal. We’re excited geospatial is getting traction, as we see a lot of geo usage within Snowflake, and expect that usage to carry over to our Iceberg offerings soon. After reviewing the proposal, we have some questions we’d like to pose given our experience with geospatial support in Snowflake.
>>>>>>>>
>>>>>>>> We would like to clarify two aspects of the proposal: handling of the spherical model and definition of the spatial reference system. Both have a big impact on the interoperability with Snowflake and other query engines and geo processing systems.
>>>>>>>>
>>>>>>>> Let us first share some context about geospatial types at Snowflake; geo experts will certainly be familiar with this context already, but for the sake of others we want to err on the side of being explicit and clear.
>>>>>>>> Snowflake supports two Geospatial types [1]:
>>>>>>>> - Geography: uses a spherical approximation of the earth for all the computations. It does not perfectly represent the earth, but allows getting accurate results on WGS84 coordinates (used by GPS) without any need to perform coordinate system reprojections. It is also quite fast for end-to-end computations. In general, it has fewer distortions compared to the 2D planar model.
>>>>>>>> - Geometry: uses a planar Euclidean geometry model. Geometric computations are simpler, but require transforming the data between coordinate systems to minimize the distortion. The Geometry data type allows setting a spatial reference system for each row using the SRID. The binary geospatial functions are only allowed on geometries with the same SRID. The only function that interprets the SRID is ST_TRANSFORM, which allows conversion between different SRSs.
>>>>>>>>
>>>>>>>> [Figures: Geography | Geometry]
>>>>>>>>
>>>>>>>> Given the choice of two types and a set of operations on top of them, the majority of Snowflake users select the Geography type to represent their geospatial data.
>>>>>>>>
>>>>>>>> From our perspective, Iceberg users would benefit most from being given the flexibility to store and process data using the model that better fits their needs and specific use cases.
>>>>>>>>
>>>>>>>> Therefore, we would like to ask some clarifying design questions that are important for interoperability:
>>>>>>>>
>>>>>>>> 1. Phase 1, the first version of the specification, is described as focused on the planar geometry model with the CRS fixed to 4326. In this model, Snowflake would not be able to map our Geography type since it is based on the spherical Geography model.
>>>>>>>> Given that Snowflake supports both edge types, we would like to better understand how to map them to the proposed Geometry type and its metadata.
>>>>>>>>
>>>>>>>> - How is the edge type supposed to be interpreted by the query engine? Is it necessary for the system to adhere to the edge model for geospatial functions, or can it use the model that it supports or let the customer choose it? Will it affect the bounding box or other row group metadata?
>>>>>>>> - Is there any reason why the flexible model has to be postponed to further iterations? Would it be more extensible to support a mutable edge type from Phase 1, but allow systems to ignore it if they do not support the spherical computation model?
>>>>>>>>
>>>>>>>> 2. As you mentioned [2] in the proposal, there are difficulties with supporting the full PROJJSON specification of the SRS. From our experience, most of the use cases do not require the full definition of the SRS; in fact, that definition is only needed when converting between coordinate systems. On the other hand, it’s often needed to check whether two geometry columns have the same coordinate system, for example when joining two columns from different data providers.
>>>>>>>>
>>>>>>>> To address this we would like to propose including the option to specify the SRS with only a SRID in Phase 1. The query engine may choose to treat it as an opaque identifier or make a look-up in the EPSG database if supported.
>>>>>>>>
>>>>>>>> Thank you again for driving this effort forward. We look forward to hearing your thoughts.
>>>>>>>>
>>>>>>>> [1] https://docs.snowflake.com/en/sql-reference/data-types-geospatial#understanding-the-differences-between-geography-and-geometry
>>>>>>>>
>>>>>>>> [2] https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI/edit#heading=h.oruaqt3nxcaf
>>>>>>>>
>>>>>>>> On 2024/05/02 00:41:52 Szehon Ho wrote:
>>>>>>>> > Hi everyone,
>>>>>>>> >
>>>>>>>> > We have created a formal proposal for adding Geospatial support to Iceberg.
>>>>>>>> >
>>>>>>>> > Please read the following for details.
>>>>>>>> >
>>>>>>>> > - GitHub Proposal: https://github.com/apache/iceberg/issues/10260
>>>>>>>> > - Proposal Doc: https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI
>>>>>>>> >
>>>>>>>> > Note that this proposal is built on existing extensive research and POC implementations (Geolake, Havasu). Special thanks to Jia Yu and Kristin Cowalcijk from Wherobots/Geolake for extensive consultation and help in writing this proposal, as well as support from Yuanyuan Zhang from Geolake.
>>>>>>>> >
>>>>>>>> > We would love to get more feedback for this proposal from the wider community and eventually discuss this in a community sync.
>>>>>>>> >
>>>>>>>> > Thanks
>>>>>>>> > Szehon
>>>>>>>> >