Seems like there is general consensus on https://github.com/apache/parquet-format/pull/560
I'll aim to merge tomorrow unless there is more feedback.

On Sun, Apr 12, 2026 at 4:51 PM Micah Kornfield <[email protected]> wrote:

> I took a quick look and triggered CI. I had one suggestion but otherwise
> from my limited knowledge it looked reasonable.
>
> On Tue, Apr 7, 2026 at 8:58 AM Milan Stefanovic <[email protected]> wrote:
>
>> Thanks everyone for the review.
>>
>> Can some committers review as well, so we can finalize this?
>>
>> Thanks,
>> Milan
>>
>> On Wed, 1 Apr 2026 at 14:26, Milan Stefanovic <[email protected]> wrote:
>>
>> > Thanks for the explanation Dewey!
>> >
>> > I've opened PR:
>> > https://github.com/apache/parquet-format/pull/560
>> >
>> > Let me know what you think!
>> >
>> > P.S. - If you know any relevant party in geo community, let's involve
>> > them explicitly as well.
>> >
>> > cc: @Jia Yu <[email protected]>
>> >
>> > Thanks,
>> > Milan
>> >
>> > On Sat, 28 Mar 2026 at 03:32, Dewey Dunnington <[email protected]> wrote:
>> >
>> >> Hi Milan,
>> >>
>> >> > Dewey, you mentioned current writers using inline strings - what are
>> >> > they inlining? Are they inlining projjsons or authority:identifiers?
>> >>
>> >> The writers are writing the CRS representation they receive, which for
>> >> Arrow C++ and arrow-rs comes from the geoarrow.wkb extension type
>> >> metadata [1]. This is usually PROJJSON but is permitted to be a string
>> >> (including authority:code). If you run
>> >> pyarrow.parquet.write_table(pyarrow.table(geopandas_geof.to_arrow()))
>> >> today, you will get a Parquet file with an inlined PROJJSON CRS,
>> >> because that is how GeoPandas encodes CRSes when converting to Arrow.
>> >>
>> >> > reality of current implementations is such
>> >> > that most implementations do write `authority:identifier`, spec
>> >> > should be written so that at least it doesn't look like that's
>> >> > invalid.
>> >>
>> >> The reality of current implementations is that they are writing
>> >> PROJJSON, although I would also happily support a rewording that adds
>> >> authority:code to the recommended options list.
>> >>
>> >> > Isn't EPSG:<number> also understood to map directly to its
>> >> > corresponding PROJJSON definition?
>> >>
>> >> They can be mapped to a PROJJSON definition (or a number of other less
>> >> friendly export formats) using a database with the licensing ambiguity
>> >> Jia mentioned. Conversely, PROJJSON can be mapped to authority:code
>> >> with some minimal JSON parsing (we do this in Arrow C++ and arrow-rs
>> >> to canonically remove CRS definitions that correspond to lon/lat to
>> >> produce more universally consumable Parquet files).
>> >>
>> >> Cheers,
>> >>
>> >> -dewey
>> >>
>> >> [1] https://geoarrow.org/extension-types.html#extension-metadata
>> >>
>> >> On Fri, Mar 27, 2026 at 3:52 PM Milan Stefanovic
>> >> <[email protected]> wrote:
>> >> >
>> >> > Thanks Jia and Dewey,
>> >> >
>> >> > Dewey, you mentioned current writers using inline strings - what are
>> >> > they inlining? Are they inlining projjsons or authority:identifiers?
>> >> > Given that current implementations avoided using srid:<number> and
>> >> > projjson:<field_ref>, perhaps we should remove these examples from
>> >> > the spec, as they seem to bring some confusion.
>> >> >
>> >> > @Jia Yu <[email protected]>, you mentioned that OGC:CRS84 is
>> >> > understood to map directly to its corresponding PROJJSON definition.
>> >> > Isn't EPSG:<number> also understood to map directly to its
>> >> > corresponding PROJJSON definition?
>> >> >
>> >> > Also I'm fine with not being explicit about `authority:identifier`
>> >> > if that was the prior consensus, but if reality of current
>> >> > implementations is such that most implementations do write
>> >> > `authority:identifier`, spec should be written so that at least it
>> >> > doesn't look like that's invalid.
>> >> >
>> >> > What are your thoughts?
>> >> >
>> >> > Milan
>> >> >
>> >> > On Wed, 25 Mar 2026 at 15:53, Dewey Dunnington <[email protected]>
>> >> > wrote:
>> >> >
>> >> > > Hi Milan,
>> >> > >
>> >> > > A short answer is that the current language of the spec does not
>> >> > > forbid writing "OGC:CRS84" to the CRS field (which is "just a
>> >> > > string" as far as Thrift is concerned). All existing readers that
>> >> > > I know about (DuckDB, arrow-rs, Arrow C++, GDAL) will accept that
>> >> > > string and interpret it unambiguously on read (for example,
>> >> > > `GeoPandas.from_arrow(pyarrow.parquet.read_table(...))` works).
>> >> > > There is also an example file in parquet-testing that covers this
>> >> > > case (arbitrary string that is neither of the recommended
>> >> > > options) [1]. I put together a small example script to demonstrate
>> >> > > the read path for the tools I mentioned [2].
>> >> > >
>> >> > > Jia is correct that the GeoParquet community will require writing
>> >> > > an inline PROJJSON string in the forthcoming 2.0 version of the
>> >> > > specification [3].
>> >> > > This was a pragmatic decision that reflects the needs of existing
>> >> > > GeoParquet users because:
>> >> > >
>> >> > > - srid does not explicitly name the EPSG database, so any code
>> >> > >   written there does not have an unambiguous interpretation (even
>> >> > >   if it did, it would place ambiguous licensing and/or dependency
>> >> > >   requirements on consumers)
>> >> > > - projjson:some_field was not pragmatic to implement on the write
>> >> > >   side for either of the implementations I was involved in (C++
>> >> > >   and Rust). Implementations just don't expose the global
>> >> > >   key/value metadata when converting types and doing so would have
>> >> > >   required breaking changes in the APIs. There are also
>> >> > >   ambiguities with respect to existing propagation of schema
>> >> > >   metadata (i.e., the projjson schema key is often propagated in
>> >> > >   unexpected ways into pyarrow and beyond, including being written
>> >> > >   into the key/value metadata of a resulting Parquet file).
>> >> > >
>> >> > > As a result, most of the tools that can write GEOMETRY and
>> >> > > GEOGRAPHY (Arrow C++, GDAL, arrow-rs) are currently writing inline
>> >> > > strings (because inline strings are what is available in the
>> >> > > representation passed to Arrow-based writers and this was better
>> >> > > than omitting CRS information). For all the implementations I was
>> >> > > involved in, we also try to explicitly omit the CRS when we detect
>> >> > > that the string we were passed is lon/lat (i.e., if they see
>> >> > > "OGC:CRS84", they write an omitted CRS to minimize the need for
>> >> > > consumers to be CRS aware).
>> >> > >
>> >> > > I'll echo Jia's comment that none of us are keen to reopen a CRS
>> >> > > discussion but I also agree that the current language of the spec
>> >> > > is vague and doesn't reflect the reality of the ecosystem as it
>> >> > > has evolved.
>> >> > > I'm happy to review any PRs to improve the language or
>> >> > > implementations :)
>> >> > >
>> >> > > Cheers,
>> >> > >
>> >> > > -dewey
>> >> > >
>> >> > > [1] https://github.com/apache/parquet-testing/tree/master/data/geospatial#geospatial-test-files
>> >> > > [2] https://gist.github.com/paleolimbot/7759e58bf1f98ecf8f2c459367bbdeda
>> >> > > [3] https://github.com/opengeospatial/geoparquet/blob/main/format-specs/geoparquet.md#crs-parquet-property
>> >> > >
>> >> > > On Wed, Mar 25, 2026 at 12:49 AM Jia Yu <[email protected]> wrote:
>> >> > > >
>> >> > > > Hi Milan,
>> >> > > >
>> >> > > > The authority:identifier pattern was explicitly rejected in prior
>> >> > > > community discussions. The core concern is that it forces query
>> >> > > > engines to rely on external registries to resolve CRS
>> >> > > > definitions, which breaks the goal of self-contained data. More
>> >> > > > importantly, the most widely used authority, the EPSG database,
>> >> > > > comes with licensing terms that are not particularly open-source
>> >> > > > friendly:
>> >> > > > https://epsg.org/terms-of-use.html
>> >> > > >
>> >> > > > As a result, the community has leaned toward requiring data
>> >> > > > writers to use a fully self-contained CRS representation such as
>> >> > > > PROJJSON. In that model, a reference like OGC:CRS84 is understood
>> >> > > > to map directly to its corresponding PROJJSON definition, as
>> >> > > > outlined in the GeoParquet specification:
>> >> > > > https://github.com/opengeospatial/geoparquet/blob/main/format-specs/geoparquet.md#ogccrs84-details
>> >> > > >
>> >> > > > That said, this expectation is not clearly spelled out in the
>> >> > > > Parquet and Iceberg specifications, which leaves some ambiguity
>> >> > > > in practice.
>> >> > > >
>> >> > > > I don’t have a strong stance either way. In fact, I can see the
>> >> > > > case for allowing authority:identifier.
>> >> > > > But it’s worth noting that introducing it now would likely
>> >> > > > reopen a fairly contentious discussion in the community.
>> >> > > >
>> >> > > > Jia
>> >> > > >
>> >> > > > On Tue, Mar 24, 2026 at 10:09 AM Milan Stefanovic
>> >> > > > <[email protected]> wrote:
>> >> > > > >
>> >> > > > > Hi everyone,
>> >> > > > >
>> >> > > > > I’m looking for some clarification (and potentially a small
>> >> > > > > spec update) regarding the Geospatial Physical Types
>> >> > > > > documentation -
>> >> > > > > https://parquet.apache.org/docs/file-format/types/geospatial/,
>> >> > > > > specifically the CRS Customization section.
>> >> > > > >
>> >> > > > > 1) The Confusion
>> >> > > > >
>> >> > > > > Currently, the spec states that custom CRS values should follow
>> >> > > > > the `type:identifier` format, where type is either `srid` or
>> >> > > > > `projjson` (e.g., `srid:4326` or `projjson:property_name`). The
>> >> > > > > spec also defines the default CRS as `OGC:CRS84`.
>> >> > > > >
>> >> > > > > Depending on how the specification is read, a reader may
>> >> > > > > conclude that the only valid CRS definitions are strings of the
>> >> > > > > form `srid:<some number>` or `projjson:<property name>`, which
>> >> > > > > implies that `OGC:CRS84` does not adhere to the rules defined
>> >> > > > > in the customization section. This creates confusion for
>> >> > > > > implementers: should the type string always be parsed as a
>> >> > > > > strict "custom" format which necessitates the srid: prefix?
>> >> > > > >
>> >> > > > > 2) The Suggestion
>> >> > > > >
>> >> > > > > I suggest we update the language to be explicit about allowed
>> >> > > > > formats for CRS, and my suggestion is that we break it down
>> >> > > > > like this:
>> >> > > > >    - Standard CRS: Any string from a known authority in a
>> >> > > > >      format of `<authority>:<identifier>` (e.g., `EPSG:4326`,
>> >> > > > >      `OGC:CRS84`, `ESRI:102100`) is accepted.
>> >> > > > >    - Custom CRS: in the format of `type:identifier`
>> >> > > > >        - `srid:1234`: The definition resides in a
>> >> > > > >          local/database spatial reference table.
>> >> > > > >        - `projjson:key`: The definition is stored in Parquet
>> >> > > > >          file/table metadata.
>> >> > > > >
>> >> > > > > This would validate `OGC:CRS84` as a first-class string while
>> >> > > > > providing a clear "escape hatch" for custom definitions.
>> >> > > > >
>> >> > > > > What are your thoughts?
>> >> > > > >
>> >> > > > > Kind regards,
>> >> > > > > Milan
>> >> > >
>> >>
>> >
>>
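
For readers following the thread, here is a minimal Python sketch of how a reader might tell apart the CRS string forms discussed above: an omitted CRS, an inline PROJJSON string, `srid:<number>`, `projjson:<key>`, and `<authority>:<identifier>`. The function name `classify_crs`, the category labels, and the regular expression are assumptions made for illustration only; neither the spec nor the thread defines an exact grammar. Treating an omitted CRS as `OGC:CRS84` follows the default mentioned in the thread.

```python
import json
import re

# Hypothetical pattern for <authority>:<identifier>; the thread does not
# define an exact grammar, so this is only an illustrative guess.
AUTHORITY_CODE = re.compile(r"^[A-Za-z][A-Za-z0-9_]*:[A-Za-z0-9_.\-]+$")


def classify_crs(crs):
    """Return a (kind, detail) pair for a CRS field value.

    The kind labels are made up for this sketch.
    """
    # An omitted or empty CRS falls back to the default named in the thread.
    if crs is None or crs.strip() == "":
        return ("default", "OGC:CRS84")

    text = crs.strip()

    # Inline PROJJSON: a JSON object written directly into the CRS field.
    if text.startswith("{"):
        try:
            return ("projjson_inline", json.loads(text))
        except json.JSONDecodeError:
            return ("unknown", text)

    # The two "custom" forms currently listed in the spec. They are checked
    # before the generic authority pattern because they share its shape.
    if text.startswith("srid:"):
        return ("srid", text[len("srid:"):])
    if text.startswith("projjson:"):
        return ("projjson_key", text[len("projjson:"):])

    # Strings such as EPSG:4326, OGC:CRS84, or ESRI:102100.
    if AUTHORITY_CODE.match(text):
        authority, _, code = text.partition(":")
        return ("authority_code", (authority, code))

    return ("unknown", text)


if __name__ == "__main__":
    for value in [None, "OGC:CRS84", "EPSG:4326", "srid:1234", "projjson:key"]:
        print(repr(value), "->", classify_crs(value))
```

The prefix checks are case-sensitive here, which is also an assumption. A real reader would need to decide how to treat unrecognized strings; the sketch only labels them rather than rejecting them.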
