Seems like there is general consensus on
https://github.com/apache/parquet-format/pull/560

I'll aim to merge tomorrow unless there is more feedback.

On Sun, Apr 12, 2026 at 4:51 PM Micah Kornfield <[email protected]>
wrote:

> I took a quick look and triggered CI.  I had one suggestion but otherwise
> from my limited knowledge it looked reasonable.
>
> On Tue, Apr 7, 2026 at 8:58 AM Milan Stefanovic <
> [email protected]> wrote:
>
>> Thanks everyone for the review.
>>
>> Can some committers review as well, so we can finalize this?
>>
>> Thanks,
>> Milan
>>
>> On Wed, 1 Apr 2026 at 14:26, Milan Stefanovic <
>> [email protected]>
>> wrote:
>>
>> > Thanks for the explanation Dewey!
>> >
>> > I've opened PR:
>> > https://github.com/apache/parquet-format/pull/560
>> >
>> > Let me know what you think!
>> >
>> > P.S. - If you know any relevant parties in the geo community, let's
>> > involve them explicitly as well.
>> >
>> > cc: @Jia Yu <[email protected]>
>> >
>> > Thanks,
>> > Milan
>> >
>> > On Sat, 28 Mar 2026 at 03:32, Dewey Dunnington <
>> [email protected]>
>> > wrote:
>> >
>> >> Hi Milan,
>> >>
>> >> > Dewey, you mentioned current writers using inline strings - what
>> >> > are they inlining? Are they inlining PROJJSONs or
>> >> > authority:identifiers?
>> >>
>> >> The writers are writing the CRS representation they receive, which for
>> >> Arrow C++ and arrow-rs  comes from the geoarrow.wkb extension type
>> >> metadata [1]. This is usually PROJJSON but is permitted to be a string
>> >> (including authority:code). If you run
>> >> pyarrow.parquet.write_table(pyarrow.table(geopandas_geof.to_arrow()))
>> >> today, you will get a Parquet file with an inlined PROJJSON CRS,
>> >> because that is how GeoPandas encodes CRSes when converting to Arrow.
>> >>
>> >> > reality of current implementations is such
>> >> > that most implementations do write `authority:identifier`, spec
>> >> > should be written so that at least it doesn't look like that's
>> >> > invalid.
>> >>
>> >> The reality of current implementations is that they are writing
>> >> PROJJSON, although I would also happily support a rewording that adds
>> >> authority:code to the recommended options list.
>> >>
>> >> > Aren't EPSG:<number> codes also understood to map directly to a
>> >> > corresponding PROJJSON definition?
>> >>
>> >> They can be mapped to a PROJJSON definition (or a number of other less
>> >> friendly export formats) using a database with the licensing ambiguity
>> >> Jia mentioned. Conversely, PROJJSON can be mapped to authority:code
>> >> with some minimal JSON parsing (we do this in Arrow C++ and arrow-rs
>> >> to canonically remove CRS definitions that correspond to lon/lat to
>> >> produce more universally consumable Parquet files).
>> >>
>> >> Cheers,
>> >>
>> >> -dewey
>> >>
>> >> [1] https://geoarrow.org/extension-types.html#extension-metadata
>> >>
>> >> On Fri, Mar 27, 2026 at 3:52 PM Milan Stefanovic
>> >> <[email protected]> wrote:
>> >> >
>> >> >  Thanks Jia and Dewey,
>> >> >
>> >> > Dewey, you mentioned current writers using inline strings - what
>> >> > are they inlining? Are they inlining PROJJSONs or
>> >> > authority:identifiers?
>> >> > Given that current implementations avoided using srid:<number> and
>> >> > projjson:<field_ref>, perhaps we should remove these examples from
>> >> > the spec, as they seem to cause some confusion.
>> >> >
>> >> > @Jia Yu <[email protected]>, you mentioned that OGC:CRS84 is
>> >> > understood to map directly to its corresponding PROJJSON definition.
>> >> > Aren't EPSG:<number> codes also understood to map directly to a
>> >> > corresponding PROJJSON definition?
>> >> >
>> >> > Also I'm fine with not being explicit about `authority:identifier`
>> >> > if that was the prior consensus, but if the reality of current
>> >> > implementations is such that most implementations do write
>> >> > `authority:identifier`, the spec should be written so that at least
>> >> > it doesn't look like that's invalid.
>> >> >
>> >> > What are your thoughts?
>> >> >
>> >> > Milan
>> >> >
>> >> > On Wed, 25 Mar 2026 at 15:53, Dewey Dunnington <
>> >> [email protected]>
>> >> > wrote:
>> >> >
>> >> > > Hi Milan,
>> >> > >
>> >> > > A short answer is that the current language of the spec does not
>> >> > > forbid writing "OGC:CRS84" to the CRS field (which is "just a
>> >> > > string" as far as Thrift is concerned). All existing readers that
>> >> > > I know about (DuckDB, arrow-rs, Arrow C++, GDAL) will accept that
>> >> > > string and interpret it unambiguously on read (for example,
>> >> > > `GeoPandas.from_arrow(pyarrow.parquet.read_table(...))` works).
>> >> > > There is also an example file in parquet-testing that covers this
>> >> > > case (arbitrary string that is neither of the recommended options)
>> >> > > [1]. I put together a small example script to demonstrate the read
>> >> > > path for the tools I mentioned [2].
>> >> > >
>> >> > > Jia is correct that the GeoParquet community will require writing
>> >> > > an inline PROJJSON string in the forthcoming 2.0 version of the
>> >> > > specification [3]. This was a pragmatic decision that reflects the
>> >> > > needs of existing GeoParquet users because:
>> >> > >
>> >> > > - srid does not explicitly name the EPSG database, so any code
>> >> > > written there does not have an unambiguous interpretation (even if
>> >> > > it did, it would place ambiguous licensing and/or dependency
>> >> > > requirements on consumers)
>> >> > > - projjson:some_field was not pragmatic to implement on the write
>> >> > > side for either of the implementations I was involved in (C++ and
>> >> > > Rust). Implementations just don't expose the global key/value
>> >> > > metadata when converting types, and doing so would have required
>> >> > > breaking changes in the APIs. There are also ambiguities with
>> >> > > respect to existing propagation of schema metadata (i.e., the
>> >> > > projjson schema key is often propagated in unexpected ways into
>> >> > > pyarrow and beyond, including being written into the key/value
>> >> > > metadata of a resulting Parquet file).
>> >> > >
>> >> > > As a result, most of the tools that can write GEOMETRY and
>> >> > > GEOGRAPHY (Arrow C++, GDAL, arrow-rs) are currently writing inline
>> >> > > strings (because inline strings are what is available in the
>> >> > > representation passed to Arrow-based writers and this was better
>> >> > > than omitting CRS information). For all the implementations I was
>> >> > > involved in, we also try to explicitly omit the CRS when we detect
>> >> > > that the string we were passed is lon/lat (i.e., if they see
>> >> > > "OGC:CRS84", they omit the CRS to minimize the need for consumers
>> >> > > to be CRS aware).
>> >> > >
>> >> > > I'll echo Jia's comment that none of us are keen to reopen a CRS
>> >> > > discussion but I also agree that the current language of the spec
>> >> > > is vague and doesn't reflect the reality of the ecosystem as it has
>> >> > > evolved. I'm happy to review any PRs to improve the language or
>> >> > > implementations :)
>> >> > >
>> >> > > Cheers,
>> >> > >
>> >> > > -dewey
>> >> > >
>> >> > > [1] https://github.com/apache/parquet-testing/tree/master/data/geospatial#geospatial-test-files
>> >> > > [2] https://gist.github.com/paleolimbot/7759e58bf1f98ecf8f2c459367bbdeda
>> >> > > [3] https://github.com/opengeospatial/geoparquet/blob/main/format-specs/geoparquet.md#crs-parquet-property
>> >> > >
>> >> > > On Wed, Mar 25, 2026 at 12:49 AM Jia Yu <[email protected]> wrote:
>> >> > > >
>> >> > > > Hi Milan,
>> >> > > >
>> >> > > > The authority:identifier pattern was explicitly rejected in prior
>> >> > > > community discussions. The core concern is that it forces query
>> >> > > > engines to rely on external registries to resolve CRS
>> >> > > > definitions, which breaks the goal of self-contained data. More
>> >> > > > importantly, the most widely used authority, the EPSG database,
>> >> > > > comes with licensing terms that are not particularly open-source
>> >> > > > friendly:
>> >> > > > https://epsg.org/terms-of-use.html
>> >> > > >
>> >> > > > As a result, the community has leaned toward requiring data
>> >> > > > writers to use a fully self-contained CRS representation such as
>> >> > > > PROJJSON. In that model, a reference like OGC:CRS84 is understood
>> >> > > > to map directly to its corresponding PROJJSON definition, as
>> >> > > > outlined in the GeoParquet specification:
>> >> > > > https://github.com/opengeospatial/geoparquet/blob/main/format-specs/geoparquet.md#ogccrs84-details
>> >> > > >
>> >> > > > That said, this expectation is not clearly spelled out in the
>> >> > > > Parquet and Iceberg specifications, which leaves some ambiguity
>> >> > > > in practice.
>> >> > > >
>> >> > > > I don’t have a strong stance either way. In fact, I can see the
>> >> > > > case for allowing authority:identifier. But it’s worth noting
>> >> > > > that introducing it now would likely reopen a fairly contentious
>> >> > > > discussion in the community.
>> >> > > >
>> >> > > > Jia
>> >> > > >
>> >> > > > On Tue, Mar 24, 2026 at 10:09 AM Milan Stefanovic
>> >> > > > <[email protected]> wrote:
>> >> > > > >
>> >> > > > > Hi everyone,
>> >> > > > >
>> >> > > > > I’m looking for some clarification (and potentially a small
>> >> > > > > spec update) regarding the Geospatial Physical Types
>> >> > > > > documentation -
>> >> > > > > https://parquet.apache.org/docs/file-format/types/geospatial/,
>> >> > > > > specifically the CRS Customization section.
>> >> > > > >
>> >> > > > > 1) The Confusion
>> >> > > > >
>> >> > > > > Currently, the spec states that custom CRS values should follow
>> >> > > > > the `type:identifier` format, where type is either `srid` or
>> >> > > > > `projjson` (e.g., `srid:4326` or `projjson:property_name`). The
>> >> > > > > spec also defines the default CRS as `OGC:CRS84`.
>> >> > > > >
>> >> > > > > Depending on how the specification is read, the reader may
>> >> > > > > consider only strings of the form `srid:<some number>` or
>> >> > > > > `projjson:<property name>` to be valid CRS definitions, which
>> >> > > > > implies that `OGC:CRS84` does not adhere to the rules defined
>> >> > > > > in the customization section. This creates confusion for
>> >> > > > > implementers: should the type string always be parsed as a
>> >> > > > > strict "custom" format which necessitates the srid: prefix?
>> >> > > > >
>> >> > > > > 2) The Suggestion
>> >> > > > >
>> >> > > > > I suggest we update the language to be explicit about allowed
>> >> > > > > formats for CRS, and my suggestion is that we break it down
>> >> > > > > like this:
>> >> > > > >    - Standard CRS: Any string from a known authority in the
>> >> > > > >      format `<authority>:<identifier>` (e.g., `EPSG:4326`,
>> >> > > > >      `OGC:CRS84`, `ESRI:102100`) is accepted.
>> >> > > > >    - Custom CRS: in the format of `type:identifier`
>> >> > > > >          - `srid:1234`: The definition resides in a
>> >> > > > >            local/database spatial reference table.
>> >> > > > >          - `projjson:key`: The definition is stored in Parquet
>> >> > > > >            file/table metadata.
>> >> > > > >
>> >> > > > > This would validate `OGC:CRS84` as a first-class string while
>> >> > > > > providing a clear "escape hatch" for custom definitions.
>> >> > > > >
>> >> > > > > What are your thoughts?
>> >> > > > >
>> >> > > > > Kind regards,
>> >> > > > > Milan
>> >> > >
>> >>
>> >
>>
>
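
For anyone skimming the thread, here is a rough sketch of how a reader might tell apart the CRS string forms discussed above (omitted CRS, inline PROJJSON, `srid:<number>`, `projjson:<key>`, and `authority:code`), including the "minimal JSON parsing" step that maps PROJJSON back to authority:code via its top-level `id` member. The function names `classify_crs` and `is_default_lonlat` and the category labels are hypothetical, made up for illustration; they do not come from the spec, the PR, or the Arrow C++ / arrow-rs code, and the rules are only one plausible reading of the discussion.

```python
import json
import re

# Hypothetical helper names; the categories mirror the forms discussed in
# the thread and are not defined by the Parquet spec itself.
_PREFIXED = re.compile(r"([A-Za-z][A-Za-z0-9_]*):(.+)")


def classify_crs(crs):
    """Return a (kind, value) pair for a GEOMETRY/GEOGRAPHY crs string."""
    if crs is None or crs.strip() == "":
        # An omitted CRS means the default, OGC:CRS84.
        return ("default", "OGC:CRS84")
    text = crs.strip()
    if text.startswith("{"):
        # Looks like inline PROJJSON; try to recover authority:code from
        # the top-level "id" member, if the writer included one.
        try:
            obj = json.loads(text)
        except json.JSONDecodeError:
            return ("unknown", text)
        ident = obj.get("id")
        if isinstance(ident, dict) and "authority" in ident and "code" in ident:
            return ("projjson", f"{ident['authority']}:{ident['code']}")
        return ("projjson", None)
    match = _PREFIXED.fullmatch(text)
    if match is None:
        return ("unknown", text)
    prefix, rest = match.groups()
    if prefix == "srid":
        return ("srid", rest)
    if prefix == "projjson":
        return ("projjson_ref", rest)
    # Anything else of the form <authority>:<identifier>, e.g. EPSG:4326.
    return ("authority_code", f"{prefix}:{rest}")


def is_default_lonlat(crs):
    """True when the CRS is omitted or identifies itself as OGC:CRS84."""
    kind, value = classify_crs(crs)
    return kind == "default" or (
        kind in ("authority_code", "projjson") and value == "OGC:CRS84"
    )
```

A writer following the behaviour Dewey describes could call something like `is_default_lonlat(crs)` and drop the CRS when it returns true. Only OGC:CRS84 is treated as lon/lat here; whether other codes should count is left open on purpose.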
