Hello all,

To make the discussion a bit more concrete and focussed, I've pushed a
draft PR to the Parquet format to add a logical EXTENSION type:
https://github.com/apache/parquet-format/pull/451

Please feel free to comment.

Regards

Antoine.


On Tue, 28 May 2024 16:45:12 +0200
Antoine Pitrou <[email protected]> wrote:
> Hello,
>
> (NOTE: this comes in the context of
> https://github.com/apache/parquet-format/pull/240 --
> "PARQUET-2471: Add geometry logical type")
>
> I'd like to launch a discussion about the possible addition of
> extension types in Parquet.
>
> Extension types are a concept borrowed from the Arrow type system [1].
> They provide a standard way of conveying more precise information about
> the intended type and usage of a given column, without requiring the
> metadata format to have a dedicated serialization for that type.
>
> In Arrow, extension types are typically conveyed through two
> string/binary parameters: 1) the extension type name; 2) the
> type-specific serialization. The extension type name unambiguously
> designates the abstract extension type (such as "Tensor"); the
> serialization optionally encodes the extension type's parameters, if
> it has any (such as the dimensionality for a "Tensor" type).
>
> Initially, Arrow extension types tended to be ad hoc and
> application-specific, but there is a growing trend to standardize
> "canonical extension types" to allow for better data interoperability
> across widely-used data types [2].
>
> From my experience as an Arrow PMC member, if Arrow didn't have
> extension types, the barrier to propose and standardize new data types
> would be much higher, especially for complex proposals such as the
> fixed-shape and variable-shape tensor types.
>
>
> For Parquet, extension types would be an alternative to enshrining
> additional logical types in the Thrift specification. I can see several
> advantages to extension types over additional logical types:
>
> 1) extension types would make it easier to experiment in dedicated
> communities, trying to find out the best possible representation for
> some kinds of data (example: the Geoparquet work)
>
> 2) extension types would allow "soft standardization": an extension type
> could first be formally defined by a dedicated community, then
> optionally find an official place under the Parquet project.
>
> 3) extension types would allow defining complex data representations
> and semantics without imposing a large burden on the developers of
> Parquet implementations, who may not be competent in the target domain.
> This includes non-trivial statistics such as bounding boxes for
> geospatial data.
>
>
> Technically, I can imagine two possible ways of adding extension types
> to the Parquet format:
>
> 1) as an additional logical type;
> 2) as a separate type determination, in addition to the logical type.
>
> We should also ensure it is possible to express extension-specific
> statistics (such as bounding boxes for geospatial data).
>
> What do you think?
>
> Regards
>
> Antoine.
>
>
> [1] https://arrow.apache.org/docs/format/Columnar.html#extension-types
>
> [2] https://arrow.apache.org/docs/dev/format/CanonicalExtensions.html
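
For illustration only, the two-parameter scheme described in the quoted
message (an extension type name plus a type-specific serialization) could
be sketched roughly as below. This is not taken from the draft PR. The two
metadata keys are the ones documented for Arrow in [1]; everything else
(the helper names tag_column and read_column, the "example.tensor" type
name, the JSON encoding of its parameters, and the storage type string) is
hypothetical and only there to make the idea concrete.

```python
import json

# Field-metadata keys documented for Arrow extension types (see [1]).
EXT_NAME_KEY = "ARROW:extension:name"
EXT_META_KEY = "ARROW:extension:metadata"


def tag_column(storage_type, ext_name, params=None):
    """Describe a column as storage type + extension name + serialized params.

    Hypothetical helper: JSON is just one possible type-specific serialization.
    """
    serialized = json.dumps(params or {}, sort_keys=True)
    return {
        "storage_type": storage_type,
        "metadata": {EXT_NAME_KEY: ext_name, EXT_META_KEY: serialized},
    }


def read_column(column, known_extensions):
    """Return (type name, params), or (storage type, None) if the name is unknown."""
    meta = column["metadata"]
    name = meta.get(EXT_NAME_KEY)
    if name in known_extensions:
        return name, json.loads(meta[EXT_META_KEY])
    # A reader that does not know the extension still sees valid storage data.
    return column["storage_type"], None


# Hypothetical "Tensor"-like type whose only parameter is its shape.
tensor = tag_column("fixed_size_list<float>[6]", "example.tensor", {"shape": [2, 3]})

print(read_column(tensor, {"example.tensor"}))  # reader that knows the type
print(read_column(tensor, set()))               # reader that does not
```

The fallback branch is what keeps the burden on implementations low: a
reader with no knowledge of the extension can ignore the two strings and
still handle the column through its storage type.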
