Hello all,

To make the discussion a bit more concrete and focused, I've pushed a
draft PR to the Parquet format that adds a logical EXTENSION type:
https://github.com/apache/parquet-format/pull/451

Please feel free to comment.

Regards

Antoine.


On Tue, 28 May 2024 16:45:12 +0200
Antoine Pitrou <[email protected]> wrote:
> Hello,
> 
> (NOTE: this comes in the context of
> https://github.com/apache/parquet-format/pull/240 --
> "PARQUET-2471: Add geometry logical type")
> 
> I'd like to launch a discussion about the possible addition of
> extension types in Parquet.
> 
> Extension types are a concept borrowed from the Arrow type system [1].
> They provide a standard way of conveying more precise information about
> the intended type and usage of a given column, without requiring the
> metadata format to have a dedicated serialization for that type.
> 
> In Arrow, extension types are typically conveyed through two
> string/binary parameters: 1) the extension type name; 2) the
> type-specific serialization. The extension type name unambiguously
> designates the abstract extension type (such as "Tensor"); the
> serialization optionally encodes the extension type's parameters, if
> it has any (such as the dimensionality for a "Tensor" type).
> 
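As a rough illustration of the two-parameter scheme described above, here
is a minimal Python sketch using only the standard library. The two
metadata keys are the ones the Arrow columnar format reserves for
extension types; the type name "example.tensor", the "ndim" parameter,
the JSON encoding, and the helper functions annotate() and recognize()
are assumptions made up for this example, not part of any Arrow or
Parquet API.

```python
import json

# Field-metadata keys reserved by the Arrow columnar format for extension types.
EXT_NAME_KEY = "ARROW:extension:name"
EXT_META_KEY = "ARROW:extension:metadata"


def annotate(name, params):
    """Build the two string parameters describing an extension type.

    Hypothetical helper: JSON is just one possible type-specific
    serialization; each extension type is free to define its own.
    """
    return {
        EXT_NAME_KEY: name,
        EXT_META_KEY: json.dumps(params, sort_keys=True),
    }


def recognize(field_metadata, known_types):
    """Return (name, params) if the reader knows the extension, else None."""
    name = field_metadata.get(EXT_NAME_KEY)
    if name not in known_types:
        # Unknown extension: the reader falls back to the plain storage type.
        return None
    return name, json.loads(field_metadata.get(EXT_META_KEY, "{}"))


# "example.tensor" and "ndim" are placeholders, not a canonical extension.
meta = annotate("example.tensor", {"ndim": 3})
```

A reader that knows the name can decode the parameters, while one that
does not can ignore both strings and still read the underlying column.
That fallback is what keeps the burden on implementations low.
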
> Initially, Arrow extension types tended to be ad hoc and
> application-specific, but there is a growing trend to standardize
> "canonical extension types" to allow for better data interoperability
> across widely-used data types [2].
> 
> From my experience as an Arrow PMC member, if Arrow didn't have
> extension types, the barrier to proposing and standardizing new data
> types would be much higher, especially for complex proposals such as
> the fixed-shape and variable-shape tensor types.
> 
> 
> For Parquet, extension types would be an alternative to enshrining
> additional logical types in the Thrift specification. I can see several
> advantages to extension types over additional logical types:
> 
> 1) extension types would make it easier to experiment in dedicated
> communities, trying to find out the best possible representation for
> some kinds of data (example: the Geoparquet work)
> 
> 2) extension types would allow "soft standardization": an extension type
> could first be formally defined by a dedicated community, then
> optionally find an official place under the Parquet project.
> 
> 3) extension types would allow defining complex data representations
> and semantics without imposing a large burden on the developers of
> Parquet implementations, who may not be competent in the target domain.
> This includes non-trivial statistics such as bounding boxes for
> geospatial data.
> 
> 
> Technically, I can imagine two possible ways of adding extension types
> to the Parquet format:
> 
> 1) as an additional logical type;
> 2) as a separate type determination, in addition to the logical type.
> 
> We should also ensure it is possible to express extension-specific
> statistics (such as bounding boxes for geospatial data).
> 
> What do you think?
> 
> Regards
> 
> Antoine.
> 
> 
> [1] https://arrow.apache.org/docs/format/Columnar.html#extension-types
> 
> [2]
> https://arrow.apache.org/docs/dev/format/CanonicalExtensions.html
> 
> 
> 
> 


