Hello,

(NOTE: this comes in the context of https://github.com/apache/parquet-format/pull/240 -- "PARQUET-2471: Add geometry logical type")

I'd like to launch a discussion about the possible addition of extension types in Parquet.

Extension types are a concept borrowed from the Arrow type system [1]. They provide a standard way of conveying more precise information about the intended type and usage of a given column, without requiring the metadata format to have a dedicated serialization for that type.

In Arrow, extension types are typically conveyed through two string/binary parameters: 1) the extension type name; 2) the type-specific serialization. The extension type name unambiguously designates the abstract extension type (such as "Tensor"); the serialization optionally encodes the extension type's parameters, if it has any (such as the dimensionality for a "Tensor" type).

Initially, Arrow extension types tended to be ad hoc and application-specific, but there is a growing trend to standardize "canonical extension types" to allow for better data interoperability across widely used data types [2].

From my experience as an Arrow PMC member, if Arrow didn't have extension types, the barrier to proposing and standardizing new data types would be much higher, especially for complex proposals such as the fixed-shape and variable-shape tensor types.

For Parquet, extension types would be an alternative to enshrining additional logical types in the Thrift specification. I can see several advantages to extension types over additional logical types:

1) extension types would make it easier to experiment in dedicated communities, trying to find out the best possible representation for some kinds of data (example: the GeoParquet work)
2) extension types would allow "soft standardization": an extension type could first be formally defined by a dedicated community, then optionally find an official place under the Parquet project.
3) extension types would allow defining complex data representations and semantics without imposing a large burden on the developers of Parquet implementations, who may not be competent in the target domain. This includes non-trivial statistics such as bounding boxes for geospatial data.

Technically, I can imagine two possible ways of adding extension types to the Parquet format: 1) as an additional logical type; 2) as a separate type determination, in addition to the logical type.

We should also ensure it is possible to express extension-specific statistics (such as bounding boxes for geospatial data).

What do you think?

Regards

Antoine.

[1] https://arrow.apache.org/docs/format/Columnar.html#extension-types
[2] https://arrow.apache.org/docs/dev/format/CanonicalExtensions.html