Hi Gabor,

Perhaps we can sidestep this problem by having a separate "extension
statistics" field that does not mandate total ordering?

Regards

Antoine.


On Tue, 28 May 2024 16:54:49 +0200
Gábor Szádovszky <[email protected]> wrote:
> Hi Antoine,
>
> One quick note about this. Parquet min/max statistics need a total ordering
> for each logical type. Without that we either use some default based on the
> primitive type (that might not be suitable for the related extension type)
> or we won't store min/max statistics for the related values. That means no
> min/max stats for the row groups or the page indices.
> So, I guess, we would need a way to define total ordering for an extension
> type. Does not sound like an easy topic.
>
> Cheers,
> Gabor
>
> Antoine Pitrou <[email protected]> wrote (on Tue, 28 May 2024,
> 16:45):
>
> >
> > Hello,
> >
> > (NOTE: this comes in the context of
> > https://github.com/apache/parquet-format/pull/240 --
> > "PARQUET-2471: Add geometry logical type")
> >
> > I'd like to launch a discussion about the possible addition of
> > extension types in Parquet.
> >
> > Extension types are a concept borrowed from the Arrow type system [1].
> > They provide a standard way of conveying more precise information about
> > the intended type and usage of a given column, without requiring the
> > metadata format to have a dedicated serialization for that type.
> >
> > In Arrow, extension types are typically conveyed through two
> > string/binary parameters: 1) the extension type name; 2) the
> > type-specific serialization. The extension type name unambiguously
> > designates the abstract extension type (such as "Tensor"); the
> > serialization optionally encodes the extension type's parameters, if
> > it has any (such as the dimensionality for a "Tensor" type).
> >
> > Initially, Arrow extension types tended to be ad hoc and
> > application-specific, but there is a growing trend to standardize
> > "canonical extension types" to allow for better data interoperability
> > across widely-used data types [2].
> >
> > From my experience as an Arrow PMC member, if Arrow didn't have
> > extension types, the barrier to proposing and standardizing new data
> > types would be much higher, especially for complex proposals such as the
> > fixed-shape and variable-shape tensor types.
> >
> >
> > For Parquet, extension types would be an alternative to enshrining
> > additional logical types in the Thrift specification. I can see several
> > advantages to extension types over additional logical types:
> >
> > 1) extension types would make it easier to experiment in dedicated
> > communities, trying to find out the best possible representation for
> > some kinds of data (example: the Geoparquet work)
> >
> > 2) extension types would allow "soft standardization": an extension type
> > could first be formally defined by a dedicated community, then
> > optionally find an official place under the Parquet project.
> >
> > 3) extension types would allow defining complex data representations
> > and semantics without imposing a large burden on the developers of
> > Parquet implementations, who may not be competent in the target domain.
> > This includes non-trivial statistics such as bounding boxes for
> > geospatial data.
> >
> >
> > Technically, I can imagine two possible ways of adding extension types
> > to the Parquet format:
> >
> > 1) as an additional logical type;
> > 2) as a separate type determination, in addition to the logical type.
> >
> > We should also ensure it is possible to express extension-specific
> > statistics (such as bounding boxes for geospatial data).
> >
> > What do you think?
> >
> > Regards
> >
> > Antoine.
> >
> >
> > [1] https://arrow.apache.org/docs/format/Columnar.html#extension-types
> >
> > [2] https://arrow.apache.org/docs/dev/format/CanonicalExtensions.html
> >
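
To make the "extension statistics" idea from the top of this message a bit more concrete, here is a minimal Python sketch. Every name in it (ExtensionType, BoundingBox, merge, may_intersect, the "example.geometry" type name and its JSON parameters) is hypothetical and is not part of the Parquet or Arrow specifications. It only illustrates that a statistic such as a bounding box can be rolled up from pages to a row group, and used to skip data, without any total ordering on the values.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class ExtensionType:
    """Hypothetical descriptor: a type name plus an opaque parameter serialization."""
    name: str
    serialization: bytes = b""


@dataclass(frozen=True)
class BoundingBox:
    """Hypothetical extension statistic for 2D geospatial values."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def merge(self, other: "BoundingBox") -> "BoundingBox":
        # Union of two boxes: how page-level stats could roll up to a row group.
        return BoundingBox(
            min(self.xmin, other.xmin),
            min(self.ymin, other.ymin),
            max(self.xmax, other.xmax),
            max(self.ymax, other.ymax),
        )

    def may_intersect(self, query: "BoundingBox") -> bool:
        # Pruning test: False means the page or row group can be skipped.
        return not (
            self.xmax < query.xmin or query.xmax < self.xmin
            or self.ymax < query.ymin or query.ymax < self.ymin
        )


# Assumed example values, chosen only for illustration.
geometry = ExtensionType("example.geometry", b'{"crs": "EPSG:4326"}')
page_stats = [BoundingBox(0, 0, 2, 2), BoundingBox(1, -1, 5, 1)]
row_group_stats = reduce(BoundingBox.merge, page_stats)
```

The point of the design is that merge and may_intersect are the only operations a reader needs; neither requires comparing two geometry values with each other, which is what min/max statistics would demand.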
