Re: [DISCUSS] Extension types in Parquet?

Antoine Pitrou Tue, 28 May 2024 07:58:48 -0700


Hi Gabor,


Perhaps we can sidestep this problem by having a separate "extension
statistics" field that does not mandate a total ordering?
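
To make the idea concrete, here is a minimal sketch of a statistic that
needs no total ordering: a 2D bounding box, as one might use for geospatial
data. The names (BoundingBox, from_points, merge, intersects) are
hypothetical and exist only for illustration; nothing here is part of the
Parquet format or of any existing implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    """Hypothetical extension statistic: envelope of the 2D points in a page."""

    xmin: float
    ymin: float
    xmax: float
    ymax: float

    @classmethod
    def from_points(cls, points):
        # Only per-axis comparisons of floats are used; the points
        # themselves are never ordered against each other.
        xs = [x for x, _ in points]
        ys = [y for _, y in points]
        return cls(min(xs), min(ys), max(xs), max(ys))

    def merge(self, other):
        # Page-level boxes combine into a row-group-level box by union.
        return BoundingBox(
            min(self.xmin, other.xmin),
            min(self.ymin, other.ymin),
            max(self.xmax, other.xmax),
            max(self.ymax, other.ymax),
        )

    def intersects(self, other):
        # A reader may skip a page when its box does not intersect the query.
        return (
            self.xmin <= other.xmax
            and other.xmin <= self.xmax
            and self.ymin <= other.ymax
            and other.ymin <= self.ymax
        )


page_a = BoundingBox.from_points([(0.0, 0.0), (2.0, 1.0)])
page_b = BoundingBox.from_points([(5.0, 5.0), (6.0, 8.0)])
row_group = page_a.merge(page_b)
query = BoundingBox(3.0, 3.0, 4.0, 4.0)
```

With these values the row-group box intersects the query, so the row group
cannot be skipped, but neither page box does, so both pages could be. The
only operations required of the statistic are "merge" and "may match", which
is a weaker requirement than a total order on the values.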

Regards

Antoine.


On Tue, 28 May 2024 16:54:49 +0200
Gábor Szádovszky <[email protected]> wrote:
> Hi Antoine,
> 
> One quick note about this. Parquet min/max statistics need a total ordering
> for each logical type. Without that we either use some default based on the
> primitive type (that might not be suitable for the related extension type)
> or we won't store min/max statistics for the related values. That means no
> min/max stats for row groups or page indices.
> So, I guess, we would need a way to define total ordering for an extension
> type. Does not sound like an easy topic.
> 
> Cheers,
> Gabor
> 
> Antoine Pitrou <[email protected]> wrote (on Tue, 28 May 2024,
> 16:45):
> 
> >
> > Hello,
> >
> > (NOTE: this comes in the context of
> > https://github.com/apache/parquet-format/pull/240 --
> > "PARQUET-2471: Add geometry logical type")
> >
> > I'd like to launch a discussion about the possible addition of
> > extension types in Parquet.
> >
> > Extension types are a concept borrowed from the Arrow type system [1].
> > They provide a standard way of conveying more precise information about
> > the intended type and usage of a given column, without requiring the
> > metadata format to have a dedicated serialization for that type.
> >
> > In Arrow, extension types are typically conveyed through two
> > string/binary parameters: 1) the extension type name; 2) the
> > type-specific serialization. The extension type name unambiguously
> > designates the abstract extension type (such as "Tensor"); the
> > serialization optionally encodes the extension type's parameters, if
> > it has any (such as the dimensionality for a "Tensor" type).
> >
> > Initially, Arrow extension types tended to be ad hoc and
> > application-specific, but there is a growing trend to standardize
> > "canonical extension types" to allow for better data interoperability
> > across widely-used data types [2].
> >
> > From my experience as an Arrow PMC member, if Arrow didn't have
> > extension types, the barrier to proposing and standardizing new data types
> > would be much higher, especially for complex proposals such as the
> > fixed-shape and variable-shape tensor types.
> >
> >
> > For Parquet, extension types would be an alternative to enshrining
> > additional logical types in the Thrift specification. I can see several
> > advantages to extension types over additional logical types:
> >
> > 1) extension types would make it easier to experiment in dedicated
> > communities, trying to find out the best possible representation for
> > some kinds of data (example: the Geoparquet work)
> >
> > 2) extension types would allow "soft standardization": an extension type
> > could first be formally defined by a dedicated community, then
> > optionally find an official place under the Parquet project.
> >
> > 3) extension types would allow defining complex data representations
> > and semantics without imposing a large burden on the developers of
> > Parquet implementations, who may not be competent in the target domain.
> > This includes non-trivial statistics such as bounding boxes for
> > geospatial data.
> >
> >
> > Technically, I can imagine two possible ways of adding extension types
> > to the Parquet format:
> >
> > 1) as an additional logical type;
> > 2) as a separate type determination, in addition to the logical type.
> >
> > We should also ensure it is possible to express extension-specific
> > statistics (such as bounding boxes for geospatial data).
> >
> > What do you think?
> >
> > Regards
> >
> > Antoine.
> >
> >
> > [1] https://arrow.apache.org/docs/format/Columnar.html#extension-types
> >
> > [2]
> > https://arrow.apache.org/docs/dev/format/CanonicalExtensions.html
> >
> >
> >
> >  
>
