I like the idea of an EXTENSION logical type (Antoine's option 1).
Perhaps the stats ordering could be left as an implementation
detail: implementations that understand the new type will
implicitly know the proper ordering. Once the type graduates to full
logical type status, the ColumnOrdering could be updated if necessary
for the new type. Implementations that don't know the type will ignore
the statistics.
Ed
On 5/28/24 7:58 AM, Antoine Pitrou wrote:
Hi Gabor,
Perhaps we can sidestep this problem by having a separate "extension
statistics" field that does not mandate total ordering?
Regards
Antoine.
On Tue, 28 May 2024 16:54:49 +0200
Gábor Szádovszky <[email protected]> wrote:
Hi Antoine,
One quick note about this. Parquet min/max statistics need a total ordering
for each logical type. Without that we either use some default based on the
primitive type (that might not be suitable for the related extension type)
or we won't store min/max statistics for the related values. That means no
min/max stats for the row groups or the page indexes.
So, I guess, we would need a way to define total ordering for an extension
type. Does not sound like an easy topic.
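
To make the concern concrete, here is a minimal Python sketch of how a fallback ordering based on the primitive type can mislead a reader. The storage choice (ASCII decimal bytes) and the helper name `may_contain_greater_than` are assumptions made up for this illustration; they are not part of the Parquet format or of any implementation.

```python
# Semantic values of a hypothetical extension type.
values = [2, 10]
# Assumed storage for the example: ASCII decimal bytes in a BYTE_ARRAY-like column.
stored = [str(v).encode("ascii") for v in values]

# The writer falls back to the primitive (byte-wise) ordering for min/max.
byte_min, byte_max = min(stored), max(stored)  # b"10" sorts before b"2"


def may_contain_greater_than(stat_max, threshold):
    # The reader interprets the stored max with the type's numeric ordering.
    return int(stat_max.decode("ascii")) > threshold


# Predicate "value > 5": the reader trusts max == b"2" and skips the row group,
# even though the value 10 is actually present.
skipped = not may_contain_greater_than(byte_max, 5)
print(byte_min, byte_max, skipped)
```

The row group is skipped although it holds a matching value, which is why statistics are only usable when writer and reader agree on one total ordering.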
Cheers,
Gabor
On Tue, May 28, 2024 at 16:45, Antoine Pitrou <[email protected]> wrote:
Hello,
(NOTE: this comes in the context of
https://github.com/apache/parquet-format/pull/240 --
"PARQUET-2471: Add geometry logical type")
I'd like to launch a discussion about the possible addition of
extension types in Parquet.
Extension types are a concept borrowed from the Arrow type system [1].
They provide a standard way of conveying more precise information about
the intended type and usage of a given column, without requiring the
metadata format to have a dedicated serialization for that type.
In Arrow, extension types are typically conveyed through two
string/binary parameters: 1) the extension type name; 2) the
type-specific serialization. The extension type name unambiguously
designates the abstract extension type (such as "Tensor"); the
serialization optionally encodes the extension type's parameters, if
it has any (such as the dimensionality for a "Tensor" type).
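
As a rough illustration of the name-plus-serialization pair described above, here is a minimal Python sketch. The `ExtensionTypeInfo` class, the "example.tensor" name, and the JSON encoding of the parameters are all assumptions made for this example; they are not Arrow or Parquet APIs.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtensionTypeInfo:
    """Hypothetical holder for the two parameters describing an extension type."""
    name: str          # unambiguous extension type name
    serialized: bytes  # type-specific serialization of the parameters (may be empty)


def make_tensor_type(shape):
    # Assumption: parameters are encoded as JSON; a real type defines its own format.
    payload = json.dumps({"shape": list(shape)}).encode("utf-8")
    return ExtensionTypeInfo(name="example.tensor", serialized=payload)


def tensor_shape(info):
    # A reader that recognizes the name decodes the parameters;
    # one that does not can fall back to the underlying storage type.
    if info.name != "example.tensor":
        return None
    return tuple(json.loads(info.serialized.decode("utf-8"))["shape"])


info = make_tensor_type((2, 3))
print(info.name, tensor_shape(info))
```

A reader that does not recognize the name simply gets `None` here, which mirrors the idea that unknown extension types degrade to their storage type.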
Initially, Arrow extension types tended to be ad hoc and
application-specific, but there is a growing trend to standardize
"canonical extension types" to allow for better data interoperability
across widely-used data types [2].
From my experience as an Arrow PMC member, if Arrow didn't have
extension types, the barrier to propose and standardize new data types
would be much higher, especially for complex proposals such as the
fixed-shape and variable-shape tensor types.
For Parquet, extension types would be an alternative to enshrining
additional logical types in the Thrift specification. I can see several
advantages to extension types over additional logical types:
1) extension types would make it easier to experiment in dedicated
communities, trying to find out the best possible representation for
some kinds of data (example: the GeoParquet work).
2) extension types would allow "soft standardization": an extension type
could first be formally defined by a dedicated community, then
optionally find an official place under the Parquet project.
3) extension types would allow defining complex data representations
and semantics without imposing a large burden on the developers of
Parquet implementations, who may not be competent in the target domain.
This includes non-trivial statistics such as bounding boxes for
geospatial data.
Technically, I can imagine two possible ways of adding extension types
to the Parquet format:
1) as an additional logical type;
2) as a separate type determination, in addition to the logical type.
We should also ensure it is possible to express extension-specific
statistics (such as bounding boxes for geospatial data).
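
To sketch what an extension-specific statistic might look like, the Python example below computes a bounding box per row group and uses it to decide which row groups a spatial query has to read. The `BoundingBox` class, `bbox_of` helper, and the sample data are hypothetical and only illustrate the idea; nothing here reflects an existing Parquet structure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    """Hypothetical extension-specific statistic for a chunk of 2D points."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def intersects(self, other):
        # Two boxes overlap unless one lies entirely to one side of the other.
        return not (self.xmax < other.xmin or other.xmax < self.xmin
                    or self.ymax < other.ymin or other.ymax < self.ymin)


def bbox_of(points):
    # Compute the statistic a writer could store alongside a row group.
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return BoundingBox(min(xs), min(ys), max(xs), max(ys))


row_groups = [
    [(0.0, 0.0), (1.0, 2.0)],
    [(10.0, 10.0), (12.0, 11.0)],
]
stats = [bbox_of(points) for points in row_groups]
query = BoundingBox(9.0, 9.0, 11.0, 11.0)

# Only row groups whose box overlaps the query need to be read.
to_read = [i for i, box in enumerate(stats) if box.intersects(query)]
print(to_read)
```

Note that this kind of pruning relies on box intersection rather than on a total ordering of the values.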
What do you think?
Regards
Antoine.
[1] https://arrow.apache.org/docs/format/Columnar.html#extension-types
[2] https://arrow.apache.org/docs/dev/format/CanonicalExtensions.html