Hello,

(NOTE: this comes in the context of
https://github.com/apache/parquet-format/pull/240 --
"PARQUET-2471: Add geometry logical type")

I'd like to launch a discussion about the possible addition of
extension types in Parquet.

Extension types are a concept borrowed from the Arrow type system [1].
They provide a standard way of conveying more precise information about
the intended type and usage of a given column, without requiring the
metadata format to have a dedicated serialization for that type.

In Arrow, extension types are typically conveyed through two
string/binary parameters: 1) the extension type name; 2) the
type-specific serialization. The extension type name unambiguously
designates the abstract extension type (such as "Tensor"); the
serialization optionally encodes the extension type's parameters, if
it has any (such as the dimensionality for a "Tensor" type).
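
To make the two parameters concrete, here is a minimal sketch in plain Python (not the Arrow library API). The metadata keys "ARROW:extension:name" and "ARROW:extension:metadata" are the ones Arrow uses for field metadata, and "arrow.fixed_shape_tensor" with a JSON "shape" parameter follows the canonical extension type definition; the helper function names and the choice of JSON for every extension type are assumptions made for illustration only, since each extension type defines its own serialization.

```python
import json

# Field-metadata keys that Arrow uses to carry extension type information.
EXT_NAME_KEY = "ARROW:extension:name"
EXT_METADATA_KEY = "ARROW:extension:metadata"


def make_extension_metadata(name, params=None):
    """Build the two string parameters describing an extension type.

    Hypothetical helper: `params` is serialized as JSON here, which is an
    assumption of this sketch rather than a rule of the format.
    """
    serialized = "" if params is None else json.dumps(params, sort_keys=True)
    return {EXT_NAME_KEY: name, EXT_METADATA_KEY: serialized}


def parse_extension_metadata(metadata):
    """Return (name, params), or None if the field is not an extension type."""
    name = metadata.get(EXT_NAME_KEY)
    if name is None:
        return None
    serialized = metadata.get(EXT_METADATA_KEY, "")
    params = json.loads(serialized) if serialized else None
    return name, params


# The name designates the abstract type; the serialization carries its
# parameters (here, the shape of each tensor).
tensor_field_metadata = make_extension_metadata(
    "arrow.fixed_shape_tensor", {"shape": [2, 3]}
)
```

A reader that does not recognize the extension name can ignore both entries and fall back to the underlying storage type.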

Initially, Arrow extension types tended to be ad hoc and
application-specific, but there is a growing trend to standardize
"canonical extension types" to allow for better data interoperability
across widely-used data types [2].

From my experience as an Arrow PMC member, if Arrow didn't have
extension types, the barrier to proposing and standardizing new data
types would be much higher, especially for complex proposals such as
the fixed-shape and variable-shape tensor types.


For Parquet, extension types would be an alternative to enshrining
additional logical types in the Thrift specification. I can see several
advantages to extension types over additional logical types:

1) extension types would make it easier to experiment in dedicated
communities, trying to find out the best possible representation for
some kinds of data (example: the Geoparquet work)

2) extension types would allow "soft standardization": an extension type
could first be formally defined by a dedicated community, then
optionally find an official place under the Parquet project.

3) extension types would allow defining complex data representations
and semantics without imposing a large burden on the developers of
Parquet implementations, who may not have expertise in the target domain.
This includes non-trivial statistics such as bounding boxes for
geospatial data.


Technically, I can imagine two possible ways of adding extension types
to the Parquet format:

1) as an additional logical type;
2) as a separate type determination, in addition to the logical type.

We should also ensure it is possible to express extension-specific
statistics (such as bounding boxes for geospatial data).
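
To illustrate the difference between the two options, here is a rough sketch of the data model in plain Python. Everything in it is hypothetical: the class and field names, the "example.geometry" type name, and the bounding-box encoding are invented for this email and are not part of the Parquet Thrift definitions or of any existing proposal.

```python
import struct
from dataclasses import dataclass, field
from typing import Dict, Optional, Union


@dataclass
class ExtensionType:
    """Hypothetical: an extension name plus an opaque serialization."""
    name: str
    serialized: bytes = b""


@dataclass
class ColumnOption1:
    """Option 1: the extension type is one more logical type variant."""
    physical_type: str
    logical_type: Union[str, ExtensionType, None] = None


@dataclass
class ColumnOption2:
    """Option 2: the extension type sits alongside the logical type."""
    physical_type: str
    logical_type: Optional[str] = None
    extension_type: Optional[ExtensionType] = None


@dataclass
class ColumnStatistics:
    """Regular min/max plus opaque, extension-defined statistics."""
    min_value: Optional[bytes] = None
    max_value: Optional[bytes] = None
    extension_statistics: Dict[str, bytes] = field(default_factory=dict)


def extension_of(column):
    """Return the column's extension type under either option, else None."""
    if isinstance(column, ColumnOption2):
        return column.extension_type
    lt = column.logical_type
    return lt if isinstance(lt, ExtensionType) else None


geometry = ExtensionType("example.geometry", b'{"crs": "OGC:CRS84"}')
col1 = ColumnOption1("BYTE_ARRAY", logical_type=geometry)
col2 = ColumnOption2("BYTE_ARRAY", extension_type=geometry)

# Assumed encoding for this sketch: xmin, ymin, xmax, ymax as
# little-endian doubles. A real extension would specify its own layout.
stats = ColumnStatistics(
    extension_statistics={"bbox": struct.pack("<4d", -10.0, -5.0, 10.0, 5.0)}
)
```

Under option 2, a column can keep a regular logical type that generic readers understand while the extension type adds further meaning on top; under option 1, the extension type takes the logical type's place. In both cases the statistics payload stays opaque to implementations that do not know the extension.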

What do you think?

Regards

Antoine.


[1] https://arrow.apache.org/docs/format/Columnar.html#extension-types

[2]
https://arrow.apache.org/docs/dev/format/CanonicalExtensions.html


