Jan 8th 2025
Attendees:
- Julien: Datadog, interested in updates
- Micah: Google
- Andrew L: InfluxData, lurking
- Antoine: QuantStack, curious about Parquet 3 updates, Parquet C++ updates
- Russell: Snowflake, Listen to Micah talk to me about shredding, Geometry
- Alkis: Databricks
- Ashish: Sumo Logic, listen in
- Dewey: Wherobots, update on Geometry data type
- Daniel: Databricks, Variant shredding, Geometry
- Rok: listening in, footer
- Andrew B: point cloud in Parquet
Agenda:
- Parquet C++, quick update => Antoine
- New footer: quick update from Alkis
- Variant Shredding
- Geometry types => Dewey
  - Iceberg Geometry/Geography type PR: https://github.com/apache/iceberg/pull/10981
  - Parquet Geometry type PR: https://github.com/apache/parquet-format/pull/240
Notes:
- Parquet C++, quick update => Antoine
  - Gang Wu implemented new statistics in C++ (slight overhead, so off by default).
    - https://github.com/apache/arrow/pull/40594
  - Goal to improve performance so that it can be enabled by default.
    - https://github.com/apache/arrow/pull/45202
  - Extension types:
    - Would like to revive the extension types proposal in the future (interest from Micah and Antoine).
- New footer update:
  - Some metadata has been removed in the prototype (for compactness), but some readers need it.
    - Example: the prototype doesn’t have the converted types.
  - Early version of the experimental FlatBuffers footer: https://github.com/apache/arrow/pull/43793
- Variant Shredding:
  - Outstanding issues:
    - Micah: The specification is somewhat arbitrary on how to handle invalid Parquet files (whether shredded or unshredded columns take precedence).
      - This doesn’t really belong in the spec.
      - Reader behavior should be unspecified if the writer generates an invalid file.
      - Whether you read the shredded or unshredded data depends on the query, so saying one takes precedence is not really feasible in a performant way.
    - Russell agrees with leaving it undefined.
    - Julien agrees.
    - Shredding happens in a single pass. Allowing a column whose values are not all the same type to be shredded means we don’t need to backtrack.
  - In meeting:
    - Consensus that leaving it undefined is better.
    - We should error out in invalid cases as much as we can.
    - The spec should be very clear about invalid things that should not be allowed.
    - TODO: Daniel to follow up with Ryan to wrap this up.
  - What other implementations of Variant do we need to finalize?
    - Parquet Java
    - Another language: C++ (arrow/cpp), Rust (arrow-rs), Go…?
      - Discussion in Arrow about having a Variant extension type?
        - https://github.com/apache/arrow/issues/42069
      - Separate nascent effort of Iceberg C++
      - Ticket tracking adding Variant in Rust (arrow-rs): https://github.com/apache/arrow-rs/issues/6736
    - Enabler:
      - Produce data files to enable cross-compatibility tests.
    - TODO(Daniel W): follow up with Fokko on leading a Rust implementation.
      - Iceberg Rust uses the Parquet implementation from arrow-rs: https://github.com/apache/iceberg-rust/blob/6e07faacd7734886718ce544e40599eb2ce939e3/Cargo.toml#L79
    - TODO(Daniel W): explore the feasibility of a C++ implementation of unshredded Variant.
    - TODO(Daniel W): follow up with Ryan Blue on a plan for the non-Java (arrow/cpp or arrow-rs) implementation, and follow up on the mailing list.
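
Illustration for the Variant Shredding notes above: a rough sketch of what "error out in invalid cases" rather than "pick a precedence" could look like in a reader. This is not taken from the spec or from any implementation. The function and exception names are hypothetical, and the validity rule used here (both `value` and `typed_value` may be set only for a partially shredded object) is a simplified assumption.

```python
from typing import Optional


class InvalidShreddedVariant(ValueError):
    """Hypothetical error for value/typed_value combinations a reader rejects."""


def classify_shredded(value: Optional[bytes],
                      typed_value: Optional[object],
                      typed_is_object: bool = False) -> str:
    """Say where a reader would take the data from, or raise on a conflict.

    Assumed, simplified rule: `value` and `typed_value` may both be set only
    for a partially shredded object. Any other double population is treated
    as a writer bug instead of being resolved by a precedence rule.
    """
    if value is not None and typed_value is not None:
        if typed_is_object:
            return "partially_shredded_object"
        raise InvalidShreddedVariant(
            "both value and typed_value are set for a non-object column")
    if typed_value is not None:
        return "typed_value"
    if value is not None:
        return "value"
    return "missing"


if __name__ == "__main__":
    print(classify_shredded(None, 34))        # shredded column only
    print(classify_shredded(b"\x00", None))   # unshredded column only
    try:
        classify_shredded(b"\x00", 34)        # conflicting columns
    except InvalidShreddedVariant as err:
        print("rejected:", err)
```

Raising on the conflicting case keeps the reader from silently choosing one column over the other, which matches the consensus that precedence should stay undefined.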
On Wed, Jan 8, 2025 at 7:38 AM Julien Le Dem <[email protected]> wrote:
> The next Parquet sync is today Jan 8th at 9:30am PT - 12:30pm ET - 6:30pm
> CET
> To join the invite:
> https://calendar.app.google/uTqCRtdDFMAGttwY8
> Please contact me to be added to the recurring invite.
> Everybody is welcome, bring your topic or just listen in.
> Best
> Julien
>