Meeting notes:
Attendees:
Rok: contributor to Arrow, encryption, Rust
Gabor: Dremio, topic: Variant.
Fokko: Databricks
Dan: Databricks, topic: Variant Geo types
Kenny: hyparquet (js)
Gene: Databricks, topic: Variant
Andrew: InfluxData, Rust Parquet maintainer, DataFusion. topic: Variant
in Rust
Ashish: Sumo Logic, listen in
Micah: Google
Neil: Snowflake, variant C++
Ryan: Databricks, topic: variant, geo
Aihua: Snowflake, topic: variant
Dewey: topic: open Geometry PRs (C++, Rust)
Nong: Databricks
Agenda/Notes:
-
Geo types:
-
Geo implementations:
-
C++: https://github.com/apache/arrow/pull/45459
-
Java: https://github.com/apache/parquet-java/pull/2971
-
Update
- Geometry
- Geography: Stats TBD
- Java:
-
Christian and Fend have been working on the java implementation
-
Need a release
-
Fuzz testing
-
Getting a lot of feedback. Thanks!
-
Definition of the stats: in thrift with clear language.
-
Enable bounding boxes that cross the antimeridian (the 180° line, e.g. Fiji).
-
Don’t want stats that lie. Bad stats, bad data
-
Variant
-
Rust impl: https://github.com/apache/arrow-rs/issues/6736
-
Need: Unblock variant annotation in the java library
-
Finalize outstanding discussions
-
Versioning in Variant annotation => action item
-
What’s remaining to finalize the spec.
-
C++ and Java implementations
-
Java impl in iceberg, moving to Parquet
-
Impls:
-
2 working Java implementations
-
Spark Java implementation
<https://github.com/apache/spark/tree/master/common/variant/src/main/java/org/apache/spark/types/variant>
(binary, shredding)
-
Spark Python implementation
<https://github.com/apache/spark/blob/master/python/pyspark/sql/variant_utils.py>
(binary)
-
parquet-java implementation PR
<https://github.com/apache/parquet-java/pull/3117>
(binary)
-
C++ impl <https://github.com/apache/arrow/pull/45375>
-
2 private ones (Snowflake; Databricks: C++, binary,
shredding)
-
Lower priority: How to shred?
-
You cannot add columns after you instantiate the writer.
-
Could extend writer but collides with encryption
-
Adding columns for parquet schema in the middle of writing
invalidates encryption
-
Shredding released at the same time as the binary variant.
-
Dangerous to do shredding as a follow up
-
Tiny PR for the spec: GH-486: Variant object shredding without
field shredding <https://github.com/apache/parquet-format/pull/487>
-
Compatibility across implementations => Action item
-
Goal:
-
Combined Variant and shredding release
-
Do we require support for shredding?
-
Variant with shredding is not a separate type.
-
Did we agree to roll them out together?
-
We agree that we want to roll out together to reduce
potential inconsistencies in implementations. => Action item
-
Requirements for considering it ready to release:
-
Need example Parquet data files.
-
Versioning of variant spec
-
https://github.com/apache/parquet-format/pull/474
Action items
- [image: unchecked]
Julien, Ryan, Micah, Aihua: Follow up on the email thread about the
parquet-format type annotation for shredding, and how to make it easy to work
on implementations without ambiguous communication about releases
- [image: unchecked]
Andrew: follow up on the cross-implementation testing
- [image: unchecked]
Micah, Ryan, Dan: Finalize type annotation versioning discussion on PR
474
- [image: unchecked]
Ryan: email about the decision to release shredding with
Variant.
On Tue, Mar 4, 2025 at 6:13 PM Julien Le Dem <[email protected]> wrote:
> The next Parquet sync is tomorrow Mar 5th at 9:30am PT - 12:30pm ET -
> 6:30pm CET
> To join the invite:
> https://calendar.app.google/WTQgodyxSmBUimXT8
> Please contact me to be added to the recurring invite. (every two weeks)
> Everybody is welcome, bring your topic or just listen in.
> Best
> Julien
>