Re: Parquet sync today Wednesday May 6th

Julien Le Dem Wed, 06 May 2026 13:41:44 -0700

Meeting Notes:
https://docs.google.com/document/d/e/2PACX-1vSDHW7gvG8eO6aIxaIVPrZSqYYhtRDb5W1imnbpM4QRYNPsTwEO1fU5z7SEhVIFa4YqWJeSRJ9tcXYS/pub
Attendees:

- Julien Le Dem: Datadog, interested in updates on ongoing projects (ALP,
  fixed-length-array, metadata, …) and next release
- Dusan Paripovic: RTE, listening in
- Neelesh Salian: Apple, listening in
- Martin Prammer: Spiral, Datasets Project (Raincloud)
- Connor Tsui: Spiral, listening in
- Andrew Lamb: InfluxData, listening in
- Ismaël Mejía: Microsoft, performance improvements on Java
- Rok Mihevc: G-Research/Arctos Alliance <https://arctosalliance.org/>,
  Flatbuffers, vector-like datatype proposal
- Kenny Daniel: Hyperparam, listening in
- Russell Spitzer: Snowflake, listening in
- Will Edwards: Spotify, listening in
- Robert Kruszewski: Spiral, listening in
- Amogh Jahagirdar: Databricks, listening in
- Micah Kornfield: Databricks, listening in
- Daniel Weeks: Databricks, format improvements (footer, encodings, pages,
  types)
- Arnav Balyan: FSST
- Jiayi Wang: Databricks
- Benjamin Owad: Snowflake, listening in
- Ashish Paliwal: Apple, listening in
- Jiaying Li: Snowflake, listening in

Agenda:

- Parquet-java Release.
  - Err on the side of releasing often.
  - Gang is helping to make the release.
  - Ideally, this is more automated. Apache infra is working on this.
  - Small group of release managers per project.
  - TODO: start 2 threads
    - Russell: release automation
    - Raising your hand to shepherd release scope definition.
- [Andrew, Julien] Finalizing ALP (Floating Point Encoding for Parquet).
  - Mailing list:
    https://lists.apache.org/thread/cg68jco16ltqs6xrwphol5co8o2yjhpf
  - Andrew/Micah/Antoine reviewing the spec:
    https://github.com/apache/parquet-format/pull/557
  - Parquet-format examples in
    https://github.com/apache/parquet-testing/pull/100 (Andrew thinks
    they are quite large)
  - Needs: reviews of the C++ implementation and the Java implementation
  - Andrew: will review Rust implementation “soon”
    https://github.com/apache/arrow-rs/pull/9372
  - Hyparquet (JavaScript) ALP branch:
    https://github.com/hyparam/hyparquet/pull/161
  - Roll out plan:
    - Using ALP is behind a flag, to enable read but not write.
    - Model of fairly granular releases? 1 thing at a time. Example of
      the Iceberg model.
    - Here is the current implementation status page (notes dates when
      implementations supported the feature, not versions):
      https://parquet.apache.org/docs/file-format/implementationstatus/
      - See “Minimum Version for Read Support by Year” table as the
        current “state of the art”
    - Decision points:
      - Is it linear or not: supporting Vx means supporting everything
        before
      - Feature flags that turn into a version.
    - TODO:
      - Thread:
        - Review existing process voted upon
          https://lists.apache.org/thread/nq7n6pbp222txrfo232ybgpvlvpmykbp
          and see what is missing
        - Clarify the feature flag as a standard across
          implementations.
        - Clarify how ALP becomes on by default.
      - Not a blocker to releasing ALP behind a flag.
  - Pending validation:
    - C++ unsigned ints. => need more tests on the Java side.
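
To make the "linear or not" decision point from the ALP roll out
discussion concrete, here is a minimal sketch of the two models. The
feature names, version numbers, and function names are all hypothetical
and chosen for illustration; none of them come from the Parquet spec or
from any implementation.

```python
# Hypothetical feature names and version numbers, for illustration only.
FEATURES_BY_VERSION = {
    1: {"feature_a"},
    2: {"feature_b"},
    3: {"alp"},
}


def linear_supports(reader_version: int, feature: str) -> bool:
    """Linear model: supporting version N implies every feature up to N."""
    return any(
        feature in features
        for version, features in FEATURES_BY_VERSION.items()
        if version <= reader_version
    )


def flags_support(reader_flags: set, file_features: set) -> bool:
    """Flag model: a file is readable if the reader has every feature it uses."""
    return file_features <= reader_flags


def version_from_flags(reader_flags: set) -> int:
    """Flags that turn into a version: the highest N such that all
    features of versions 1..N are supported (0 if version 1 is not)."""
    supported = 0
    for version in sorted(FEATURES_BY_VERSION):
        if not FEATURES_BY_VERSION[version] <= reader_flags:
            break
        supported = version
    return supported
```

Under the flag model, a reader that has "feature_a" and "alp" but not
"feature_b" can read ALP files, yet it only qualifies as version 1 under
the linear model. That gap is what a thread on the feature flag standard
would need to settle.
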
- Parquet metadata (footer work) progress.
  - Jiayi: experiments, making every field modular
  - Flatbuffers might not be necessary.
  - TODO:
    - Discuss further in the working group.
    - Update on the list.
- [Weeks] Parquet: Non-contiguous Pages
  <https://docs.google.com/document/d/1nntcYM98PFSkHT70RexSBPtCnWqg1uRJ5_7m--ZgbsA>
  - TODO: look at the proposal posted on the list above.
  - There is interest in the project to address this head on.
  - The problem of asymmetric column size is not new.
- [Rok] Discuss new vector-like datatype proposal - Parquet fixed-size
  list type
  <https://docs.google.com/document/d/1nf30OqK_UqxA4YTEZQszmOBEG56m9M5mp9rIYC2SUWc/edit?tab=t.0>
  or reply on the ML
  https://lists.apache.org/thread/rolncdtobpmdqmqcr3ry087yhfw210l3
  - 3-4 proposals discussed so far.
  - The doc has an analysis of the pros and cons of each.
  - TODO:
    - Please comment in the doc with feedback. We’ll discuss next time
      with the goal of making a decision.
  - Preferred option: New vector_repetition type.
    - Improves on read and write. Not writing repetition_level. Still
      has nullability info (optional).
  - Daniel to ask a question in the doc about whether this works well
    with encodings.
  - O(1) read constraint?
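
As a rough illustration of why a fixed-size list can skip repetition
levels and still allow O(1) access to a row, here is a minimal sketch.
It assumes a flat values buffer plus a per-row null flag, and it assumes
that null rows still occupy `width` slots in the buffer. That padding is
an assumption of this sketch, not something decided in the proposal, and
the function name is hypothetical.

```python
from typing import List, Optional


def read_fixed_size_row(
    values: List[float],
    is_null: List[bool],
    width: int,
    row: int,
) -> Optional[List[float]]:
    """Return the vector stored at `row`, or None if that row is null.

    Assumes null rows are padded to `width` slots, so the start offset
    is simply row * width and no repetition levels need to be scanned.
    """
    if is_null[row]:
        return None
    start = row * width
    return values[start:start + width]


# Three rows of width 3; the middle row is null and padded with zeros.
values = [1.0, 2.0, 3.0, 0.0, 0.0, 0.0, 7.0, 8.0, 9.0]
is_null = [False, True, False]
third = read_fixed_size_row(values, is_null, width=3, row=2)
```

If null rows were not padded, the offset would depend on the number of
non-null rows before `row`, and the lookup would no longer be constant
time without an extra index.
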
- [Ismael] Java encoding/decoding performance.
  - 15 PRs (5 more yay!), 3 approved, 5 in progress, 7 to be reviewed
  - https://github.com/apache/parquet-java/issues/3530
  - PRs have been broken up for easier and more efficient review.
  - TODO:
    - Need reviews! Pretty please 🙂
    - Merge approved PRs.
    - Feel free to reply to the thread on the ML.
- [Martin] Datasets Project - https://github.com/spiraldb/raincloud
  - Make it easy to evaluate file formats in a reproducible framework
    with public datasets.
  - Make sure to cover all types and encodings for validating Parquet.
    Not necessarily at scale like TPC-H.
  - Don’t want to redistribute someone else's crawled data because of
    licensing constraints.
  - Future goal to generate good Parquet datasets for evaluation.
  - Feedback welcome on where this is going next.
    - Ex: Not picking one implementation over another.
- [Will E] Effort to add defensive validation in mainstream open source
  readers?
  - Have a more formalized list of checks that readers should implement,
    to give better errors, deal with forward compatibility, and
    introduce breaking changes gracefully.

On Wed, May 6, 2026 at 7:20 AM Julien Le Dem <[email protected]> wrote:

> The next Parquet sync is today Wednesday May 6th at 10am PT - 1pm ET -
> 7pm CET (in ~2.5h)
>
> To join the invite, join the group:
> https://groups.google.com/g/apache-parquet-community-sync
>
> Everybody is welcome, bring your topic or just listen in.
>
> (Some more details on how the meeting is run:
> https://lists.apache.org/thread/bjdkscmx7zvgfbw0wlfttxy8h6v3f71t )
>
