Re: Parquet sync tomorrow Wednesday Apr 22nd

Julien Le Dem Wed, 22 Apr 2026 16:02:38 -0700

Notes from the meeting:
https://docs.google.com/document/d/e/2PACX-1vSDHW7gvG8eO6aIxaIVPrZSqYYhtRDb5W1imnbpM4QRYNPsTwEO1fU5z7SEhVIFa4YqWJeSRJ9tcXYS/pub
Attendees:

- Micah Kornfield - Databricks - Listening in
- Neelesh Salian - Apple - Variant related items
- Robert Kruszewski - Spiral - Listening in
- Martin Prammer - Spiral - Listening in
- Gunnar Morling - Confluent - Listening in
- Kenny Daniel - Hyperparam - Listening in
- Divjot Arora - Databricks - Flatbuf footer
- Jiayi Wang - backward-compatible vs incompatible changes (part of
  flatbuf discussion)
- Ismaël Mejía - Microsoft - Java Encoding/Decoding perf
- Anurag Mantripragada - Apple - Listening in - Variant stuff
- Rok Mihevc - G-Research/Arctos Alliance <https://arctosalliance.org/> -
  Flatbuffers, FIXED_SIZE_LIST/VECTOR proposal
- Prateek - Snowflake - Listening in
- Benjamin Owad - Snowflake - Listening in
- Dusan Paripovic - RTE - Listening in
- Will Edwards - Spotify - Listening in
- Raúl Cumplido - QuantStack - Listening in
- Steve Loughran - Variant performance update (good!)
- Mengmeng Chen - Snowflake - Listening in
- Rahil Chertara - Onehouse - Listening in

Agenda:

- [Neelesh Salian + Steve Loughran] Variant related items
  - Iceberg - Variant Community Update
    <https://docs.google.com/document/d/1IuhLRxw1rcPD_f4jgHuGe3SwFgy7Y5wgEGvLzf6311s/edit?tab=t.froqj7pg3868#heading=h.r977qio1wsv2>
    (Parquet items as well)
  - See doc for Iceberg, Spark and Parquet related items
  - PRs open for lazy caching…
    (https://github.com/apache/parquet-java/pull/3481)
  - If you want to help, please reach out! Tracker and benchmark are in
    the doc.
- [Ismaël] Java Encoding/Decoding ask for review
  - Experimenting with improving open source libraries with AI.
  - Based on existing benchmarks.
  - Performance tests and PRs.
  - Avg 40% improvement on encodings (write path).
  - 10% on read path.
  - PRs have been reviewed by Ismaël; they are not just AI-generated.
  - Need help with reviews from maintainers.
  - https://github.com/apache/parquet-java/pull/3512
  - Gunnar: I've been working on a new Parquet parser (presented it to
    the group a few weeks back, https://github.com/hardwood-hq/hardwood);
    solely focused on parsing at the moment, i.e. decoding. Would love to
    learn about any improvements in that area, will check out your PRs.
- [Divjot + Jiayi + Rok] Flatbuffer footer
  - Ref to mailing list thread regarding building backward-compatible
    indices on the Thrift footer.
  - Goal: give faster random access in metadata.
  - 2 options:
  - Incremental updates: index on footer + reducing bloat by removing
    less useful metadata.
  - PR <https://github.com/apache/parquet-format/pull/564> to make
    path_in_schema optional
  - Bigger rewrite with rollout plan: new Flatbuffer-based footer.
  - Open items:
  - Handling Thrift schema evolution, making fields optional to
    deprecate.
  - Discuss increased complexity of Thrift jump tables.
  - Finalizing plan for the Flatbuffer footer.
  - Flatbuffer at prototype state?
  - Proposal:
  - 1) replace everything as in the current proposal
  - 2) make it minimal and more modular with extensions.
  - We have some internal benchmarks that show that most footers are
    actually smaller when using FlatBuffers after removing bloat (less
    useful fields). If there are public e2e benchmarks, let me know. But
    of course, only readers that adopt the flatbuf footer can benefit
    from it.
  - Kenny: That assumes making the breaking change of dropping Thrift.
    If we stay in a backward-compat world then we need both flat and
    Thrift. That makes files (and parsers) much larger and more
    complicated. I personally hate the idea of dropping Thrift as it will
    break a lot of systems. Making a big breaking change is an
    existential risk to Parquet... if it's going to be a hard break, why
    wouldn't users consider alternatives at that point? I like the idea
    of optimizing Thrift much more than Flatbuffer, personally.
  - Gunnar Morling: Yeah, similar sentiment here
  - Robert: How about embedding Vortex?
  - Stated goal: not to embed opaque encodings or schemes.
  - Embed Vortex Flatbuffer footer
  - Readers who can parse the footer can treat the opaque encoding as
    transparent
  - Input from other projects is welcome.
  - TODO:
  - Shared doc to articulate
  - Jiayi, Divjot, Will, Gunnar, Alkis, Robert, Rok
  - Content:
  - Describe the problem: large footer, wide schema
  - Can have big footer with many row groups as well.
  - Describe what's pathological
  - Describe the options at a high level, point to detailed docs of
    POC/proposals.
  - Useful to share files with the problem.
  - Difficult
  - Regular meeting. Jiayi: facilitator
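
For anyone looking for files that show the large-footer problem described
above, a rough way to measure footer size without a Parquet library is to
read the file tail: an unencrypted Parquet file ends with a 4-byte
little-endian footer length followed by the magic bytes PAR1. The sketch
below is illustrative only; the function name footer_size is made up for
this example, and files with any other trailing magic are simply rejected.

```python
import io
import struct

MAGIC = b"PAR1"  # trailing magic of an unencrypted Parquet file


def footer_size(f):
    """Return the length in bytes of the serialized footer metadata.

    `f` is a seekable binary file object. Hypothetical helper, not part
    of any Parquet library.
    """
    f.seek(0, io.SEEK_END)
    file_len = f.tell()
    # Smallest possible layout: leading magic + footer length + trailing magic.
    if file_len < 12:
        raise ValueError("too short to be a Parquet file")
    f.seek(-8, io.SEEK_END)
    tail = f.read(8)
    if tail[4:] != MAGIC:
        raise ValueError("missing trailing PAR1 magic")
    (length,) = struct.unpack("<I", tail[:4])
    if length > file_len - 12:
        raise ValueError("footer length exceeds file size")
    return length
```

Usage would be along the lines of `with open(path, "rb") as f:
print(footer_size(f))`, where `path` points at a local Parquet file.
Comparing this number against the total file size is one simple signal
for a pathological footer.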
- [Rok] FIXED_SIZE_LIST/VECTOR proposal
  - This is still ongoing.
  - 3 options; will write a doc and report to the mailing list.
  - Use case: efficiently store vectors.
  - Micah: how about adding a 4th option: a new logical type vector that
    annotates the existing FLBA type (?), so readers know they don't have
    to read repetition levels.
  - Rahil: similar to what is being done in Hudi.
  - Need to discuss dense vectors vs sparse vectors.
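
To illustrate the idea behind the 4th option: if every vector has the same
dimension, it can be packed into a fixed-length byte array of 4 * dim
bytes, so element boundaries follow from the width alone and no repetition
levels are needed to delimit values. The sketch below uses plain Python;
the helper names and the little-endian float32 layout are assumptions made
for this example, not something agreed in the meeting or defined by any
proposal.

```python
import struct


def pack_vector(values, dim):
    """Pack `dim` float32 values into a fixed-length byte string.

    Hypothetical helper; little-endian float32 is an assumed layout.
    """
    if len(values) != dim:
        raise ValueError(f"expected {dim} values, got {len(values)}")
    return struct.pack(f"<{dim}f", *values)


def unpack_vector(buf, dim):
    """Inverse of pack_vector: decode a 4 * dim byte buffer."""
    if len(buf) != 4 * dim:
        raise ValueError(f"expected {4 * dim} bytes, got {len(buf)}")
    return list(struct.unpack(f"<{dim}f", buf))


# Every packed value has the same width, so the i-th vector in a buffer
# of concatenated vectors starts at byte offset i * 4 * dim.
packed = pack_vector([1.0, 2.0, 0.5], dim=3)
restored = unpack_vector(packed, dim=3)
```

This only covers dense vectors; sparse vectors would need a different
layout, which is the open question noted above.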

On Tue, Apr 21, 2026 at 2:53 PM Julien Le Dem <[email protected]> wrote:

> The next Parquet sync is tomorrow Wednesday Apr 22nd at 10am PT - 1pm ET
> - 7pm CET
>
> To join the invite, join the group:
> https://groups.google.com/g/apache-parquet-community-sync
>
> Everybody is welcome, bring your topic or just listen in.
>
> (Some more details on how the meeting is run:
> https://lists.apache.org/thread/bjdkscmx7zvgfbw0wlfttxy8h6v3f71t )
>
