Re: Parquet sync tomorrow Wednesday Apr 22nd

Manu Zhang Wed, 22 Apr 2026 20:10:36 -0700

Hi Julien,

Thanks for the meeting notes. I wasn't able to attend. Did you discuss a
new parquet-java release?


Regards,
Manu

On Thu, Apr 23, 2026 at 7:02 AM Julien Le Dem <[email protected]> wrote:

> Notes from the meeting:
>
> https://docs.google.com/document/d/e/2PACX-1vSDHW7gvG8eO6aIxaIVPrZSqYYhtRDb5W1imnbpM4QRYNPsTwEO1fU5z7SEhVIFa4YqWJeSRJ9tcXYS/pub
> Attendees:
>
>    - Micah Kornfield - Databricks - Listening in
>    - Neelesh Salian - Apple - Variant related items
>    - Robert Kruszewski - Spiral - Listening in
>    - Martin Prammer - Spiral - Listening in
>    - Gunnar Morling - Confluent - Listening in
>    - Kenny Daniel - Hyperparam - Listening
>    - Divjot Arora - Databricks - Flatbuf footer
>    - Jiayi Wang - backward-compatible VS incompatible changes (part of
>      flatbuf discussion)
>    - Ismaël Mejía - Microsoft - Java Encoding/Decoding perf
>    - Anurag Mantripragada - Apple - Listening in - Variant stuff
>    - Rok Mihevc: G-Research/Arctos Alliance <https://arctosalliance.org/>,
>      Flatbuffers, FIXED_SIZE_LIST/VECTOR proposal
>    - Prateek - Snowflake - Listening in
>    - Benjamin Owad - Snowflake - Listening in
>    - Dusan Paripovic - RTE - Listening in
>    - Will Edwards - Spotify - Listening in
>    - Raúl Cumplido - QuantStack - Listening in
>    - Steve Loughran: Variant performance update (good!)
>    - Mengmeng Chen - Snowflake - Listening in
>    - Rahil Chertara - Onehouse - Listening in
>
>
> Agenda:
>
>    - [Neelesh Salian + Steve Loughran] Variant related items
>       - Iceberg - Variant Community Update
>         <https://docs.google.com/document/d/1IuhLRxw1rcPD_f4jgHuGe3SwFgy7Y5wgEGvLzf6311s/edit?tab=t.froqj7pg3868#heading=h.r977qio1wsv2>
>         (Parquet items as well)
>       - See doc for Iceberg, Spark and Parquet related items
>       - PRs open for lazy caching…
>         (https://github.com/apache/parquet-java/pull/3481)
>       - If you want to help, please reach out! Help welcome. Tracker and
>         benchmark in the doc.
>    - [Ismael] Java Encoding/Decoding ask for review
>       - Experimenting with improving open source libraries with AI.
>       - Based on existing benchmarks.
>       - Performance tests and PRs.
>       - Avg 40% improvement on encodings (write path).
>       - 10% on read path.
>       - PRs have been reviewed by Ismael: not just AI-generated.
>       - Need help with reviews from maintainers.
>          - https://github.com/apache/parquet-java/pull/3512
>       - Gunnar: I've been working on a new Parquet Parser (presented it to
>         the group a few weeks back, https://github.com/hardwood-hq/hardwood);
>         solely focused on parsing atm., i.e. decoding. Would love to learn
>         about any improvements in that area, will check out your PRs.
>    - [Divjot + Jiayi + Rok] Flatbuffer footer
>       - Ref to mailing list thread regarding building backward-compatible
>         indices on thrift footer.
>       - Goal to give faster random access in metadata.
>       - 2 options:
>          - Incremental updates: Index on footer + reducing bloat by removing
>            less useful metadata.
>             - PR <https://github.com/apache/parquet-format/pull/564> to make
>               path_in_schema optional
>          - Bigger rewrite with roll out plan: New Flatbuffer based footer.
>       - Open items:
>          - Handling thrift schema evolution, making fields optional to
>            deprecate.
>          - Discuss increased complexity of thrift jump tables.
>          - Finalizing plan for the flatbuffer footer.
>             - Flatbuffer at prototype state?
>             - Proposal:
>                - 1) replace everything as in the current proposal
>                - 2) make it minimal and more modular with extensions.
>          - We have some internal benchmarks that show that most footers are
>            actually smaller when using FlatBuffers after removing bloat
>            (less useful fields). If there are public e2e benchmarks, let me
>            know. But of course, only readers that adopt the flatbuf footer
>            can benefit from it.
>          - Kenny: That assumes making the breaking change of dropping thrift.
>            If we stay in a backward compat world then we need both flat and
>            thrift. That makes files (and parsers) much larger and more
>            complicated. I personally hate the idea of dropping thrift as it
>            will break a lot of systems. Making a big breaking change is an
>            existential risk to parquet... if it's going to be a hard break,
>            why wouldn't users consider alternatives at that point? I like
>            the idea of optimizing thrift much more than flatbuffer,
>            personally.
>          - Gunnar Morling: Yeah, similar sentiment here
>          - Robert: How about embedding Vortex?
>             - Stated goal not to embed opaque encodings, schemes.
>             - Embed vortex flatbuffer footer
>                - Readers who can parse the footer can treat the opaque
>                  encoding as transparent
>             - Input from other projects is welcome.
>       - TODO:
>          - Shared doc to articulate
>             - Jiayi, Divjot, Will, Gunnar, Alkis, Robert, Rok
>             - Content:
>                - Describe the problem: large footer, wide schema
>                   - Can have big footer with many row groups as well.
>                   - Describe what’s pathological
>                - Describe the options at a high level, point to detailed docs
>                  of POC/proposals.
>             - Useful to share files with the problem.
>                - Difficult
>          - Regular meeting. Jiayi: facilitator
>    - [Rok] FIXED_SIZE_LIST/VECTOR proposal
>       - This is still ongoing.
>       - 3 options, will write a doc and report to the mailing list.
>       - Use case: efficiently store Vectors
>       - Micah: how about adding a 4th option: new logical type vector that
>         annotates the existing FLBA type (?) => know you don’t have to read
>         Repetition Levels.
>          - Rahil: similar to what is being done in Hudi.
>          - Need to discuss dense vectors vs sparse vectors.
>
>
> On Tue, Apr 21, 2026 at 2:53 PM Julien Le Dem <[email protected]> wrote:
>
> > The next Parquet sync is tomorrow Wednesday Apr 22nd at 10am PT - 1pm ET
> > - 7pm CET
> >
> > To join the invite, join the group:
> > https://groups.google.com/g/apache-parquet-community-sync
> >
> > Everybody is welcome, bring your topic or just listen in.
> >
> > (Some more details on how the meeting is run:
> > https://lists.apache.org/thread/bjdkscmx7zvgfbw0wlfttxy8h6v3f71t )
> >
>
