Re: Parquet sync tomorrow Wednesday Apr 22nd

Micah Kornfield Thu, 23 Apr 2026 10:34:22 -0700

No, this was not discussed.

On Wed, Apr 22, 2026 at 8:11 PM Manu Zhang <[email protected]> wrote:


> Hi Julien,
>
> Thanks for the meeting notes. I wasn't able to attend. Did you discuss a
> new parquet-java release?
>
> Regards,
> Manu
>
> On Thu, Apr 23, 2026 at 7:02 AM Julien Le Dem <[email protected]> wrote:
>
> > Notes from the meeting:
> >
> > https://docs.google.com/document/d/e/2PACX-1vSDHW7gvG8eO6aIxaIVPrZSqYYhtRDb5W1imnbpM4QRYNPsTwEO1fU5z7SEhVIFa4YqWJeSRJ9tcXYS/pub
> > Attendees:
> >
> >    - Micah Kornfield - Databricks - Listening in
> >    - Neelesh Salian - Apple - Variant related items
> >    - Robert Kruszewski - Spiral - Listening in
> >    - Martin Prammer - Spiral - Listening in
> >    - Gunnar Morling - Confluent - Listening in
> >    - Kenny Daniel - Hyperparam - Listening
> >    - Divjot Arora - Databricks - Flatbuf footer
> >    - Jiayi Wang - backward-compatible vs. incompatible changes (part of
> >      flatbuf discussion)
> >    - Ismaël Mejía - Microsoft - Java Encoding/Decoding perf
> >    - Anurag Mantripragada - Apple - Listening in - Variant stuff
> >    - Rok Mihevc: G-Research/Arctos Alliance <https://arctosalliance.org/>,
> >      Flatbuffers, FIXED_SIZE_LIST/VECTOR proposal
> >    - Prateek - Snowflake - Listening in
> >    - Benjamin Owad - Snowflake - Listening in
> >    - Dusan Paripovic - RTE, listening in
> >    - Will Edwards - Spotify - Listening in
> >    - Raúl Cumplido - QuantStack - Listening in
> >    - Steve Loughran: Variant performance update (good!)
> >    - Mengmeng Chen - Snowflake - listening in
> >    - Rahil Chertara - Onehouse - listening in
> >
> > Agenda:
> >
> >    - [Neelesh Salian + Steve Loughran] Variant related items
> >       - Iceberg - Variant Community Update
> >         <https://docs.google.com/document/d/1IuhLRxw1rcPD_f4jgHuGe3SwFgy7Y5wgEGvLzf6311s/edit?tab=t.froqj7pg3868#heading=h.r977qio1wsv2>
> >         (Parquet items as well)
> >       - See doc for Iceberg, Spark and Parquet related items
> >       - PRs open for lazy caching… (https://github.com/apache/parquet-java/pull/3481)
> >       - If you want to help, please reach out! Help welcome. Tracker and
> >         benchmark in the doc.
> >    - [Ismael] Java Encoding/Decoding ask for review
> >       - Experimenting with improving open source libraries with AI.
> >       - Based on existing benchmarks.
> >       - Performance tests and PRs.
> >       - Avg 40% improvement on encodings (write path).
> >       - 10% on read path.
> >       - PRs have been reviewed by Ismaël; they are not just AI-generated.
> >       - Need help with reviews from maintainers.
> >          - https://github.com/apache/parquet-java/pull/3512
> >       - Gunnar: I've been working on a new Parquet parser (presented it to
> >         the group a few weeks back, https://github.com/hardwood-hq/hardwood);
> >         solely focused on parsing atm, i.e. decoding. Would love to learn
> >         about any improvements in that area, will check out your PRs.
> >    - [Divjot + Jiayi + Rok] Flatbuffer footer
> >       - Ref to mailing list thread regarding building backward-compatible
> >         indices on the Thrift footer.
> >       - Goal: give faster random access in metadata.
> >       - 2 options:
> >          - Incremental updates: index on footer + reducing bloat by
> >            removing less useful metadata.
> >             - PR <https://github.com/apache/parquet-format/pull/564> to
> >               make path_in_schema optional
> >          - Bigger rewrite with rollout plan: new Flatbuffer-based footer.
> >       - Open items:
> >          - Handling Thrift schema evolution, making fields optional to
> >            deprecate.
> >          - Discuss increased complexity of Thrift jump tables.
> >          - Finalizing plan for the Flatbuffer footer.
> >             - Flatbuffer at prototype state?
> >             - Proposal:
> >                - 1) replace everything as in the current proposal
> >                - 2) make it minimal and more modular with extensions.
> >          - We have some internal benchmarks that show that most footers are
> >            actually smaller when using FlatBuffers after removing bloat
> >            (fields that are not useful). If there are public e2e benchmarks,
> >            let me know. But of course, only readers that adopt the flatbuf
> >            footer can benefit from it.
> >          - Kenny: That assumes making the breaking change of dropping
> >            Thrift. If we stay in a backward-compat world then we need both
> >            flat and Thrift. That makes files (and parsers) much larger and
> >            more complicated. I personally hate the idea of dropping Thrift
> >            as it will break a lot of systems. Making a big breaking change
> >            is an existential risk to Parquet... if it's going to be a hard
> >            break, why wouldn't users consider alternatives at that point? I
> >            like the idea of optimizing Thrift much more than Flatbuffer,
> >            personally.
> >          - Gunnar Morling: Yeah, similar sentiment here
> >          - Robert: How about embedding Vortex?
> >             - Stated goal not to embed opaque encodings, schemes.
> >             - Embed Vortex flatbuffer footer
> >                - Readers who can parse the footer can treat the opaque
> >                  encoding as transparent
> >             - Input from other projects is welcome.
> >       - TODO:
> >          - Shared doc to articulate
> >             - Jiayi, Divjot, Will, Gunnar, Alkis, Robert, Rok
> >             - Content:
> >                - Describe the problem: large footer, wide schema
> >                   - Can have big footer with many row groups as well.
> >                   - Describe what’s pathological
> >                - Describe the options at a high level, point to detailed
> >                  docs of POC/proposals.
> >             - Useful to share files with the problem.
> >                - Difficult
> >          - Regular meeting. Jiayi: facilitator
> >    - [Rok] FIXED_SIZE_LIST/VECTOR proposal
> >       - This is still ongoing.
> >       - 3 options, will write a doc and report to the mailing list.
> >       - Use case: efficiently store vectors
> >       - Micah: how about adding a 4th option: a new logical type vector that
> >         annotates the existing FLBA type (?) => you know you don’t have to
> >         read repetition levels.
> >          - Rahil: similar to what is being done in Hudi.
> >          - Need to discuss dense vectors vs sparse vectors.
> >
> >
> > On Tue, Apr 21, 2026 at 2:53 PM Julien Le Dem <[email protected]> wrote:
> >
> > > The next Parquet sync is tomorrow Wednesday Apr 22nd at 10am PT - 1pm ET
> > > - 7pm CET
> > >
> > > To get the invite, join the group:
> > > https://groups.google.com/g/apache-parquet-community-sync
> > >
> > > Everybody is welcome, bring your topic or just listen in.
> > >
> > > (Some more details on how the meeting is run:
> > > https://lists.apache.org/thread/bjdkscmx7zvgfbw0wlfttxy8h6v3f71t )
> > >
> >
>
