Re: Parquet sync tomorrow Wednesday Apr 22nd

Julien Le Dem Thu, 23 Apr 2026 14:07:57 -0700

That said, I don’t think we need a discussion to go ahead. Some of the PRs 
mentioned in the thread have been merged. What’s left is to decide whether 
there’s anything else we need to wait on.
I would encourage a bias towards making a release sooner, even if other 
things are pending. We can always make another one later. 
Best
Julien


> On Apr 23, 2026, at 10:34, Micah Kornfield <[email protected]> wrote:
> 
> No, this was not discussed.
> 
>> On Wed, Apr 22, 2026 at 8:11 PM Manu Zhang <[email protected]> wrote:
>> 
>> Hi Julien,
>> 
>> Thanks for the meeting notes. I wasn't able to attend. Did you discuss a
>> new parquet-java release?
>> 
>> Regards,
>> Manu
>> 
>>> On Thu, Apr 23, 2026 at 7:02 AM Julien Le Dem <[email protected]> wrote:
>>> 
>>> Notes from the meeting:
>>> 
>>> 
>>> https://docs.google.com/document/d/e/2PACX-1vSDHW7gvG8eO6aIxaIVPrZSqYYhtRDb5W1imnbpM4QRYNPsTwEO1fU5z7SEhVIFa4YqWJeSRJ9tcXYS/pub
>>> Attendees:
>>> 
>>>   -
>>> 
>>>   Micah Kornfield - Databricks - Listening in
>>>   -
>>> 
>>>   Neelesh Salian - Apple - Variant related items
>>>   -
>>> 
>>>   Robert Kruszewski - Spiral - Listening in
>>>   -
>>> 
>>>   Martin Prammer - Spiral - Listening in
>>>   -
>>> 
>>>   Gunnar Morling - Confluent - Listening in
>>>   -
>>> 
>>>   Kenny Daniel - Hyperparam - Listening
>>>   -
>>> 
>>>   Divjot Arora - Databricks - Flatbuf footer
>>>   -
>>> 
>>>   Jiayi Wang - backward-compatible vs. incompatible changes (part of
>>>   flatbuf discussion)
>>>   -
>>> 
>>>   Ismaël Mejía - Microsoft - Java Encoding/Decoding perf
>>>   -
>>> 
>>>   Anurag Mantripragada - Apple - Listening in - Variant stuff
>>> 
>>> 
>>>   -
>>> 
>>>   Rok Mihevc: G-Research/Arctos Alliance <https://arctosalliance.org/>,
>>>   Flatbuffers, FIXED_SIZE_LIST/VECTOR proposal
>>>   -
>>> 
>>>   Prateek - Snowflake - Listening in
>>>   -
>>> 
>>>   Benjamin Owad - Snowflake - Listening in
>>> 
>>> 
>>>   -
>>> 
>>>   Dusan Paripovic - RTE - Listening in
>>>   -
>>> 
>>>   Will Edwards - Spotify - Listening in
>>>   -
>>> 
>>>   Raúl Cumplido - QuantStack - Listening in
>>>   -
>>> 
>>>   Steve Loughran: Variant performance update (good!)
>>>   -
>>> 
>>>   Mengmeng Chen - Snowflake - listening in
>>>   -
>>> 
>>>   Rahil Chertara - Onehouse - listening in
>>> 
>>> 
>>> Agenda:
>>> 
>>>   -
>>> 
>>>   [Neelesh Salian + Steve Loughran] Variant related items
>>>   -
>>> 
>>>      Iceberg - Variant Community Update
>>>      <https://docs.google.com/document/d/1IuhLRxw1rcPD_f4jgHuGe3SwFgy7Y5wgEGvLzf6311s/edit?tab=t.froqj7pg3868#heading=h.r977qio1wsv2>
>>>      (Parquet items as well)
>>>      -
>>> 
>>>      See doc for Iceberg, Spark and Parquet related items
>>>      -
>>> 
>>>      PRs open for lazy caching… (https://github.com/apache/parquet-java/pull/3481)
>>>      -
>>> 
>>>      If you want to help, please reach out! Help welcome. Tracker and
>>>      benchmark in the doc.
>>>      -
>>> 
>>>   [Ismaël] Java Encoding/Decoding: request for review
>>>   -
>>> 
>>>      Experimenting with improving open source libraries with AI.
>>>      -
>>> 
>>>      Based on existing benchmarks.
>>>      -
>>> 
>>>      Performance tests and PRs.
>>>      -
>>> 
>>>      Avg 40% improvement on encodings (write path).
>>>      -
>>> 
>>>      10% on read path.
>>>      -
>>> 
>>>      PRs have been reviewed by Ismaël, not just AI-generated.
>>>      -
>>> 
>>>      Need help with reviews from maintainers.
>>>      -
>>> 
>>>         https://github.com/apache/parquet-java/pull/3512
>>>         -
>>> 
>>>      Gunnar: I've been working on a new Parquet Parser (presented it to
>>>      the group a few weeks back, https://github.com/hardwood-hq/hardwood);
>>>      solely focused on parsing atm., i.e. decoding. Would love to learn
>>>      about any improvements in that area, will check out your PRs.
>>>      -
>>> 
>>>   [Divjot + Jiayi + Rok] Flatbuffer footer
>>>   -
>>> 
>>>      Reference to the mailing list thread on building backward-compatible
>>>      indices on the thrift footer.
>>>      -
>>> 
>>>      Goal: faster random access to metadata.
>>>      -
>>> 
>>>      2 options:
>>>      -
>>> 
>>>         Incremental updates: Index on footer + reducing bloat by
>>>         removing less useful metadata.
>>>         -
>>> 
>>>            PR <https://github.com/apache/parquet-format/pull/564> to
>>>            make path_in_schema optional
>>>            -
>>> 
>>>         Bigger rewrite with roll out plan: New Flatbuffer based footer.
>>>         -
>>> 
>>>      Open items:
>>>      -
>>> 
>>>         Handling thrift schema evolution, making fields optional to
>>>         deprecate.
>>>         -
>>> 
>>>         Discuss increased complexity of thrift jump tables.
>>>         -
>>> 
>>>         Finalizing plan for the flatbuffer footer.
>>>         -
>>> 
>>>            Flatbuffer at prototype state?
>>>            -
>>> 
>>>            Proposal:
>>>            -
>>> 
>>>               1) replace everything as in the current proposal
>>>               -
>>> 
>>>               2) make it minimal and more modular with extensions.
>>>               -
>>> 
>>>         We have some internal benchmarks that show that most footers are
>>>         actually smaller when using FlatBuffers after removing bloat
>>>         (less useful fields). If there are public e2e benchmarks, let me
>>>         know. But of course, only readers that adopt the flatbuf footer
>>>         can benefit from it.
>>>         -
>>> 
>>>         Kenny: That assumes making the breaking change of dropping thrift.
>>>         If we stay in a backward compat world then we need both flatbuf
>>>         and thrift. That makes files (and parsers) much larger and more
>>>         complicated. I personally hate the idea of dropping thrift as it
>>>         will break a lot of systems. Making a big breaking change is an
>>>         existential risk to parquet... if it's going to be a hard break,
>>>         why wouldn't users consider alternatives at that point? I like
>>>         the idea of optimizing thrift much more than flatbuffer,
>>>         personally.
>>>         -
>>> 
>>>         Gunnar Morling: Yeah, similar sentiment here
>>>         -
>>> 
>>>         Robert: How about embedding Vortex?
>>>         -
>>> 
>>>            Stated goal not to embed opaque encodings, schemes.
>>>            -
>>> 
>>>            Embed vortex flatbuffer footer
>>>            -
>>> 
>>>               Readers who can parse the footer can treat the opaque
>>>               encoding as transparent
>>>               -
>>> 
>>>            Input from other projects is welcome.
>>>            -
>>> 
>>>      TODO:
>>>      -
>>> 
>>>         Shared doc to articulate
>>>         -
>>> 
>>>            Jiayi, Divjot, Will, Gunnar, Alkis, Robert, Rok
>>>            -
>>> 
>>>            Content:
>>>            -
>>> 
>>>               Describe the problem: large footer, wide schema
>>>               -
>>> 
>>>                  Can have big footer with many row groups as well.
>>>                  -
>>> 
>>>                  Describe what’s pathological
>>>                  -
>>> 
>>>               Describe the options at a high level, point to detailed
>>>               docs of POC/proposals.
>>>               -
>>> 
>>>            Useful to share files with the problem.
>>>            -
>>> 
>>>               Difficult
>>>               -
>>> 
>>>         Regular meeting. Jiayi: facilitator
>>>         -
>>> 
>>>   [Rok] FIXED_SIZE_LIST/VECTOR proposal
>>>   -
>>> 
>>>      This is still ongoing.
>>>      -
>>> 
>>>      3 options, will write a doc and report to the mailing list.
>>>      -
>>> 
>>>      Use case: efficiently store Vectors
>>>      -
>>> 
>>>      Micah: how about adding a 4th option: a new logical type, vector,
>>>      that annotates the existing FLBA type (?) => readers know they don’t
>>>      have to read repetition levels.
>>>      -
>>> 
>>>         Rahil: similar to what is being done in Hudi.
>>>         -
>>> 
>>>         Need to discuss dense vectors vs sparse vectors.
>>> 
>>> 
>>>> On Tue, Apr 21, 2026 at 2:53 PM Julien Le Dem <[email protected]> wrote:
>>> 
>>>> The next Parquet sync is tomorrow Wednesday Apr 22nd at 10am PT -
>>>> 1pm ET - 7pm CET
>>>> 
>>>> To get the invite, join the group:
>>>> https://groups.google.com/g/apache-parquet-community-sync
>>>> 
>>>> Everybody is welcome, bring your topic or just listen in.
>>>> 
>>>> (Some more details on how the meeting is run:
>>>> https://lists.apache.org/thread/bjdkscmx7zvgfbw0wlfttxy8h6v3f71t )
>>>> 
>>> 
>>
