Meeting Notes:
https://docs.google.com/document/d/e/2PACX-1vSDHW7gvG8eO6aIxaIVPrZSqYYhtRDb5W1imnbpM4QRYNPsTwEO1fU5z7SEhVIFa4YqWJeSRJ9tcXYS/pub
Attendees:

   -

   Julien Le Dem: Datadog, interested in updates on ongoing projects (ALP,
   fixed-length-array, metadata, …) and next release
   -

   Dusan Paripovic: RTE, listening in
   -

   Neelesh Salian: Apple, listening in
   -

   Martin Prammer: Spiral, Datasets Project (Raincloud)
   -

   Connor Tsui: Spiral, listening in
   -

   Andrew Lamb (InfluxData), listening in
   -

   Ismaël Mejía: Microsoft, performance improvements on Java
   -

   Rok Mihevc: G-Research/Arctos Alliance <https://arctosalliance.org/>,
   Flatbuffers, vector-like datatype proposal
   -

   Kenny Daniel, Hyperparam, listening
   -

   Russell Spitzer, Snowflake, Listening in
   -

   Will Edwards: Spotify, listening in
   -

   Robert Kruszewski: Spiral, listening in
   -

   Amogh Jahagirdar: Databricks, listening in
   -

   Micah Kornfield - Databricks, listening in
   -

   Daniel Weeks: Databricks, format improvements (footer, encodings, pages,
   types)
   -

   Arnav Balyan - FSST
   -

   Jiayi Wang - Databricks
   -

   Benjamin Owad - Snowflake (listening in)
   -

   Ashish Paliwal - Apple (listening in)
   -

   Jiaying Li - Snowflake (listening in)


Agenda:

   -

   Parquet-java Release.
   -

      Err on the side of releasing often
      -

      Gang is helping to make the release
      -

      Ideally, this is more automated. Apache infra working on this.
      -

      Small group of release managers per project.
      -

      TODO: start 2 threads
      -

         Russell: release automation
         -

         Raising your hand to shepherd release scope definition.
         -

   [Andrew, Julien] Finalizing ALP (Floating Point Encoding for Parquet).
   -

      Mailing list:
      https://lists.apache.org/thread/cg68jco16ltqs6xrwphol5co8o2yjhpf
      -

      Andrew/Micah/Antoine reviewing the spec:
      https://github.com/apache/parquet-format/pull/557
      -

      Parquet-format examples in
      https://github.com/apache/parquet-testing/pull/100 (Andrew thinks
      they are quite large)
      -

      Needs: Reviews of C++ implementation and Java implementation
      -

      Andrew: will review Rust implementation “soon”
      https://github.com/apache/arrow-rs/pull/9372
      -

      Hyparquet (JavaScript) ALP branch:
      https://github.com/hyparam/hyparquet/pull/161
      -

      Roll out plan:
      -

         ALP is behind a flag: reading is enabled, but writing is not
         -

         Model of fairly granular releases? One thing at a time. Example:
         the Iceberg model.
         -

         Here is the current implementation status page (notes dates when
         implementations supported the feature, not versions):
         https://parquet.apache.org/docs/file-format/implementationstatus/
         -

            See “Minimum Version for Read Support by Year” table as the
            current “state of the art”
            -

         Decision points:
         -

            Is it linear or not: does supporting Vx mean supporting
            everything before it?
            -

            Feature flags that turn into a version.
            -

         TODO:
         -

            Thread:
            -

               Review existing process voted upon
               https://lists.apache.org/thread/nq7n6pbp222txrfo232ybgpvlvpmykbp
               and see what is missing
               -

               Clarify the feature flag as a standard across
               implementations.
               -

               Clarify how ALP becomes on by default.
               -

               Not a blocker to releasing ALP behind a flag.
               -

      Pending validation:
      -

         C++ unsigned ints => need more tests on the Java side.
         -

   Parquet metadata (footer work) progress.
   -

      Jiayi: experiments, making every field modular
      -

         FlatBuffers might not be necessary.
         -

      TODO:
      -

         Discuss further in the working group.
         -

         Update on the list.
         -

   [Weeks] Parquet: Non-contiguous Pages
   <https://docs.google.com/document/d/1nntcYM98PFSkHT70RexSBPtCnWqg1uRJ5_7m--ZgbsA>
   -

      TODO: look at the proposal posted on the list above.
      -

      There is interest in the project to address this head on.
      -

      The problem of asymmetric column size is not new.
      -

   [Rok] Discuss new vector-like datatype proposal - Parquet fixed-size
   list type
   <https://docs.google.com/document/d/1nf30OqK_UqxA4YTEZQszmOBEG56m9M5mp9rIYC2SUWc/edit?tab=t.0>
   or reply on the ML
   https://lists.apache.org/thread/rolncdtobpmdqmqcr3ry087yhfw210l3
   -

      3-4 proposals discussed so far.
      -

      The doc has an analysis of the pros and cons of each.
      -

      TODO:
      -

         Please comment in the doc with feedback. We’ll discuss next time
         with the goal of making a decision.
         -

         Preferred option: New vector_repetition type.
         -

            Improves read and write. Does not write repetition_level. Still
            has nullability info (optional).
            -

         Daniel to ask a question in the doc regarding whether this works
         well with encodings.
         -

            O(1) read constraint?
            -

   [Ismael] Java encoding/decoding performance.
   -

      15 PRs (5 more yay!), 3 approved, 5 in progress, 7 to be reviewed
      -

      https://github.com/apache/parquet-java/issues/3530
      -

      PRs have been broken up for easier and more efficient review.
      -

      TODO:
      -

         Need reviews! Pretty please 🙂
         -

         Merge approved PRs.
         -

         Feel free to reply to the thread on the ML.
         -

   [Martin] Datasets Project - https://github.com/spiraldb/raincloud
   -

      Make it easy to evaluate file formats in a reproducible framework
      with public datasets.
      -

      Make sure to cover all types and encodings for validating Parquet.
      Not necessarily at a scale like TPC-H.
      -

      Don’t want to redistribute someone else's crawled data because of
      licensing constraints.
      -

      Future goal: generate good Parquet datasets for evaluation.
      -

      Feedback welcome on where this is going next.
      -

         Ex: Do not pick one implementation over another.
         -

   [Will E] Effort to add defensive validation in mainstream open source
   readers?
   -

      Have a more formalized list of checks that readers should perform, to
      give better errors, deal with forward compatibility, and introduce
      breaking changes gracefully.
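
As an illustration of the kind of check such a list might contain, here is
a minimal sketch in Python. It only validates the file framing described in
the Parquet format (4-byte magic at both ends, and a 4-byte little-endian
footer length before the trailing magic). The function name
check_parquet_framing and the error wording are hypothetical and not taken
from any existing reader.

```python
import struct
from typing import List

MAGIC = b"PAR1"            # files with a plaintext footer
MAGIC_ENCRYPTED = b"PARE"  # files with an encrypted footer


def check_parquet_framing(data: bytes) -> List[str]:
    """Hypothetical helper: return a list of problems (empty if none found)."""
    # Smallest possible file: leading magic + footer length + trailing magic.
    if len(data) < 12:
        return [f"file too small to be Parquet: {len(data)} bytes"]

    problems = []
    head, tail = data[:4], data[-4:]
    if head not in (MAGIC, MAGIC_ENCRYPTED):
        problems.append(f"unexpected leading magic {head!r}")
    if tail not in (MAGIC, MAGIC_ENCRYPTED):
        problems.append(f"unexpected trailing magic {tail!r}")

    # The footer length sits in the 4 bytes before the trailing magic.
    (footer_len,) = struct.unpack("<I", data[-8:-4])
    max_footer = len(data) - 12
    if footer_len > max_footer:
        problems.append(
            f"footer length {footer_len} exceeds available {max_footer} bytes"
        )
    return problems
```

Returning every problem found, rather than failing on the first one, is one
possible design choice for producing clearer errors; a real reader would
continue with metadata-level checks after these.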


On Wed, May 6, 2026 at 7:20 AM Julien Le Dem <[email protected]> wrote:

> The next Parquet sync is today Wednesday May 6th at 10am PT - 1pm ET -
> 7pm CET (in ~2.5h)
>
> To join the invite, join the group:
> https://groups.google.com/g/apache-parquet-community-sync
>
> Everybody is welcome, bring your topic or just listen in.
>
> (Some more details on how the meeting is run:
> https://lists.apache.org/thread/bjdkscmx7zvgfbw0wlfttxy8h6v3f71t )
>
