Notes: https://docs.google.com/document/d/e/2PACX-1vSDHW7gvG8eO6aIxaIVPrZSqYYhtRDb5W1imnbpM4QRYNPsTwEO1fU5z7SEhVIFa4YqWJeSRJ9tcXYS/pub

Mar 11, 2026

Attendees:
- Micah Kornfield: Databricks (Nothing to discuss)
- Robert Kruszewski: Spiral (Nothing to discuss)
- Connor Tsui: Vortex (listening in)
- Jiayi Wang: Databricks (Flatbuffers)
- Will Edwards: Spotify (listening in)
- Prateek Gaur: Snowflake
- Ben Owad: Snowflake (listening in)
- Jiaying Li: Snowflake (listening in)
- Rok Mihevc: G-Research/Arctos Alliance <https://arctosalliance.org/> (Flatbuffers, FIXED_SIZE_LIST)
- Vinoo (had to drop early - sorry!): Kepler (ALP Java - caught up with Prateek; misc final website changes)
- Arnav: Uber (ALP Go, FSST spec review)
- Julien: Datadog (listening in)
- Russell: Snowflake (listening in)
- Andrew Lamb: InfluxData (Nothing to discuss)
- Rahil Chertara: Onehouse (listening in)
- Fokko Driesprong (listening in)

Agenda:
- Flatbuffers <https://github.com/apache/parquet-format/pull/544> (GH-531: Add parquet flatbuf schema #544)
  - Back to working on it after a pause. Please have a look.
  - Ready to merge from the author’s perspective.
  - Rok:
    - Main comment: geospatial statistics were omitted.
    - Minor comments still ongoing.
  - Comments were addressed by Jiayi. Need Alkis to resolve his comments.
  - Micah: needs to review again.
  - Need to discuss: how to finalize the plan to make it a permanent footer.
    - Rough plan: replace the magic number to make it the new footer.
      - Encryption to be finalized (mostly addressed now).
      - Extension of the extension?
      - Removes some statistics:
        - Distinct count
          - Will Edwards (Spotify) uses distinct count in an internal project, to pick the right join.
          - TODO: discuss the trade-off of storing them. => We don’t want to drop it.
        - Histograms
      - We can evolve the format as we go.
      - TODO: discuss the trade-off of adding back distinct count and the other dropped statistics. We need a good solution.
- Fixed size list PR <https://github.com/apache/parquet-format/pull/241>, ML <https://lists.apache.org/thread/soqd69k8y7b6z0sxbmgrbxcwxbvlj353>
  - New data type: avoid DL.
  - Option 1: byte array logical type. => Not good for encodings.
  - Option 2 (Antoine): vector DL. => Breaking.
  - Option 3 (Micah): moving away from Dremel encodings. => Breaking.
  - Other options: if you know it’s fixed size => skip reading RL.
  - Intermediate solution vs. long term?
  - TODO:
    - Discuss pros and cons on the ML. Summarize the options.
    - Rok to follow up on the mailing list with the help of Micah.
- JSON to Variant parsing PR: https://github.com/apache/parquet-java/pull/3415
  - Ported from the Spark repo.
  - TODO: add a notice (https://github.com/apache/parquet-java/blob/master/NOTICE).
- Encodings
  - Correctness testing
  - Cross-language testing
  - ALP
    - 1) Spec document: https://docs.google.com/document/d/1xz2cudDpN2Y1ImFcTXh15s-3fPtD_aWt/edit
    - 2) Spec document in the parquet-format repo: https://github.com/apache/parquet-format/pull/557
    - 3) ALP implementation in the Arrow C++ repo: https://github.com/apache/arrow/pull/48345/changes
    - Benchmarking artifacts in the parquet-testing repo: https://github.com/apache/parquet-testing/pull/100
    - Working with Vinoo on cross-language testing: write from C++ -> read from Java, and vice versa.
    - Andrew: can we have small files? => Trimmed to 15,000 rows => a few MBs.
    - Rust impl in progress.
    - Benchmark numbers: https://gist.github.com/pitrou/1f4aefb7034657ce018231d87993f437

      Table 1: C++ ALP Double Decode — Spotify Columns (Graviton 3, ARM Neoverse V1)

      | Column       | -O2 (MB/s) | -O3 (MB/s) | Speedup |
      |--------------|------------|------------|---------|
      | valence      | 3,155      | 5,523      | 1.75x   |
      | danceability | 3,233      | 5,685      | 1.76x   |
      | energy       | 3,197      | 5,652      | 1.77x   |
      | loudness     | 3,186      | 5,473      | 1.72x   |

    - Arrow C++ is built with the -O2 flag instead of -O3. => ~60% speedup with the -O3 flag.
      - Can we use -O3 instead? -O3 is living on the edge.
    - Vinoo will publish Java numbers.
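For readers new to the encoding under discussion, here is a minimal, illustrative Python sketch of the ALP idea: scale each double by a per-vector pair of powers of ten (exponent e, factor f), keep it only if it round-trips exactly through an integer, and patch the remaining values back in as exceptions on decode. The function names and list-based layout are invented for illustration only; the spec PR above is the authoritative definition.

```python
# Toy sketch of the ALP round-trip/exception mechanism (not the spec).
# (e, f) is a per-vector pair of decimal exponents chosen by the encoder.

def alp_encode(values, e, f):
    """Return (integers, exceptions) for one vector under chosen (e, f)."""
    ints, exceptions = [], {}
    for i, d in enumerate(values):
        n = round(d * 10**e / 10**f)      # scale the double to an integer
        if n * 10**f / 10**e == d:        # bit-exact round-trip check
            ints.append(n)
        else:
            ints.append(0)                # placeholder, patched on decode
            exceptions[i] = d             # value stored verbatim instead
    return ints, exceptions

def alp_decode(ints, exceptions, e, f):
    """Invert the scaling, then patch the exceptions back in."""
    out = [n * 10**f / 10**e for n in ints]   # cheap: multiply + divide
    for i, d in exceptions.items():
        out[i] = d
    return out
```

Real implementations pick (e, f) per vector by sampling, vectorize the scale-and-round, and bit-pack the resulting integers; this sketch only shows why decode is a simple multiply plus an exception patch loop.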
    - Next steps:
      - Some stylistic comments to finalize the PR (no crashing).
      - Once testing is finalized (2 different factor sizes, NaN, …), including cross-implementation testing, vote to officially approve the new encoding.
      - Celebrate 🎉
  - FSST
    - 1) Spec doc: https://docs.google.com/document/d/1Xg2b8HR19QnI3nhtQUDWZJhCLwJzW6y9tU1ziiLFZrM/edit?tab=t.0#heading=h.a9r0tnd6fhtq
    - 2) Pending comments addressed; numbers added for LZ4 and Delta BA.
    - 3) Check in to the parquet-format repo if there are no additional comments.
    - Draft Go implementation ongoing.
    - TODO: Micah and Gang to give another review (and others!).
- "File" logical type: https://lists.apache.org/thread/od9hxfssjgnmsh23o18q78hszowq7pcy
  - A reference to another file in blob storage.
  - If small enough: byte array.
  - Content-type? ETag?
  - Is it a table format concept?
  - 2 separate problems:
    - Medium-size blobs => store in Parquet.
  - Russ:
    - All this info is already in the compressed format.
    - External files are offset references.
    - Lifecycle management?
  - Should we have an extension logical type just for this? Shared across Delta/Iceberg.
  - Rahil:
    - Should we offer a way to materialize the blob?
    - Easier to do lifecycle and governance if data is colocated in the columnar format.
  - Who’s the consumer? Spark? PyTorch?
  - TODO: follow up on the mailing list re: logical types.

On Wed, Mar 11, 2026 at 7:40 AM Julien Le Dem <[email protected]> wrote:
> The next Parquet sync is today Wednesday Mar 11th at 10am PT - 1pm ET -
> *6pm CET* (because the daylight saving time change is not on the same
> date in the US and EU, the meeting is 1h earlier than usual in the CET TZ)
>
> To join the invite, join the group:
> https://groups.google.com/g/apache-parquet-community-sync
>
> Everybody is welcome, bring your topic or just listen in.
>
> (Some more details on how the meeting is run:
> https://lists.apache.org/thread/bjdkscmx7zvgfbw0wlfttxy8h6v3f71t )
