Sorry for the delay, it's been an unexpectedly eventful end of week and
weekend for me.
Thank you, Talat, for setting up the recording and automated summary.

Recording on the @ApacheParquet <https://www.youtube.com/@ApacheParquet>
channel: https://youtu.be/dV2sWlxNshY?si=ldZOPShZQCZwdCsD

For manual notes, see below.
AI meeting notes:
https://docs.google.com/document/d/1Eal2t_I9jFL1lzXT8XJsoATt1bGVpB9NwVSMR-D-Xew
Only those subscribed to [email protected]
<https://groups.google.com/g/apache-parquet-community-sync> can access
them. Subscribing to the list also gets you on the invite for the
recurring meeting.
They come with the caveat that the AI doesn't really have context on what
we're talking about. I was wondering what the "VUS project" I wanted an
update on was, but that's apparently just what my saying "various
projects" sounds like. You've been warned.

Here are the bespoke, globally sourced, human[e]ly produced notes:

Attendees:

   - Adam: G Research. ParquetSharp
   - Dan Weeks: DB, type support
   - Fokko: DB.
   - Jeff: Snowflake, numeric encodings, footer format (no need to parse
     the entire thing)
   - Jiaying: CMU, Rust Variant implementation
   - MengMeng: Snowflake, scan.
   - Micah: DB
   - Neil: Snowflake, types
   - Rok: footer for wide schemas
   - Ron: Alibaba.
   - Russell: Snowflake
   - Ryan: DB, heads up on change for geo types. Variant. Types, footer
   - Sai: Snowflake, data types
   - Talat: Google
   - Vincent: Google BQ
   - Selcuk
   - Prateek Gaur


Notes

   - https://www.youtube.com/@ApacheParquet
   - Geo and Variant:
      - Geo type update:
         - Handling NaN values is problematic
         - Remove NaN values from all bounds
         - Because "not (x < bound)" is not the same as "x >= bound" when
           x is NaN
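The comparison pitfall is easy to demonstrate (Python here purely for
illustration; the same IEEE 754 semantics apply in any language): a
predicate written as `not (x < bound)` and one written as `x >= bound`
give different answers exactly when `x` is NaN, so min/max bounds that
might contain NaN cannot be used safely for pruning.

```python
nan = float("nan")
lo = 0.0

# IEEE 754: every ordered comparison involving NaN is False.
print(nan < lo)    # False
print(nan >= lo)   # False

# So two "equivalent" ways of writing a lower-bound filter disagree on NaN:
keep_negated = not (nan < lo)   # True  -- the row survives the filter
keep_direct = nan >= lo         # False -- the row is pruned
print(keep_negated, keep_direct)
```

Stripping NaN from the bounds sidesteps the ambiguity entirely.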

      - Variant
         - Converters in any object model that produce byte buffers.
         - First object model => Avro
         - Open PR for the write path:
            - PR 3212 on the parquet-java repo
              <https://github.com/apache/parquet-java/pull/3212/>

   - New footer
      - Update on the footer. Alkis is on vacation.
      - Requirements:
         - Parsing can be slow
         - Options:
            - Decompose into sub-Thrift structures
            - FlatBuffers
         - How to phase out the old footer gradually.
         - Column statistics?
            - https://github.com/apache/arrow/pull/43793
         - TODO: follow up on the footer.
         - Parquet Improvements
           <https://docs.google.com/document/d/19hQLYcU5_r5nJB7GtnjfODLlSDiNS24GXAtKg9b0_ls/edit?tab=t.0#heading=h.c1q8jd51inuh>
         - Parquet Metadata evolution
           <https://docs.google.com/document/d/1PQpY418LkIDHMFYCY8ne_G-CFpThK15LLpzWYbc7rFU/edit?tab=t.0>

   - New numeric encodings
      - Action item: write up a process for how to adopt new encodings.
      - Need faster encodings:
         - Gzipped data is super slow
         - Bitpacking, PFOR, …
         - Decent compression and better performance
         - Existing delta encodings are not used much.
         - FOR: a lot faster
         - Existing discussion: Parquet Improvements
           <https://docs.google.com/document/d/19hQLYcU5_r5nJB7GtnjfODLlSDiNS24GXAtKg9b0_ls/edit?tab=t.0#heading=h.vtu381dko9im>
         - BtrBlocks: stacked encodings, does it help?
         - Pcodec
         - Trade-off between compression ratio and decoding speed.
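As a toy sketch of why these lightweight encodings decode quickly
(illustrative only; this is not the actual DELTA_BINARY_PACKED wire
format): after taking deltas, values from a sorted or slowly varying
column are small enough to pack into a few bits each, and decoding is
just shifts and masks rather than a general-purpose decompressor.

```python
def delta_bitpack(values, bits):
    """Toy delta + bit-packing: store the first value, then each delta
    in `bits` bits, packed into one big integer."""
    deltas = [b - a for a, b in zip(values, values[1:])]
    packed = 0
    for i, d in enumerate(deltas):
        assert 0 <= d < (1 << bits), "delta out of range for this toy encoder"
        packed |= d << (i * bits)
    return values[0], packed, len(deltas)

def delta_bitunpack(first, packed, count, bits):
    mask = (1 << bits) - 1
    out = [first]
    for i in range(count):
        # Shift, mask, add back to the running value: cheap per-value work.
        out.append(out[-1] + ((packed >> (i * bits)) & mask))
    return out

ts = [1000, 1003, 1004, 1010, 1012]   # e.g. monotonically increasing timestamps
enc = delta_bitpack(ts, bits=4)       # deltas 3, 1, 6, 2 all fit in 4 bits
assert delta_bitunpack(*enc, bits=4) == ts
```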
         - Next steps:
            - Micah: publish a draft of the process
            - Work on practical next steps:
               - Selcuk, Jeff
               - Talat
               - Micah, Alkis

         - Need to deprecate some of the old V2 encodings
         - Delta byte array: FSST (Fast Static Symbol Table)
            - https://github.com/cwida/fsst
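To give a flavor of the idea (a heavily simplified toy, not the real
FSST algorithm linked above, which learns an optimal table of up to 255
symbols of up to 8 bytes each from a sample of the data): frequent
substrings are replaced by one-byte codes from a static table, with an
escape byte for bytes the table does not cover.

```python
# Toy static symbol table: code byte -> symbol. The symbols here are
# made up for illustration; real FSST learns them from the data.
table = {0: b"http://", 1: b"www.", 2: b".com", 3: b".org"}
ESC = 255  # escape marker for literal bytes not covered by any symbol

def encode(data, table):
    symbols = sorted(table.items(), key=lambda kv: -len(kv[1]))  # longest match first
    out, i = bytearray(), 0
    while i < len(data):
        for code, sym in symbols:
            if data.startswith(sym, i):
                out.append(code)
                i += len(sym)
                break
        else:
            out += bytes([ESC, data[i]])  # escape one literal byte
            i += 1
    return bytes(out)

def decode(enc, table):
    out, i = bytearray(), 0
    while i < len(enc):
        if enc[i] == ESC:
            out.append(enc[i + 1]); i += 2
        else:
            out += table[enc[i]]; i += 1
    return bytes(out)

url = b"http://www.example.com"
enc = encode(url, table)
assert decode(enc, table) == url
assert len(enc) < len(url)  # 17 bytes vs. 22 here
```

Note that escaping doubles uncovered bytes, which is exactly why the
real algorithm chooses the symbol table to fit the data.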

         - Need datasets to validate encodings:
            - What's a good dataset to validate against?
            - Combination of public datasets => reproducible
            - Private datasets => real world.
            - Sorted vs unsorted/randomized

   - Types:
      - Increase decimal precision
      - Interval types need to be fixed to match the SQL spec. Maybe
        Russell will pick it up.
      - Integer 128? In Arrow first? Vincent to write down what to achieve.
         - Statistics
         - Encodings

On Tue, May 13, 2025 at 6:09 PM Julien Le Dem <[email protected]> wrote:

> The next Parquet sync is tomorrow May 13th at 10am PT - 1pm ET - 7pm CET
> To join the invite, join the group:
> https://groups.google.com/g/apache-parquet-community-sync
>
> Please contact me to be added to the recurring invite. (every two weeks)
> Everybody is welcome, bring your topic or just listen in.
>
> (Some more details on how the meeting is run:
> https://lists.apache.org/thread/bjdkscmx7zvgfbw0wlfttxy8h6v3f71t )
>
