Re: Parquet sync tomorrow Sep 3rd

Julien Le Dem Wed, 03 Sep 2025 11:32:43 -0700

Notes from the meeting


Agenda:

   -

   Parquet Variant canonical extension type for Apache Arrow
   -

      https://github.com/apache/arrow/pull/47456
      -

      Vote:
      https://lists.apache.org/thread/1x1kt3j5b193oblsk263j00r7yojzjrj5
      <https://lists.apache.org/thread/1x1kt3j5b193oblsk263j00r7yozjrj5>
      -

   Parquet improvement process
   -

   Flatbuffer metadata: [EXTERNAL] Parquet metadata
   
<https://docs.google.com/document/d/1kZS_DM_J8n6NKff3vDQPD1Y4xyDdRceYFANUE0bOfb0/edit>
   -

   INT96 final decision on SortOrder vs Not
   -


      
https://github.com/apache/parquet-format/commit/300b018e0beffb4c335b4a0d7763d9edc1f3cd06
      merged a clarification on ordering
      -

   Empty Structs


Notes:

   -

   Variant extension type
   -

      Vote in Arrow to add it. Documentation ongoing
      -

      Open questions:
      -

         Should the order of the fields matter?
         -

            Many utilities reading/writing Parquet<>Arrow are in the arrow
            repo. Many rely on ordering.
            -

            => Changing the order is semantically equivalent.
            -

         Namespacing:
         -

            Metadata on top of existing type. “arrow.” is the prefix for
            Arrow canonical extension type names.
            -

            Question: Should we use “parquet.” or “arrow.parquet.” as the
            prefix for the Parquet Variant canonical extension type name?
            -

            => “arrow.parquet.” sounds like the consensus.
            -

   Parquet improvement process
   -

      Julien to update and merge the process PR:
      https://github.com/apache/parquet-format/pull/513
      -

      Micah to follow up with PR about expectations from the corresponding
      doc.
      -

      Prateek: suggestion to organize datasets to validate with.
      -

      TODO: Figure out which open-source datasets to use for
      validation.
      -

   INT96 final decision on SortOrder vs Not
   -

      No objections to the last update to the spec
      -

      Status quo is fine.
      -

   Flatbuffer metadata:
   -

      The open PR is not in its final state yet
      -

      Design goals: wide tables
      -

         More compact
         -

         Thrift reading is data dependent. => In the flatbuf prototype, a
         one-to-one mapping is 3-4X bigger (fixed-width ints)
         -

         Needed second step of removing data:
         -

            Stats encoding is wasteful.
            -

            Unneeded fields (path_in_schema)
            -

            Remove old logical types
            -

            Data that can be moved out of the footer as a 2nd step.
            -

      TODO (Alkis):
      -

         Add encryption
         -

         Add extension mechanism
         -

         Have open source
         -

      Interested in contributing:

         Rok: Rust

         Matt: Go

TODO: Review and discuss details in next meeting: [EXTERNAL] Parquet
metadata
<https://docs.google.com/document/d/1kZS_DM_J8n6NKff3vDQPD1Y4xyDdRceYFANUE0bOfb0/edit?usp=drivesdk>
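
For context on the "fixed width int" point in the Flatbuffer notes: the Thrift compact protocol stores integers as zigzag varints, so the encoded size depends on the value, while FlatBuffers scalars are fixed width. A rough sketch of that size difference follows. The sample values are made up for illustration and are not taken from the prototype or from any real footer.

```python
import struct


def zigzag(n: int) -> int:
    """Map a signed 64-bit integer to unsigned so small magnitudes stay small."""
    return ((n << 1) ^ (n >> 63)) & 0xFFFFFFFFFFFFFFFF


def varint_size(n: int) -> int:
    """Bytes needed to store a signed 64-bit integer as a zigzag varint."""
    u = zigzag(n)
    size = 1
    while u >= 0x80:  # 7 payload bits per byte; more bytes follow while u >= 128
        u >>= 7
        size += 1
    return size


# A fixed-width 64-bit integer always takes 8 bytes, whatever its value.
FIXED_I64_SIZE = struct.calcsize("<q")

# Hypothetical small values of the kind a footer might hold (counts, sizes).
values = [0, 1, 100, 4096, 1_000_000]

varint_total = sum(varint_size(v) for v in values)
fixed_total = FIXED_I64_SIZE * len(values)

print(f"varint: {varint_total} bytes, fixed width: {fixed_total} bytes")
```

This only covers integer fields. The overall footer ratio also depends on strings, nesting and vtable overhead, which the sketch ignores.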

On Tue, Sep 2, 2025 at 4:55 PM Julien Le Dem <[email protected]> wrote:

>
> The next Parquet sync is tomorrow Sep 3rd at 10am PT - 1pm ET - 7pm CET
> I'll facilitate unless someone else wants to do it (feel free to reply to
> this email)
>
> To join the invite, join the group:
> https://groups.google.com/g/apache-parquet-community-sync
>
> Everybody is welcome, bring your topic or just listen in.
>
> (Some more details on how the meeting is run:
> https://lists.apache.org/thread/bjdkscmx7zvgfbw0wlfttxy8h6v3f71t )
>
