Notes from the meeting
Agenda:
-
Parquet Variant canonical extension type for Apache Arrow
-
https://github.com/apache/arrow/pull/47456
-
Vote:
https://lists.apache.org/thread/1x1kt3j5b193oblsk263j00r7yojzjrj5
<https://lists.apache.org/thread/1x1kt3j5b193oblsk263j00r7yozjrj5>
-
Parquet improvement process
-
Flatbuffer metadata: [EXTERNAL] Parquet metadata
<https://docs.google.com/document/d/1kZS_DM_J8n6NKff3vDQPD1Y4xyDdRceYFANUE0bOfb0/edit>
-
INT96 final decision on SortOrder vs Not
-
https://github.com/apache/parquet-format/commit/300b018e0beffb4c335b4a0d7763d9edc1f3cd06
merged a clarification on ordering
-
Empty Structs
Notes:
-
Variant extension type
-
There is a vote in Arrow to add it. Documentation is ongoing.
-
Open questions:
-
Should the order of the fields matter?
-
Many utilities for reading/writing Parquet<>Arrow live in the Arrow
repo, and many of them rely on field ordering.
-
=> Changing the order is semantically equivalent.
-
Namespacing:
-
Metadata on top of existing type. “arrow.” is the prefix for
Arrow canonical extension type names.
-
Question: Should we use “parquet.” or “arrow.parquet.” as the
prefix for the Parquet Variant canonical extension type name?
-
=> “arrow.parquet.” sounds like the consensus.
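On the field-order question above, a minimal sketch of what "changing the order is semantically equivalent" could mean for a reader: fields are matched by name rather than by position. The field names and type strings below are illustrative assumptions, not taken from the spec or the PR, and the helper is hypothetical.

```python
def same_fields_ignoring_order(a, b):
    """Return True if two lists of (name, type) pairs hold the same fields,
    regardless of the order in which they are listed."""
    return sorted(a) == sorted(b)


# Two hypothetical writers emitting the same storage fields in different order.
writer_a = [("metadata", "binary"), ("value", "binary")]
writer_b = [("value", "binary"), ("metadata", "binary")]

# Position-based comparison treats them as different ...
positional_match = writer_a == writer_b
# ... while name-based comparison treats them as equivalent.
by_name_match = same_fields_ignoring_order(writer_a, writer_b)
```

Utilities that index fields by position would need a lookup by name like this one to tolerate reordering.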
-
Parquet improvement process
-
Julien to update and merge the process PR:
https://github.com/apache/parquet-format/pull/513
-
Micah to follow up with PR about expectations from the corresponding
doc.
-
Prateek: suggestion to organize datasets to validate with.
-
TODO: We will have to figure out the open source datasets to use for
validation.
-
INT96 final decision on SortOrder vs Not
-
No objections to the last update to the spec.
-
Status quo is fine.
-
Flatbuffer metadata:
-
The open PR is not in its final state yet.
-
Design goals: wide tables
-
More compact
-
Thrift reading is data dependent. => Flatbuf prototype: a one-to-one
mapping is 3-4X bigger in flatbuf (fixed-width ints).
-
A second step of removing data was needed:
-
Stats encoding is wasteful.
-
Unneeded fields (path_in_schema)
-
Remove old logical types
-
Data that can be moved out of the footer as a 2nd step.
-
TODO (Alkis):
-
Add encryption
-
Add extension mechanism
-
Have open source
-
Interested in contributing:
Rok: Rust
Matt: Go
TODO: Review and discuss details in the next meeting: [EXTERNAL] Parquet
metadata
<https://docs.google.com/document/d/1kZS_DM_J8n6NKff3vDQPD1Y4xyDdRceYFANUE0bOfb0/edit?usp=drivesdk>
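On the "fixed width int" point in the Flatbuffer notes above: Thrift's compact protocol writes integers as variable-length varints, while FlatBuffers scalars take a fixed width. A rough sketch of why a one-to-one mapping can grow, assuming plain unsigned varints (zigzag encoding is omitted) and 8-byte fixed integers. The sample values are hypothetical, so the resulting ratio is not the 3-4X measured on the prototype.

```python
def uvarint_len(n: int) -> int:
    """Number of bytes needed to encode a non-negative int as a varint
    (7 payload bits per byte)."""
    length = 1
    while n >= 0x80:
        n >>= 7
        length += 1
    return length


# Hypothetical footer-like values (counts, small sizes, an offset).
values = [0, 3, 17, 120, 4096, 1_000_000]

varint_bytes = sum(uvarint_len(v) for v in values)  # variable-width total
fixed_bytes = 8 * len(values)                       # 8 bytes per value

print(f"varint: {varint_bytes} bytes, fixed 64-bit: {fixed_bytes} bytes")
```

Small values dominate in this sample, which is where fixed-width encoding loses the most.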
On Tue, Sep 2, 2025 at 4:55 PM Julien Le Dem <[email protected]> wrote:
>
> The next Parquet sync is tomorrow Sep 3rd at 10am PT - 1pm ET - 7pm CET
> I'll facilitate unless someone else wants to do it (feel free to reply to
> this email)
>
> To join the invite, join the group:
> https://groups.google.com/g/apache-parquet-community-sync
>
> Everybody is welcome, bring your topic or just listen in.
>
> (Some more details on how the meeting is run:
> https://lists.apache.org/thread/bjdkscmx7zvgfbw0wlfttxy8h6v3f71t )
>