Notes: https://docs.google.com/document/d/e/2PACX-1vSDHW7gvG8eO6aIxaIVPrZSqYYhtRDb5W1imnbpM4QRYNPsTwEO1fU5z7SEhVIFa4YqWJeSRJ9tcXYS/pub

Mar 11, 2026

Attendees:
- Micah Kornfield: Databricks (Nothing to discuss)
- Robert Kruszewski: Spiral (Nothing to discuss)
- Connor Tsui: Vortex (listening in)
- Jiayi Wang: Databricks (Flatbuffers)
- Will Edwards: Spotify (listening in)
- Prateek Gaur: Snowflake
- Ben Owad: Snowflake (listening in)
- Jiaying Li: Snowflake (listening in)
- Rok Mihevc: G-Research/Arctos Alliance <https://arctosalliance.org/> (Flatbuffers, FIXED_SIZE_LIST)
- Vinoo (had to drop early - sorry!): Kepler (ALP Java - caught up with Prateek; misc final website changes)
- Arnav: Uber (ALP Go, FSST spec review)
- Julien: Datadog (listening in)
- Russell: Snowflake (listening in)
- Andrew Lamb: InfluxData (Nothing to discuss)
- Rahil Chertara: Onehouse (listening in)
- Fokko Driesprong (listening in)

Agenda:
- Flatbuffers <https://github.com/apache/parquet-format/pull/544> (GH-531: Add parquet flatbuf schema #544)
  - Back to working on it after a pause. Please have a look.
  - Ready to merge from the author’s perspective.
  - Rok:
    - Main comment: geospatial statistics were omitted.
    - Minor comments still ongoing.
  - Comments were addressed by Jiayi. Need Alkis to resolve his comments.
  - Micah: needs to review again.
  - Need to discuss: how to finalize the plan to make it a permanent footer.
    - Rough plan: replace the magic number to make it the new footer.
      - Encryption to be finalized (mostly addressed now).
      - Extension of the extension?
      - Removes some statistics:
        - Distinct count
          - Will Edwards (Spotify) uses distinct count in an internal project, to pick the right join.
          - TODO: discuss the trade-off of storing them. => We don’t want to drop it.
        - Histograms
      - We can evolve the format as we go.
      - TODO: discuss the trade-off of adding back distinct count and the other dropped statistics. We need a good solution.
- Fixed size list PR <https://github.com/apache/parquet-format/pull/241>, ML <https://lists.apache.org/thread/soqd69k8y7b6z0sxbmgrbxcwxbvlj353>
  - New data type: avoid DL.
  - Option 1: byte array logical type. => Not good for encodings.
  - Option 2 (Antoine): vector DL. => Breaking.
  - Option 3 (Micah): moving away from Dremel encodings. => Breaking.
  - Other options: if you know it’s fixed size => skip reading RL.
  - Intermediate solution vs. long term?
  - TODO:
    - Discuss pros and cons on the ML. Summarize the options.
    - Rok to follow up on the mailing list with the help of Micah.
- JSON to Variant parsing PR: https://github.com/apache/parquet-java/pull/3415
  - Ported from the Spark repo.
  - TODO: add a notice (https://github.com/apache/parquet-java/blob/master/NOTICE).
- Encodings
  - Correctness testing
  - Cross-language testing
  - ALP
    - 1) Spec document: https://docs.google.com/document/d/1xz2cudDpN2Y1ImFcTXh15s-3fPtD_aWt/edit
    - 2) Spec document in the parquet-format repo: https://github.com/apache/parquet-format/pull/557
    - 3) ALP implementation in the Arrow C++ repo: https://github.com/apache/arrow/pull/48345/changes
    - Benchmarking artifacts in the parquet-testing repo: https://github.com/apache/parquet-testing/pull/100
    - Working with Vinoo on cross-language testing: write from C++ -> read from Java, and vice versa.
    - Andrew: can we have small files? => Trimmed to 15,000 rows => a few MBs.
    - Rust impl in progress.
    - Benchmark numbers: https://gist.github.com/pitrou/1f4aefb7034657ce018231d87993f437

      Table 1: C++ ALP Double Decode — Spotify Columns (Graviton 3, ARM Neoverse V1)

      | Column       | -O2 (MB/s) | -O3 (MB/s) | Speedup |
      |--------------|------------|------------|---------|
      | valence      | 3,155      | 5,523      | 1.75x   |
      | danceability | 3,233      | 5,685      | 1.76x   |
      | energy       | 3,197      | 5,652      | 1.77x   |
      | loudness     | 3,186      | 5,473      | 1.72x   |

    - Arrow C++ is built with the -O2 flag instead of -O3. => ~60% speedup with the -O3 flag.
      - Can we use -O3 instead? -O3 is living on the edge.
    - Vinoo will publish Java numbers.
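For readers new to the encoding under discussion, here is a minimal, illustrative Python sketch of the ALP idea: scale each double by a per-vector pair of powers of ten (exponent e, factor f), keep it only if it round-trips exactly through an integer, and patch the remaining values back in as exceptions on decode. The function names and list-based layout are invented for illustration only; the spec PR above is the authoritative definition.

```python
# Toy sketch of the ALP round-trip/exception mechanism (not the spec).
# (e, f) is a per-vector pair of decimal exponents chosen by the encoder.

def alp_encode(values, e, f):
    """Return (integers, exceptions) for one vector under chosen (e, f)."""
    ints, exceptions = [], {}
    for i, d in enumerate(values):
        n = round(d * 10**e / 10**f)      # scale the double to an integer
        if n * 10**f / 10**e == d:        # bit-exact round-trip check
            ints.append(n)
        else:
            ints.append(0)                # placeholder, patched on decode
            exceptions[i] = d             # value stored verbatim instead
    return ints, exceptions

def alp_decode(ints, exceptions, e, f):
    """Invert the scaling, then patch the exceptions back in."""
    out = [n * 10**f / 10**e for n in ints]   # cheap: multiply + divide
    for i, d in exceptions.items():
        out[i] = d
    return out
```

Real implementations pick (e, f) per vector by sampling, vectorize the scale-and-round, and bit-pack the resulting integers; this sketch only shows why decode is a simple multiply plus an exception patch loop.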
    - Next steps:
      - Some stylistic comments to finalize the PR (no crashing).
      - Once testing is finalized (2 different factor sizes, NaN, …), including cross-implementation testing, vote to officially approve the new encoding.
      - Celebrate 🎉
  - FSST
    - 1) Spec doc: https://docs.google.com/document/d/1Xg2b8HR19QnI3nhtQUDWZJhCLwJzW6y9tU1ziiLFZrM/edit?tab=t.0#heading=h.a9r0tnd6fhtq
    - 2) Pending comments addressed; numbers added for LZ4 and Delta BA.
    - 3) Check in to the parquet-format repo if there are no additional comments.
    - Draft Go implementation ongoing.
    - TODO: Micah and Gang to give another review (and others!).
- "File" logical type: https://lists.apache.org/thread/od9hxfssjgnmsh23o18q78hszowq7pcy
  - A reference to another file in blob storage.
  - If small enough: byte array.
  - Content-type? ETag?
  - Is it a table format concept?
  - 2 separate problems:
    - Medium-size blobs => store in Parquet.
  - Russ:
    - All this info is already in the compressed format.
    - External files are offset references.
    - Lifecycle management?
  - Should we have an extension logical type just for this? Shared across Delta/Iceberg.
  - Rahil:
    - Should we offer a way to materialize the blob?
    - Easier to do lifecycle and governance if data is colocated in the columnar format.
  - Who’s the consumer? Spark? PyTorch?
  - TODO: follow up on the mailing list re: logical types.

On Wed, Mar 11, 2026 at 7:40 AM Julien Le Dem <[email protected]> wrote:
> The next Parquet sync is today Wednesday Mar 11th at 10am PT - 1pm ET -
> *6pm CET* (because the daylight saving time change is not on the same
> date in the US and EU, the meeting is 1h earlier than usual in the CET TZ)
>
> To join the invite, join the group:
> https://groups.google.com/g/apache-parquet-community-sync
>
> Everybody is welcome, bring your topic or just listen in.
>
> (Some more details on how the meeting is run:
> https://lists.apache.org/thread/bjdkscmx7zvgfbw0wlfttxy8h6v3f71t )
