Meeting Notes: https://docs.google.com/document/d/e/2PACX-1vSDHW7gvG8eO6aIxaIVPrZSqYYhtRDb5W1imnbpM4QRYNPsTwEO1fU5z7SEhVIFa4YqWJeSRJ9tcXYS/pub

Attendees:
- Julien Le Dem: Datadog, interested in updates on ongoing projects (ALP, fixed-length-array, metadata, …) and next release
- Dusan Paripovic: RTE, listening in
- Neelesh Salian: Apple, listening in
- Martin Prammer: Spiral, Datasets Project (Raincloud)
- Connor Tsui: Spiral, listening in
- Andrew Lamb: InfluxData, listening in
- Ismaël Mejía: Microsoft, performance improvements on Java
- Rok Mihevc: G-Research/Arctos Alliance <https://arctosalliance.org/>, Flatbuffers, vector-like datatype proposal
- Kenny Daniel: Hyperparam, listening in
- Russell Spitzer: Snowflake, listening in
- Will Edwards: Spotify, listening in
- Robert Kruszewski: Spiral, listening in
- Amogh Jahagirdar: Databricks, listening in
- Micah Kornfield: Databricks, listening in
- Daniel Weeks: Databricks, format improvements (footer, encodings, pages, types)
- Arnav Balyan: FSST
- Jiayi Wang: Databricks
- Benjamin Owad: Snowflake, listening in
- Ashish Paliwal: Apple, listening in
- Jiaying Li: Snowflake, listening in

Agenda:
- Parquet-java Release.
  - Err on the side of releasing often.
  - Gang is helping to make the release.
  - Ideally, this is more automated. Apache infra is working on this.
  - Small group of release managers per project.
  - TODO: start 2 threads
    - Russell: release automation
    - Raising your hand to shepherd release scope definition.
- [Andrew, Julien] Finalizing ALP (Floating Point Encoding for Parquet).
  - Mailing list: https://lists.apache.org/thread/cg68jco16ltqs6xrwphol5co8o2yjhpf
  - Andrew/Micah/Antoine reviewing the spec: https://github.com/apache/parquet-format/pull/557
  - Parquet-format examples in https://github.com/apache/parquet-testing/pull/100 (Andrew thinks they are quite large)
  - Needs: Reviews of C++ implementation and Java implementation
  - Andrew: will review Rust implementation “soon” https://github.com/apache/arrow-rs/pull/9372
  - Hyparquet (JavaScript) ALP branch: https://github.com/hyparam/hyparquet/pull/161
  - Roll out plan:
    - Using ALP is behind a flag to enable read but write
    - Model of fairly granular releases? 1 thing at a time. Example of the Iceberg model.
    - Here is the current implementation status page (notes dates when implementations supported the feature, not versions): https://parquet.apache.org/docs/file-format/implementationstatus/
      - See “Minimum Version for Read Support by Year” table as the current “state of the art”
    - Decision points:
      - Is it linear or not: supporting Vx means supporting everything before
      - Feature flags that turn into a v.
    - TODO:
      - Thread:
        - Review the existing process voted upon https://lists.apache.org/thread/nq7n6pbp222txrfo232ybgpvlvpmykbp and see what is missing
        - Clarify the feature flag as a standard across implementations.
        - Clarify how ALP becomes on by default.
    - Not a blocker to releasing ALP behind a flag.
  - Pending validation:
    - C++ unsigned ints. => need more tests from the Java perspective.
- Parquet metadata (footer work) progress.
  - Jiayi: experiments, making every field modular
  - Flatbuffers might not be necessary.
  - TODO:
    - Discuss further in the working group.
    - Update on the list.
- [Weeks] Parquet: Non-contiguous Pages <https://docs.google.com/document/d/1nntcYM98PFSkHT70RexSBPtCnWqg1uRJ5_7m--ZgbsA>
  - TODO: look at the proposal posted on the list above.
  - There is interest in the project to address this head on.
  - The problem of asymmetric column size is not new.
- [Rok] Discuss new vector-like datatype proposal
  - Parquet fixed-size list type <https://docs.google.com/document/d/1nf30OqK_UqxA4YTEZQszmOBEG56m9M5mp9rIYC2SUWc/edit?tab=t.0> or reply on the ML https://lists.apache.org/thread/rolncdtobpmdqmqcr3ry087yhfw210l3
  - 3-4 proposals discussed so far.
  - The doc has an analysis of the pros and cons of each.
  - TODO:
    - Please comment in the doc with feedback. We’ll discuss next time with the goal of making a decision.
  - Preferred option: New vector_repetition type.
    - Improve on read and write. Not writing repetition_level. Still has nullability info (optional).
  - Daniel to ask a question in the doc regarding whether this works well with encodings.
  - O(1) read constraint?
- [Ismael] Java encoding/decoding performance.
  - 15 PRs (5 more yay!), 3 approved, 5 in progress, 7 to be reviewed
  - https://github.com/apache/parquet-java/issues/3530
  - PRs have been broken up for easier and more efficient review.
  - TODO:
    - Need reviews! Pretty please 🙂
    - Merge approved PRs.
    - Feel free to reply to the thread on the ML.
- [Martin] Datasets Project
  - https://github.com/spiraldb/raincloud
  - Make it easy to evaluate file formats in a reproducible framework with public datasets.
  - Make sure to cover all types and encodings for validating Parquet. Not necessarily scale like TPC-H.
  - Don’t want to redistribute someone else's crawled data because of licensing constraints.
  - Future goal to generate good Parquet datasets for evaluation.
  - Feedback welcome on where this is going next.
    - Ex: Not pick one implementation over another.
- [Will E] Effort to add defensive validation in mainstream open source readers?
  - Have a more formalized list of checks that readers should perform, to give better errors, handle forward compatibility, and introduce breaking changes gracefully.
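
To make the last item concrete, here is a minimal sketch of one check a defensive reader could perform before parsing anything else: validating the outer file framing. It relies only on the documented Parquet layout (a 4-byte "PAR1" magic at both ends, with a 4-byte little-endian footer length just before the trailing magic). The names check_parquet_framing and ParquetValidationError are hypothetical and not taken from any existing reader; the notes do not define the actual list of checks, so this is only an illustration.

```python
import struct

MAGIC = b"PAR1"  # this sketch only handles the plain (unencrypted footer) magic


class ParquetValidationError(ValueError):
    """Hypothetical error type used only by this sketch."""


def check_parquet_framing(data: bytes) -> int:
    """Validate the outer framing of a Parquet file held in memory.

    Returns the footer metadata length in bytes if the framing looks sane.
    """
    # Leading magic + 4-byte footer length + trailing magic.
    min_size = len(MAGIC) * 2 + 4
    if len(data) < min_size:
        raise ParquetValidationError(
            f"file too small: {len(data)} bytes, need at least {min_size}"
        )
    if data[:4] != MAGIC:
        raise ParquetValidationError(f"bad leading magic: {data[:4]!r}")
    if data[-4:] != MAGIC:
        raise ParquetValidationError(f"bad trailing magic: {data[-4:]!r}")

    # Footer length is stored little-endian right before the trailing magic.
    (footer_len,) = struct.unpack("<I", data[-8:-4])
    if footer_len > len(data) - min_size:
        raise ParquetValidationError(
            f"footer length {footer_len} exceeds available {len(data) - min_size} bytes"
        )
    return footer_len
```

A reader could run a check like this first and surface the specific message to the user, instead of failing later with a generic deserialization error.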

On Wed, May 6, 2026 at 7:20 AM Julien Le Dem <[email protected]> wrote:
> The next Parquet sync is today Wednesday May 6th at 10am PT - 1pm ET -
> 7pm CET (in ~2.5h)
>
> To join the invite, join the group:
> https://groups.google.com/g/apache-parquet-community-sync
>
> Everybody is welcome, bring your topic or just listen in.
>
> (Some more details on how the meeting is run:
> https://lists.apache.org/thread/bjdkscmx7zvgfbw0wlfttxy8h6v3f71t )
>
