Notes: https://docs.google.com/document/d/e/2PACX-1vSDHW7gvG8eO6aIxaIVPrZSqYYhtRDb5W1imnbpM4QRYNPsTwEO1fU5z7SEhVIFa4YqWJeSRJ9tcXYS/pub
Attendees:
- Julien - Datadog - status of proposals, columns in separate files
- Daniel Weeks - Databricks (encodings, footers, separate files)
- Andrew Lamb - InfluxData, listening in
- Alkis - Databricks, new footer PR for arrow-cpp <https://github.com/apache/arrow/pull/48431>, listening in
- Kenny Daniel - Hyperparam (hyparquet)
- Jeff - Databricks - just listening in
- Micah Kornfield - Databricks - Website updates/Doc Spam
- Vinoo - Keru.ai - just listening in
- Michael Chavinda - DataHaskell - listening
- Prateek Gaur - Snowflake
- Arnav Balyan - Uber
- Adrian Garcia Badaracco - Pydantic - listening in
- Raúl Cumplido - QuantStack - listening in
- Martin Prammer - CMU - Variant Retrospective

Notes/Agenda
- ALP
  - Prepare for review early next week
  - Open questions:
    - What goes in the page header itself. (value count, …)
  - Adding flame graph. Arrow code bit packer is slightly slower than initial POC
- PFOR
  - Started from FastPFOR, comparing to existing
  - GitHub repo. (does it have a license incompatible with ASL?) => we should start from the spec in the paper to avoid license contamination.
  - Similar to ALP patching.
  - Decompression faster than existing pqt encodings and zstd. 2-3x better.
  - Can add more datasets if requested.
  - Does bit packing need to be extensible? (applies to ALP as well. Anything that encodes values to smaller ints)
  - Better compression and better random access. (compared to delta-bit-packed, which has more value dependencies)
  - Alkis: We should go to fastlanes directly instead of starting with simple bitpacking.
  - Prateek:
    - fastlanes is a replacement for bitpacking, not for the PFOR algorithm.
    - so PFOR with a fastlanes layout will be the best encoding
    - fastlanes with deltabitpack won't work
  - TODO: Alkis and Prateek to follow up
- Cascaded encodings
  - For future. Is not a requirement for adding other encodings.
- FSST
  - Draft PR out. Thank you Micah and Gang for reviews.
  - Micah: Need to finalize the spec before we start a cpp impl.
  - TODO: Arnav to refine spec and get feedback on the mailing list.
- Encryption:
  - Adding support for encryption in Parquet CLI.
- Doc Spam:
  - Needed to turn off open comments because of spam
  - Please reach out if you need to comment on docs.
- Website updates:
  - We don’t duplicate the metadata file in the website anymore. Parquet-format is the source of truth. (added missing types: geo <https://parquet.apache.org/docs/file-format/types/geospatial/>, variant <https://parquet.apache.org/docs/file-format/types/variantencoding/>…)
  - Move the implementation status page to a data-driven implementation
    - JSON vs YAML.
  - Versioning scheme.
    - Give a set of features that go together.
- V2 - updates
  - Proposal (PR <https://github.com/apache/parquet-format/pull/535>) to make it clear what is recommended V1 vs V2. (pages are supported, encodings are supported)
  - Should V2 be default?
  - V2:
    - mandates to stop on row boundaries.
      - Can be trickier to write
    - RL, DL are always uncompressed.
      - Sometimes we wanted.
  - Should we create a v3 that is a superset of both and simplify?
    - Includes cascaded encodings
    - THIS IS LOWER PRIORITY THAN NEW ENCODINGS!
  - TODO: Hold a vote on the official status of V2. [Micah to start it]
- Columns in separate files
  - Option1: deprecate pqt metadata for column in separate file.
    - We cannot remove it from the spec because it is used in the _metadata file
    - This might be used in multimodal AI use cases.
    - We’ll need to have discussions on what goes in file format vs table format.
    - Use cases:
      - disproportionate column
      - Column append.
  - Implementation status page on this topic is not clear or accurate
  - Let’s not deprecate it just yet but tell people not to use it.
  - TODO[Micah]: add comment: not used and not supported. We should not start using it without a formal proposal.
  - TODO: start a doc with use cases and constraints?
- Variant Retrospective
  - Martin: CMU working on blog post on collab between OSS, Academia and industry.
    (Java, Rust, Go)
- New footer:
  - Published PR <https://github.com/apache/arrow/pull/48431> to Arrow cpp
  - TODO: PR for flatbuf spec in parquet-format.
  - Should some of the new encodings be supported in arrow to enable processing on encoded values? (ex FSST, ALP, …).
    - TODO: Julien to email the Arrow list

On Tue, Dec 9, 2025 at 5:37 PM Julien Le Dem <[email protected]> wrote:

> The next Parquet sync is tomorrow Wednesday Dec 10th at 10am PT - 1pm ET -
> 7pm CET
>
> To join the invite, join the group:
> https://groups.google.com/g/apache-parquet-community-sync
>
> Everybody is welcome, bring your topic or just listen in.
>
> (Some more details on how the meeting is run:
> https://lists.apache.org/thread/bjdkscmx7zvgfbw0wlfttxy8h6v3f71t )
