Re: Next Parquet sync tomorrow Wednesday December 10th

Julien Le Dem Wed, 10 Dec 2025 17:47:19 -0800

Notes:
https://docs.google.com/document/d/e/2PACX-1vSDHW7gvG8eO6aIxaIVPrZSqYYhtRDb5W1imnbpM4QRYNPsTwEO1fU5z7SEhVIFa4YqWJeSRJ9tcXYS/pub


Attendees:

   -

   Julien - Datadog - status of proposals, columns in separate files
   -

   Daniel Weeks - Databricks (encodings, footers, separate files)
   -

   Andrew Lamb - InfluxData, listening in
   -

   Alkis - Databricks, new footer PR for arrow-cpp
   <https://github.com/apache/arrow/pull/48431>, listening in
   -

   Kenny Daniel - Hyperparam (hyparquet)
   -

   Jeff - Databricks - just listening in
   -

   Micah Kornfield - Databricks - Website updates/Doc Spam
   -

   Vinoo - Keru.ai - just listening in
   -

   Michael Chavinda - DataHaskell - listening
   -

   Prateek Gaur - Snowflake
   -

   Arnav Balyan - Uber
   -

   Adrian Garcia Badaracco - Pydantic - listening in
   -

   Raúl Cumplido - QuantStack - listening in
   -

   Martin Prammer - CMU - Variant Retrospective

Notes/Agenda

   -

   ALP
   -

      Prepare for review early next week
      -

      Open questions:
      -

         What goes in the page header itself? (value count, …)
         -

      Adding flame graph. Arrow code bit packer is slightly slower than
      initial POC
      -

   PFOR
   -

      Started from FastPFOR, comparing to existing
      -

         GitHub repo. (Does it have a license incompatible with ASL?) => we
         should start from the spec in the paper to avoid license contamination.
         -

         Similar to ALP patching.
         -

      Decompression faster than existing pqt encodings and zstd. 2-3x
      better.
      -

      Can add more datasets if requested.
      -

      Does bit packing need to be extensible? (Applies to ALP as well:
      anything that encodes values to smaller ints.)
      -

      Better compression and better random access than delta-bit-packed,
      which has more value dependencies.
      -

      Alkis: We should go to FastLanes directly instead of starting with
      simple bit packing.
      -

      Prateek:
      -

         FastLanes is a replacement for bit packing, not for the PFOR
         algorithm.
         -

         So PFOR with a FastLanes layout will be the best encoding.
         -

         FastLanes with delta bit packing won't work.
         -

      TODO: Alkis and Prateek to follow up
      -

   Cascaded encodings
   -

      For future. Is not a requirement for adding other encodings.
      -

   FSST
   -

      Draft PR out. Thank you Micah and Gang for reviews.
      -

      Micah: Need to finalize the spec before we start a cpp impl.
      -

      TODO: Arnav to refine spec and get feedback on the mailing list.
      -

   Encryption:
   -

      Adding support for encryption in Parquet CLI.
      -

   Doc Spam:
   -

      Needed to turn off open comments because of spam
      -

      Please reach out if you need to comment on docs.
      -

   Website updates:
   -

      We don’t duplicate the metadata file in the websites anymore.
      Parquet-format is the source of truth. (added missing types: geo
      <https://parquet.apache.org/docs/file-format/types/geospatial/>,
      variant
      <https://parquet.apache.org/docs/file-format/types/variantencoding/>…)
      -

      Move the implementation status page to a data-driven implementation
      -

         Json vs yaml.
         -

      Versioning scheme.
      -

         Give a set of features that go together.
         -

   V2 - updates
   -

      Proposal (PR <https://github.com/apache/parquet-format/pull/535>) to
      make it clear what is recommended V1 vs V2. (pages are supported,
      encodings are supported)
      -

      Should V2 be default?
      -

      V2:
      -

         mandates to stop on row boundaries.
         -

            Can be trickier to write
            -

         RL, DL are always uncompressed.
         -

         Sometimes we wanted.
         -

      Should we create a v3 that is a superset of both and simplify?
      -

         Includes cascaded encodings
         -

         THIS IS LOWER PRIORITY THAN NEW ENCODINGS!
         -

      TODO: Hold a vote on the official status of V2. [Micah to start it]
      -

   Columns in separate files
   -

      Option 1: deprecate pqt metadata for columns in separate files.
      -

         We cannot remove it from the spec because it is used in the _metadata file
         -

         This might be used in multimodal AI use cases.
         -

      We’ll need to have discussions on what goes in file format vs table
      format.
      -

      Use cases:
      -

         disproportionate column
         -

         Column append.
         -

      Implementation status page on this topic is not clear or accurate
      -

      Let’s not deprecate it just yet but tell people not to use it.
      -

         TODO[Micah]: add comment: not used and not supported. We should
         not start using it without a formal proposal.
         -

      TODO: start a doc with use cases and constraints?
      -

   Variant Retrospective
   -

      Martin: CMU working on blog post on collab between OSS, Academia and
      industry. (Java, Rust, Go)
      -

   New footer:
   -

      Published PR <https://github.com/apache/arrow/pull/48431> to Arrow cpp
      -

      TODO: PR for flatbuf spec in parquet-format.
      -

   Should some of the new encodings be supported in Arrow to enable
   processing on encoded values? (e.g. FSST, ALP, …)
   -

      TODO: Julien to email the Arrow list
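
For readers following the PFOR item above, here is a minimal sketch of the
general idea of patched frame-of-reference encoding: subtract a reference
value, bit-pack the small deltas at a fixed width, and store the outliers
separately as exceptions that are patched back in on decode. This is an
illustration only. It is not the FastPFOR layout, not a FastLanes layout, and
not the encoding being proposed for Parquet. All function names are
hypothetical, the input is assumed to be a non-empty list of integers, and the
bit width is chosen by the caller rather than by a cost model.

```python
from typing import List, Tuple

# Hypothetical representation of patches: (position, full delta).
Exceptions = List[Tuple[int, int]]


def pack_bits(slots: List[int], bit_width: int) -> bytes:
    """Pack each slot into bit_width bits, lowest position first."""
    acc = 0
    for i, slot in enumerate(slots):
        acc |= slot << (i * bit_width)
    n_bytes = (len(slots) * bit_width + 7) // 8
    return acc.to_bytes(n_bytes, "little")


def unpack_bits(data: bytes, bit_width: int, count: int) -> List[int]:
    """Inverse of pack_bits: read count slots of bit_width bits each."""
    acc = int.from_bytes(data, "little")
    mask = (1 << bit_width) - 1
    return [(acc >> (i * bit_width)) & mask for i in range(count)]


def pfor_encode(values: List[int], bit_width: int) -> Tuple[int, bytes, Exceptions]:
    """Encode values as (reference, packed deltas, exceptions).

    Assumes values is non-empty. Deltas that do not fit in bit_width bits
    are recorded as exceptions, and their packed slot is left as 0.
    """
    reference = min(values)
    mask = (1 << bit_width) - 1
    slots: List[int] = []
    exceptions: Exceptions = []
    for pos, value in enumerate(values):
        delta = value - reference
        if delta > mask:
            exceptions.append((pos, delta))
            slots.append(0)
        else:
            slots.append(delta)
    return reference, pack_bits(slots, bit_width), exceptions


def pfor_decode(reference: int, packed: bytes, bit_width: int,
                count: int, exceptions: Exceptions) -> List[int]:
    """Unpack the deltas, patch in the exceptions, add the reference back."""
    deltas = unpack_bits(packed, bit_width, count)
    for pos, delta in exceptions:
        deltas[pos] = delta
    return [reference + d for d in deltas]


values = [100, 101, 103, 100, 5000, 102]
reference, packed, exceptions = pfor_encode(values, bit_width=2)
decoded = pfor_decode(reference, packed, 2, len(values), exceptions)
```

In this sketch each value is decoded from its own slot plus the exception
list, with no dependency on the previous value. That is the property the notes
contrast with delta-bit-packed.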


On Tue, Dec 9, 2025 at 5:37 PM Julien Le Dem <[email protected]> wrote:

> The next Parquet sync is tomorrow Wednesday Dec 10th at 10am PT - 1pm ET -
> 7pm CET
>
> To join the invite, join the group:
> https://groups.google.com/g/apache-parquet-community-sync
>
> Everybody is welcome, bring your topic or just listen in.
>
> (Some more details on how the meeting is run:
> https://lists.apache.org/thread/bjdkscmx7zvgfbw0wlfttxy8h6v3f71t )
>
>
