Parquet Sync Notes July 17th 2024

Attendees:

   -

   Parth: contributor to Apache Drill, Arrow-java. Currently working on
   DataFusion-Comet (Spark impl). Topics: What is the direction of Parquet
   “V3”? Can we make Parquet on S3 easier and better?
   -

   Micah: OSS Data formats for BigQuery at Google. Topics: Interested in
   keeping Parquet current. How do we manage incompatible changes and releases
   for Parquet java.
   -

   Gabor: Dremio, Parquet community member for a long time. Topics:
   Interested in the “Parquet V3” initiative
   -

   Alkis: Databricks storage stacks (scans, clouds). Worked at Google on
   fancy encodings in the past. Topics: Discussing Parquet metadata handling
   and new encodings. Specifically, the extension proposal: [EXTERNAL]
   Parquet extensions
   <https://docs.google.com/document/d/1KkoR0DjzYnLQXO-d0oRBv2k157IZU0_injqd4eV4WiI/edit#heading=h.15ohoov5qqm6>
   -

   Fokko: Tabular/Databricks. Topics: deprecation of old Hadoop and Java
   support. Jira -> GitHub migration.
   -

   Ed: Lawrence Livermore, power user of Parquet. Topics: How to use
   Parquet with GPUs. cuDF. Parquet improvement process.
   -

   Julien: Datadog, very early committer/founder. Topics: Interested in
   making Parquet current, new encodings. How do we solve footer metadata
   scaling. How to make incompatible changes without breaking everyone. How to
   add new encodings but not too many. How to improve time series on top of
   Parquet. How to merge sorted files in an efficient way.
   -

   Nong: Databricks, very early committer/founder. Topics: Interested in
   the V3 stuff.


Agenda: (built from topics above)

   -

   Direction of “Parquet V3”:
   -

      Parquet on S3?
      -

      Merging sorted files efficiently.
      -

      More efficient for GPUs?
      -

   Alkis’ extension proposal: [EXTERNAL] Parquet extensions
   <https://docs.google.com/document/d/1KkoR0DjzYnLQXO-d0oRBv2k157IZU0_injqd4eV4WiI/edit#heading=h.15ohoov5qqm6>
   -

   Micah’s Parquet format changes process (we didn’t get to it)
   -

   Jira -> GitHub migration (we didn’t get to it)


Notes:

   -

   Direction for Parquet improvement:
   -

      Rolling out incompatible changes:
      -

         Duplicate information until we’re satisfied: for metadata and for
         encodings as well
         -

      Better metadata:
      -

         Support for wide schemas
         -

      Better encodings:
      -

         More efficient for GPUs? => parallelizable
         -

         Better for time series. Delta encodings?
         -

         BtrBlocks…
         -

      Merge Sorted files on the fly:
      -

         Ability to read the file sequentially from the beginning as a
         whole, since a merge reads the whole file anyway
         -

         Side effect: being able to ignore the footer and still read the
         file in case of corruption.
         -

         Log-structured merge of Parquet files. Ed: interested in doing this
         for GPUs. Need better prediction of the size of data to load into
         GPU memory.
         -

   [EXTERNAL] Parquet extensions
   <https://docs.google.com/document/d/1KkoR0DjzYnLQXO-d0oRBv2k157IZU0_injqd4eV4WiI/edit#heading=h.15ohoov5qqm6>:
   -

      Need an extension mechanism to experiment with a new footer or new
      metadata.
      -

      In a fully forward/backward compatible manner.
      -

      Proposed:
      -

         Reserve a binary field in Thrift that will not be used by the
         spec.
         -

            Thrift readers skip field ids they do not know, so old
            readers will simply ignore the reserved field.
            -

      Discussion:
      -

         2 use cases:
         -

            Proprietary extensions
            -

            New Parquet footer.
            -

         Remove the notion of vendor to focus on the path to migrate to the
         new footer.
         -

      Action Items:
      -

         Alkis to integrate feedback in his proposal and follow up on the
         list to finalize
         -

   Action Items:
   -

      Since we didn’t get to 2 items, we agreed to change the meeting to
      bi-weekly for now, until it is no longer needed. The next meeting is
      July 31st at 9:30am PT - 12:30pm ET - 6:30pm CET
      -

         Join here: https://calendar.app.google/UMMmbUV1JMh7ffGt6
         -

      Follow up on the items we didn’t get to on the mailing list:
      -

         Micah’s proposal for the Parquet format changes process
         -

         Jira -> GitHub migration
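
Illustration for the extension discussion in the notes: the proposal
depends on old readers skipping a field id they do not know. Below is a
minimal sketch of that behaviour. It uses a toy tagged encoding (field id,
length, payload), which is an assumption made for illustration and is not
Thrift's actual wire format. The field ids, the names NUM_ROWS, SCHEMA and
EXTENSION, and the value 32767 are all hypothetical and do not come from
the Parquet spec or from the proposal.

```python
import struct

# Toy tagged encoding: each field is (field_id: uint16, length: uint32, payload).
# NOT Thrift's wire format; it only mimics "skip unknown field ids".
HEADER = ">HI"
HEADER_SIZE = struct.calcsize(HEADER)  # 6 bytes

# Hypothetical field ids, chosen for this sketch only.
NUM_ROWS = 1
SCHEMA = 2
EXTENSION = 32767  # stands in for the reserved binary field


def encode_fields(fields):
    """Serialize a {field_id: bytes} mapping into one buffer."""
    out = bytearray()
    for field_id, payload in fields.items():
        out += struct.pack(HEADER, field_id, len(payload))
        out += payload
    return bytes(out)


def decode_fields(buf, known_ids):
    """Decode a buffer, keeping known fields and skipping unknown ones."""
    result = {}
    pos = 0
    while pos < len(buf):
        field_id, length = struct.unpack_from(HEADER, buf, pos)
        pos += HEADER_SIZE
        payload = buf[pos:pos + length]
        pos += length  # advance past the payload whether or not we keep it
        if field_id in known_ids:
            result[field_id] = payload
    return result


footer = encode_fields({
    NUM_ROWS: struct.pack(">I", 42),
    SCHEMA: b"a:int64",
    EXTENSION: b"experimental-footer",
})

# An "old" reader knows nothing about EXTENSION and still reads the rest.
old_reader = decode_fields(footer, {NUM_ROWS, SCHEMA})
# A "new" reader also understands the extension payload.
new_reader = decode_fields(footer, {NUM_ROWS, SCHEMA, EXTENSION})
```

In this sketch the old reader recovers the row count and schema unchanged,
while the new reader additionally sees the extension bytes. That is the
property the notes rely on for experimenting with a new footer without
breaking existing readers.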
