

   Alkis: Databricks storage and IO. Goals: make Parquet metadata better
   for wide schemas and in general.

      Get PR in on parquet-benchmark

      Extensions PR in review

      Review ongoing footer experiments.

   Micah: Google. Listening in.

   Rok: freelancing for fintech. Solving a problem related to encryption.
   Nothing to discuss yet. Interested in wide-schema tables. Started
   pushing to donate a wide footer to parquet-benchmark.

   Julien: Datadog. Interested in metadata improvements.

   Ashish: listening in.

   Gene: Databricks, main contributor to the Variant work. Topic: Where to
   put the spec?



   Ongoing Metadata tasks

   *Review Alkis’s footer experiments*

   *Variant type*



   Ongoing Metadata tasks:


   Get PR in on parquet-benchmark:

      Action Items:

         Micah: last review.

         Julien: Review and merge.

   Extensions PR in review:

      Goal to end the vote by the end of the week.

      Minimum 3 binding votes.

      Action Items:

         Micah: last review and vote.


   *Review Alkis’s footer experiments*:

      Standard Google C++ benchmark:

         Add footer

         Convert footer

         Verify footer


         Make sure we don’t blow up the metadata

         Overhead of adding the new footer when not reading it.

      Collecting telemetry in Databricks to get more information on the
      size of metadata (can we use smaller ints for sizes?).

         Can we limit the max size of a row group?

      Hierarchical definition in metadata?

         Move encodings to not be in footer but only with pages.

            General agreement on this

      Action Items:

         Alkis: will start a Google Doc from the benchmark to discuss the
         optimizations that are more controversial.

            Discuss limiting row group size to int32

            Discuss removing stats from the footer or have two layers of

   *Variant type.*

      Arrow or Parquet are good hosting projects for that. Parquet
      appears to have more consensus.

      Logical encoding with a fairly complex spec.

         Needs a separate jar.

            Consumable by ORC, Avro, …

      Currently: Spark holds the spec and the code

         Contribute Spec first: collect comments and make changes

         Code takes a little longer: need to refactor to separate from Spark

         There will be more than one implementation (or even more than one
         JVM impl)

      Parquet Cpp

         Current Arrow impl combines IO and allocation in the library

         Would be better to have a separate lib that does not have IO nor

      Follow up:

         Gene to start a google doc to form a plan and will share on the



   Issue on compat testing:

On Wed, Aug 28, 2024 at 9:00 AM Julien Le Dem <> wrote:

> The next Parquet Sync is happening today at 9:30am PT - 12:30pm ET -
> 6:30pm CET
> (in 30min)
> To join the invite:
> Everybody is welcome, bring your topic or just listen in.
> Best
> Julien
