Following up on my action item, I have created the parquet-benchmark repo:
https://github.com/apache/parquet-benchmark

On Wed, Jul 31, 2024 at 3:46 PM Julien Le Dem <[email protected]> wrote:

> Attendees:
>
>    - Micah: Google. No special topic today.
>    - Alkis: Databricks, storage stack. Topic: the Parquet extension PR, so
>      that it can go into the format. Wants to fix the metadata to make it
>      work for wide schemas.
>    - Vinoo: Palantir -> startup in the data space. Working on improving the
>      website.
>    - Julien: Datadog. Topic: make it possible to read Parquet sequentially
>      (as opposed to footer first).
>    - Rok: Voltron -> freelance in Fintech. Cares about Parquet performance.
>      Has time to contribute to footers (“V3”).
>
>
> Follow up items:
>
> Micah’s Parquet format changes process
>
>    - First PR merged, need to finalize Java
>    - => Mostly done
>
> Jira -> github migration
>
>    - Getting started with GitHub. Will follow up on the mailing list.
>    - => Discussion mostly closed. Some follow-up will happen async.
>
>
> Agenda:
>
>    - Finalizing [EXTERNAL] Parquet extensions
>      <https://docs.google.com/document/d/1KkoR0DjzYnLQXO-d0oRBv2k157IZU0_injqd4eV4WiI/edit#heading=h.15ohoov5qqm6>
>       - AI: Alkis Evlogimenos <[email protected]> to update the PR with
>         everything in the doc except Alternatives Considered, and to split
>         the examples into a separate page.
>    - New footer metadata discussion.
>
>
> Discussion:
>
>    - Extensions:
>       - Add functionality to read/write the extension and show that we can
>         ignore it.
>          - 1: Write an extension and read the file with the old footer
>            reader, which ignores it.
>          - 2: Write an extension and allow reading it back.
>    - New metadata:
>       - The Flatbuffer footer is bigger than the Thrift one: need to
>         optimize the metadata.
>          - Start from a 1:1 implementation of the existing footer and keep
>            iterating one commit at a time.
>       - Would like to have a branch in GitHub Arrow C++ or a public fork on
>         GitHub to share the prototype.
>       - Add footer printing to parquet-tool.
>          - Add a utility to obfuscate the schema so that people can share
>            their metadata without sharing proprietary information.
>          - That way we can have data about slow footers and validate that
>            we can read faster with the new footer.
>          - => Creation of a database of footers.
>       - Getting a feel for which features are used by users.
>          - Alkis would like to share his findings through a blog post.
>       - Also need to make sure the addition of the new footer doesn’t
>         impact old footers too much.
>       - Possibly:
>          - Codspeed for performance testing
>          - Thrift linter: https://github.com/thrift-labs/thrift-fmt
>       - AI:
>          - [Julien] Create a parquet-benchmark repo for a footer db and
>            other things
>             - Example: https://github.com/rok/parquet-benchmark
>          - Alkis to pick where on GitHub to push his prototype branch
>          - Follow up on:
>             - https://github.com/apache/parquet-format/pull/445
>
>
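
For the schema-obfuscation utility mentioned in the notes, here is a minimal
sketch of one possible approach, not an agreed design. It assumes the schema
has already been extracted from a footer into a nested dict (field name ->
type string, or a nested dict for a group). The function names, the "f_"
prefix, and the dict representation are all hypothetical; nothing here reads
a real Parquet file. Names are replaced by a keyed hash, so the same key gives
the same pseudonyms while nesting and types are preserved.

```python
import hashlib
import hmac


def obfuscate_name(name: str, key: bytes) -> str:
    """Map a field name to a stable pseudonym using a keyed hash (HMAC-SHA256).

    The 12-hex-character truncation is an arbitrary choice for readability;
    collisions are unlikely but not impossible.
    """
    digest = hmac.new(key, name.encode("utf-8"), hashlib.sha256).hexdigest()
    return "f_" + digest[:12]


def obfuscate_schema(schema: dict, key: bytes) -> dict:
    """Return a copy of `schema` with every field name replaced.

    `schema` is a hypothetical nested dict: leaf values are type strings,
    nested dicts represent groups. Types and nesting are left untouched.
    """
    result = {}
    for name, node in schema.items():
        new_name = obfuscate_name(name, key)
        if isinstance(node, dict):
            result[new_name] = obfuscate_schema(node, key)
        else:
            result[new_name] = node
    return result


if __name__ == "__main__":
    example = {
        "customer_id": "INT64",
        "address": {"street": "BYTE_ARRAY", "zip": "INT32"},
    }
    # The key stays private to whoever shares the footer.
    print(obfuscate_schema(example, key=b"private-key"))
```

A keyed hash rather than a plain hash is used so that common column names
cannot be recovered by hashing a dictionary of guesses without the key.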
