Following up on my action item, I have created the parquet-benchmark repo: https://github.com/apache/parquet-benchmark

On Wed, Jul 31, 2024 at 3:46 PM Julien Le Dem <[email protected]> wrote:

> Attendees:
>
> - Micah: Google, no special topic today
> - Alkis: Databricks, storage stack. Topic: Parquet extension PR so that
>   we can go in the format. Want to fix the metadata to make it work for
>   wide schemas.
> - Vinoo: Palantir -> startup in data space. Working on improving the
>   website.
> - Julien: Datadog. Topic: Make parquet reading possible to be done
>   sequentially (as opposed to footer first)
> - Rok: Voltron -> freelance in Fintech. Care about Parquet performance.
>   Have time to contribute to footers (“V3”).
>
> Follow up items:
>
> Mika’s Parquet format changes process
>
> - First PR merged, need to finalize java
> - => Mostly done
>
> Jira -> github migration
>
> - Getting started with github. Will follow up on the mailing list.
> - => mostly closed discussion. Some follow up async on the discussion.
>
> Agenda:
>
> - Finalizing [EXTERNAL] Parquet extensions
>   <https://docs.google.com/document/d/1KkoR0DjzYnLQXO-d0oRBv2k157IZU0_injqd4eV4WiI/edit#heading=h.15ohoov5qqm6>
>   - AI: Alkis Evlogimenos <[email protected]> to update PR with
>     everything in the doc except Alternatives Considered and split the
>     examples in another page.
> - New footer metadata discussion.
>
> Discussion:
>
> - Extensions:
>   - Add functionality to read/write the extension and show that we can
>     ignore it.
>     - 1: write an extension and read the old footer that ignores it.
>     - 2: write extension and allow reading it back.
> - New metadata:
>   - Flatbuffer is bigger than thrift: need to optimize metadata
>   - Start from a 1-1 implementation to existing footer and keep
>     iterating 1 commit at a time.
>   - Would like to have a branch in github arrow cpp or a public fork on
>     github to share the prototype.
>   - Add to parquet-tool to print the footer.
>   - Add utility to obfuscate schema so that people can share their
>     metadata without sharing proprietary information.
>     - That way we can have data about slow footers and validate we can
>       read faster with the new footer
>     - => creation of a database of footers.
>   - Getting a feel of what features are used by users.
>   - Alkis would want to share his findings through a blog post.
>   - Also need to make sure the addition of the new footer doesn’t
>     impact old footers too much.
>   - Possibly:
>     - Codspeed for performance testing
>     - Thrift linter: https://github.com/thrift-labs/thrift-fmt
> - AI:
>   - [Julien] Create a parquet-benchmark repo for a footer db and other
>     things
>     - Example: https://github.com/rok/parquet-benchmark
>   - Alkis to pick where on github to push his prototype branch
>   - Follow up on:
>     - https://github.com/apache/parquet-format/pull/445
