This looks great, thank you for sharing.
On Thu, Aug 22, 2024 at 10:42 AM Alkis Evlogimenos <[email protected]> wrote:
> Hey folks.
>
> As promised I pushed a PR to the main repo with my attempt to use
> flatbuffers for metadata for parquet:
> https://github.com/apache/arrow/pull/43793
>
> The PR builds on top of the metadata extensions in parquet
> https://github.com/apache/parquet-format/pull/254 and tests how fast we can
> parse thrift, thrift+flatbuf, flatbuf alone and also how much time it takes
> to encode flatbuf. In addition at the start of the benchmark it prints out
> the number of row groups/column chunks and thrift/flatbuffer serialized
> bytes.
>
> I structured the commits to contain one optimization each to make their
> effects more visible. I have tracked the progress at the top of the benchmark
> <https://github.com/apache/arrow/blob/7f550da9980491a4167318db084e1b50cb100b0f/cpp/src/parquet/metadata3_benchmark.cc#L34-L129>.
>
> The current state is complete sans encryption support. All the bugs are
> mine but ideas are coming from a few folks inside Databricks. As expected
> parsing the thrift+extension footer incurs a very small regression (~1%).
> Parsing/verifying flatbuffers is >20x faster than thrift so I haven't tried
> to make changes to its structure for speed. In the last commit the size of
> flatbuffer metadata is anywhere from slightly smaller to more than 4x
> smaller (!!!).
>
> Unfortunately I can't share the footers I used yet. I am going to wait for
> donations <https://github.com/apache/parquet-benchmark/pull/1> to the
> parquet-benchmarks repository and rerun the benchmark against them.
>
> I would like to invite anyone interested in collaborating to take a look at
> the PR, consider the design decisions made, experiment with it, and
> contribute.
>
> Thank you!
