Re: flatbuffer metadata: work-in-progress

Julien Le Dem Thu, 22 Aug 2024 17:46:00 -0700

This looks great,
thank you for sharing.


On Thu, Aug 22, 2024 at 10:42 AM Alkis Evlogimenos
<[email protected]> wrote:

> Hey folks.
>
> As promised I pushed a PR to the main repo with my attempt to use
> flatbuffers for metadata for parquet:
> https://github.com/apache/arrow/pull/43793
>
> The PR builds on top of the metadata extensions in parquet
> https://github.com/apache/parquet-format/pull/254 and tests how fast we can
> parse thrift, thrift+flatbuf, flatbuf alone and also how much time it takes
> to encode flatbuf. In addition at the start of the benchmark it prints out
> the number of row groups/column chunks and thrift/flatbuffer serialized
> bytes.
>
> I structured the commits to contain one optimization each to make their
> effects more visible. I have tracked the progress at the top of the
> benchmark
> <https://github.com/apache/arrow/blob/7f550da9980491a4167318db084e1b50cb100b0f/cpp/src/parquet/metadata3_benchmark.cc#L34-L129>.
>
> The current state is complete sans encryption support. All the bugs are
> mine but ideas are coming from a few folks inside Databricks. As expected
> parsing the thrift+extension footer incurs a very small regression (~1%).
> Parsing/verifying flatbuffers is >20x faster than thrift so I haven't tried
> to make changes to its structure for speed. In the last commit the size of
> flatbuffer metadata is anywhere from slightly smaller to more than 4x
> smaller (!!!).
>
> Unfortunately I can't share the footers I used yet. I am going to wait for
> donations <https://github.com/apache/parquet-benchmark/pull/1> to the
> parquet-benchmarks repository and rerun the benchmark against them.
>
> I would like to invite anyone interested in collaborating to take a look at
> the PR, consider the design decisions made, experiment with it, and
> contribute.
>
> Thank you!
>
