>
> Thank you, these are interesting. Can you share instructions on how to
> reproduce the reported numbers? I am interested to review the code used to
> generate these results (esp the C++ thrift code)


The numbers are based on internal code (Photon). They are not very far off
from https://github.com/apache/arrow/pull/43793. I will update that PR in
the coming weeks so that we can repro the same benchmarks with open source
code too.

On Fri, Oct 17, 2025 at 5:52 PM Andrew Lamb <[email protected]> wrote:

> Thanks Alkis, that is interesting data.
>
> > We found that the reported numbers were not reproducible on AWS instances
>
> I just updated the benchmark results[1] to include results from
> an AWS m6id.8xlarge instance (and they are indeed about 2x slower than when
> run on my 2023 Mac laptop)
>
> > You can find the summary of our findings in a separate tab in the
> > proposal document:
>
> Thank you, these are interesting. Can you share instructions on how to
> reproduce the reported numbers? I am interested to review the code used to
> generate these results (esp the C++ thrift code)
>
> Thanks
> Andrew
>
>
> [1]:
>
> https://github.com/alamb/parquet_footer_parsing?tab=readme-ov-file#results-on-linux
>
>
> On Fri, Oct 17, 2025 at 10:20 AM Alkis Evlogimenos
> <[email protected]> wrote:
>
> > Thank you Andrew for putting the code in open source so that we can repro
> > it.
> >
> > We have run the rust benchmarks and also run the flatbuf proposal with
> > our C++ thrift parser, the flatbuf footer with Thrift conversion, the
> > flatbuf footer without Thrift conversion, and the flatbuf footer
> > without Thrift conversion and without verification. You can find the
> > summary of our findings in a separate tab in the proposal document:
> >
> >
> > https://docs.google.com/document/d/1kZS_DM_J8n6NKff3vDQPD1Y4xyDdRceYFANUE0bOfb0/edit?tab=t.ve65qknb3sq1#heading=h.3uwb5liauf1s
> >
> > The TLDR is that flatbuf is 5x faster with the Thrift conversion vs the
> > optimized Thrift parsing. It also remains faster than the Thrift parser
> > even if the Thrift parser skips statistics. Furthermore if Thrift
> > conversion is skipped, the speedup is 50x, and if verification is skipped
> > it goes beyond 150x.
> >
> >
> > On Tue, Sep 30, 2025 at 5:56 PM Andrew Lamb <[email protected]>
> > wrote:
> >
> > > Hello,
> > >
> > > I did some benchmarking for the new parser[2] we are working on in
> > > arrow-rs.
> > >
> > > This benchmark achieves nearly an order of magnitude improvement (7x)
> > > parsing Parquet metadata with no changes to the Parquet format, by
> > > simply writing a more efficient thrift decoder (which can also skip
> > > statistics).
> > >
> > > While we have not implemented a similar decoder in other languages such
> > > as C/C++ or Java, given the similarities in the existing thrift libraries
> > > and usage, we expect similar improvements are possible in those languages
> > > as well.
> > >
> > > Here are some inline images:
> > > [image: image.png]
> > > [image: image.png]
> > >
> > >
> > > You can find full details here [1]
> > >
> > > Andrew
> > >
> > >
> > > [1]: https://github.com/alamb/parquet_footer_parsing
> > > [2]: https://github.com/apache/arrow-rs/issues/5854
> > >
> > >
> > > On Wed, Sep 24, 2025 at 5:59 PM Ed Seidl <[email protected]> wrote:
> > >
> > >> > Concerning Thrift optimization, while a 2-3x improvement might be
> > >> > achievable, Flatbuffers are currently demonstrating a 10x
> > >> > improvement. Andrew, do you have a more precise estimate for the
> > >> > speedup we could expect in C++?
> > >>
> > >> Given my past experience on cuDF, I'd estimate about 2X there as well.
> > >> cuDF has its own metadata parser that I once benchmarked against the
> > >> thrift generated parser.
> > >>
> > >> And I'd point out that beyond the initial 2X improvement, rolling your
> > >> own parser frees you of having to parse out every structure in the
> > >> metadata.
> > >>
> > >
> >
>
