> Andrew, do you have a more precise estimate for the speedup we could
> expect in C++?

I do not yet, but I will try to find out. I have filed an issue[1] to
track the question and will try to enlist some help.

It will be fun to benchmaxx our new parser

Andrew

[1]: https://github.com/apache/arrow-rs/issues/8441

On Wed, Sep 24, 2025 at 6:38 AM Alkis Evlogimenos
<[email protected]> wrote:

> Thank you all for taking the time to go through the doc and for your feedback.
> I'd like to address some of the key points raised:
>
> Regarding nested Flatbuffers, there's no parsing benefit to using them. In
> the current prototype, approximately two-thirds of the decoding cost comes
> from converting the Flatbuffer to `FileMetadata` (the Thrift object) to
> simplify the rollout process. Even with this conversion, we're observing a
> greater than 10x improvement in footer decoding time for footers that
> perform poorly with Thrift (at the p999 percentile). Removing the
> `FileMetadata` translation should easily provide another 2x speedup.
>
> Concerning Thrift optimization, while a 2-3x improvement might be
> achievable, Flatbuffers are currently demonstrating a 10x improvement.
> Andrew, do you have a more precise estimate for the speedup we could expect
> in C++? It's also important to note that Thrift's format does not allow for
> random access, meaning we will always have to parse the entire footer,
> regardless of which columns are requested.
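
[Editor's note: the sequential-parsing point above comes down to Thrift's compact protocol encoding every integer as a variable-length varint, so a field's byte offset is only known after decoding everything before it. A minimal sketch in Rust of that varint rule — an illustration of the encoding, not code from any Thrift library:]

```rust
/// Decode one unsigned LEB128-style varint, as used by Thrift's compact
/// protocol, advancing `pos` past it. Because each value's width is only
/// known after reading its bytes, reaching the Nth field means walking
/// every field before it -- there is no random access.
fn read_varint(buf: &[u8], pos: &mut usize) -> u64 {
    let mut result = 0u64;
    let mut shift = 0;
    loop {
        let byte = buf[*pos];
        *pos += 1;
        result |= u64::from(byte & 0x7f) << shift;
        if byte & 0x80 == 0 {
            return result;
        }
        shift += 7;
    }
}
```

For example, in the byte sequence `[0x96, 0x01, 0x05, 0xac, 0x02]` (the varints 150, 5, 300 back to back), the offset of the third value cannot be computed without decoding the first two.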
>
> I will work on getting numbers for LZ4 compressed versus raw footers, but
> please be aware that this will take some time.
>
> Finally, the 32-bit narrowing of row group sizes appears to be the most
> contentious aspect of the design. I suggest we discuss this live during our
> next Parquet sync. For the record, shrinking the offsets is the second most
> significant optimization for Flatbuffer footer size, with statistics being
> the first.
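
[Editor's note: for readers following along, the narrowing concern can be sketched as a checked conversion — a hypothetical illustration, not code from the proposal. Any row group whose byte size exceeds `u32::MAX` (about 4 GiB) simply cannot be represented after narrowing to 32 bits:]

```rust
/// Hypothetical illustration of the 32-bit narrowing concern: converting
/// a 64-bit size down to 32 bits fails for any row group larger than
/// u32::MAX bytes (~4 GiB), which is the generality objection raised above.
fn narrow_size(size: u64) -> Option<u32> {
    u32::try_from(size).ok()
}
```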
>
> See you all in the next sync.
>
>
> On Wed, Sep 17, 2025 at 10:02 AM Antoine Pitrou <[email protected]>
> wrote:
>
> >
> > Hi Andrew,
> >
> > I haven't heard of anything like this for C++, but it is an intriguing
> > idea.
> >
> > Regards
> >
> > Antoine.
> >
> >
> > On Tue, 16 Sep 2025 16:44:14 -0400
> > Andrew Lamb <[email protected]>
> > wrote:
> > > Has anyone spent time optimizing the Thrift decoder (e.g. not just
> > > using whatever a general-purpose Thrift compiler generates, but
> > > custom-coding a parser just for Parquet metadata)?
> > >
> > > Ed is in the process of implementing just such a decoder in arrow-rs[1]
> > > and has seen a 2-3x performance improvement (with no change to the
> > > format) in early benchmark results. This is in line with our earlier
> > > work on the topic[2], where we estimated a 2-4x performance improvement
> > > from implementation improvements alone.
> > >
> > > Andrew
> > >
> > > [1]: https://github.com/apache/arrow-rs/issues/5854
> > > [2]: https://www.influxdata.com/blog/how-good-parquet-wide-tables/
> > >
> > > On Tue, Sep 16, 2025 at 4:12 AM Antoine Pitrou <[email protected]>
> > > wrote:
> > >
> > > >
> > > > Hi again,
> > > >
> > > > Ok, a quick summary of my current feedback on this:
> > > >
> > > > - decoding speed measurements are given, but not footer size
> > > >   measurements; it would be interesting to have both
> > > >
> > > > - it's not obvious whether the stated numbers are for reading all
> > > >   columns or a subset of them
> > > >
> > > > - optional LZ4 compression is mentioned, but no numbers are given for
> > > >   it; it would be nice if numbers were available for both
> > > >   uncompressed and compressed footers
> > > >
> > > > - the numbers seem quite underwhelming currently, I think most of us
> > > >   were expecting massive speed improvements given past discussions
> > > >
> > > > - I'm firmly against narrowing sizes to 32 bits; making the footer
> > > >   more compact is useful, but not to the point of reducing
> > > >   usefulness or generality
> > > >
> > > >
> > > > A more general proposal: given the slightly underwhelming perf
> > > > numbers, have nested FlatBuffers been considered as an alternative?
> > > >
> > > > For example, the RowGroup table could become:
> > > > ```
> > > > table ColumnChunk {
> > > >   file_path: string;
> > > >   meta_data: ColumnMetadata;
> > > >   // etc.
> > > > }
> > > >
> > > > table EncodedColumnChunk {
> > > >   // FlatBuffers-encoded ColumnChunk, to be decoded/validated
> > > >   // individually (a table, since FlatBuffers structs cannot
> > > >   // hold vector fields)
> > > >   column: [ubyte];
> > > > }
> > > >
> > > > table RowGroup {
> > > >   columns: [EncodedColumnChunk];
> > > >   total_byte_size: int;
> > > >   num_rows: int;
> > > >   sorting_columns: [SortingColumn];
> > > >   file_offset: long;
> > > >   total_compressed_size: int;
> > > >   ordinal: short = null;
> > > > }
> > > > ```
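
[Editor's note: the gain from nesting is that each column chunk becomes an opaque byte blob that can be decoded and validated on demand. A rough Rust sketch of that access pattern, using plain length-prefixed blobs in place of nested FlatBuffers — an assumption for illustration, not the proposal's actual encoding:]

```rust
/// Serialize chunks as length-prefixed blobs (a stand-in for a vector of
/// nested FlatBuffers). Each payload is opaque to the outer structure.
fn encode_chunks(chunks: &[&[u8]]) -> Vec<u8> {
    let mut out = Vec::new();
    for c in chunks {
        out.extend_from_slice(&(c.len() as u32).to_le_bytes());
        out.extend_from_slice(c);
    }
    out
}

/// Fetch chunk `n` by hopping over the fixed-size length prefixes of the
/// chunks before it; their payloads are never decoded or validated.
fn get_chunk(buf: &[u8], n: usize) -> &[u8] {
    let mut pos = 0;
    for _ in 0..n {
        let len = u32::from_le_bytes(buf[pos..pos + 4].try_into().unwrap()) as usize;
        pos += 4 + len;
    }
    let len = u32::from_le_bytes(buf[pos..pos + 4].try_into().unwrap()) as usize;
    &buf[pos + 4..pos + 4 + len]
}
```

The design trade-off is the usual one: per-chunk decode/validate cost scales with the columns actually requested, at the price of an extra length (or offset) indirection per chunk in the outer table.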
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >
> > > >
> > > >
> > > > On Thu, 11 Sep 2025 08:41:34 +0200
> > > > Alkis Evlogimenos
> > > > <[email protected]>
> > > > wrote:
> > > > > Hi all. I am sharing as a separate thread the proposal for the
> > > > > footer change we have been working on:
> > > > > https://docs.google.com/document/d/1kZS_DM_J8n6NKff3vDQPD1Y4xyDdRceYFANUE0bOfb0/edit
> > > > >
> > > > > The proposal outlines the technical aspects of the design and the
> > > > > experimental results of shadow-testing this in production
> > > > > workloads. I would like to discuss the proposal's most salient
> > > > > points in the next sync:
> > > > > 1. the use of FlatBuffers as the footer serialization format
> > > > > 2. the additional limitations imposed on Parquet files (row group
> > > > >    size limit, row group max-num-rows limit)
> > > > >
> > > > > I would prefer comments on the Google doc to facilitate async
> > > > > discussion.
> > > > >
> > > > > Thank you,
> > > > >