I don't disagree that two files are much harder to manage than one, but is
that the use-case the flatbuffer format is solving for, or is that
adequately serviced by the existing thrift-based footer? I had
interpreted the flatbuffer more as a way to accelerate larger datasets
consisting of many files, and as being of less utility for the single-file use-case.
That being said, I misread the proposal: I thought it was proposing to
replace the thrift-based footer with a flatbuffer, which would be very
disruptive. However, it looks like the (new?) proposal is instead to
create a duplicate flatbuffer footer embedded within the thrift footer,
which readers can simply ignore. The proposal is a bit vague about
whether all information would be duplicated, or whether some information
would only be present in the flatbuffer payload, but presuming it is a
true duplicate, many of my points don't apply.
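For concreteness, here is a minimal sketch (mine, not the proposal's) of the standard footer layout - serialized thrift FileMetaData, then a 4-byte little-endian footer length, then the "PAR1" magic. The point is that a flatbuffer payload carried inside the thrift structure (for example in key_value_metadata; where exactly it would live is an assumption on my part) is just opaque bytes to a reader that stops at this layout:

    // Minimal C++ sketch: locate the serialized thrift FileMetaData at the end of
    // a Parquet file. Any flatbuffer payload embedded *inside* that thrift structure
    // is invisible to a reader that only does this and then hands the bytes to its
    // thrift parser.
    #include <cstdint>
    #include <cstring>
    #include <utility>

    // Returns {offset, length} of the thrift footer, or {0, 0} if the file is malformed.
    std::pair<uint64_t, uint64_t> LocateThriftFooter(const uint8_t* file, uint64_t file_len) {
      if (file_len < 12) return {0, 0};                   // header magic + length + footer magic
      if (std::memcmp(file + file_len - 4, "PAR1", 4) != 0) return {0, 0};
      uint32_t footer_len = 0;
      std::memcpy(&footer_len, file + file_len - 8, 4);   // little-endian; fine on x86/ARM
      if (uint64_t(footer_len) + 12 > file_len) return {0, 0};
      return {file_len - 8 - footer_len, footer_len};
    }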
Kind Regards,
Raphael
On 20/10/2025 15:28, Antoine Pitrou wrote:
I don't think it's a "small price to pay". Parquet files are widely
used to share or transfer data of all kinds (in a way, they replace CSV
with much better characteristics). Sharing a single file is easy,
sharing two related files while keeping their relationship intact is an
order of magnitude more difficult.
Regards
Antoine.
On Mon, 20 Oct 2025 12:23:20 +0100, Personal <[email protected]> wrote:
Apologies if this has already been discussed, but have we considered simply
storing these flatbuffers as separate files alongside existing parquet files? I
think this would have a number of quite compelling advantages:
- no breaking format changes, all readers can continue to read all parquet files
- people can generate these "index" files for existing datasets without having to rewrite all their files
- older and newer readers can coexist - no stop-the-world migrations
- multiple flatbuffers can potentially be combined into a single file for better IO when scanning collections of files - potentially very valuable for object stores, and it would also help people on HDFS and other systems that struggle with small files
- these flatbuffers could potentially even be inlined into catalogs like Iceberg
- we can continue to iterate at a faster rate, without the constraint of needing to move in lockstep with parquet versioning
- potentially less confusing for users: parquet files are still the same, they can simply be accelerated by these new index files
This would mean some data duplication, but that seems a small price to pay, and
would be strictly opt-in for users with use-cases that justify it?
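To illustrate the opt-in aspect, here is a rough sketch of how a newer reader might discover such a sidecar file - the ".index" suffix and lookup convention are purely hypothetical, not something any proposal has settled on:

    // Hypothetical C++ sketch: look for an optional index file next to the Parquet
    // file. Older readers never look for it; newer readers fall back to the normal
    // thrift footer when it is missing.
    #include <filesystem>
    #include <optional>

    std::optional<std::filesystem::path> FindSidecarIndex(
        const std::filesystem::path& parquet_file) {
      std::filesystem::path candidate = parquet_file;
      candidate += ".index";               // e.g. "data.parquet" -> "data.parquet.index"
      if (std::filesystem::exists(candidate)) return candidate;
      return std::nullopt;                 // no index present: read the file as before
    }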
Kind Regards,
Raphael
On 20 October 2025 11:08:59 BST, Alkis Evlogimenos
<[email protected]> wrote:
Thank you, these are interesting. Can you share instructions on how to
reproduce the reported numbers? I am interested to review the code used to
generate these results (esp the C++ thrift code)
The numbers are based on internal code (Photon). They are not very far off
from https://github.com/apache/arrow/pull/43793. I will update that PR in
the coming weeks so that we can repro the same benchmarks with open source
code too.
On Fri, Oct 17, 2025 at 5:52 PM Andrew Lamb <[email protected]> wrote:
Thanks Alkis, that is interesting data.
We found that the reported numbers were not reproducible on AWS instances
I just updated the benchmark results[1] to include results from an
AWS m6id.8xlarge instance (and they are indeed about 2x slower than when
run on my 2023 Mac laptop).
You can find the summary of our findings in a separate tab in the
proposal document:
Thank you, these are interesting. Can you share instructions on how to
reproduce the reported numbers? I am interested to review the code used to
generate these results (esp the C++ thrift code)
Thanks
Andrew
[1]:
https://github.com/alamb/parquet_footer_parsing?tab=readme-ov-file#results-on-linux
On Fri, Oct 17, 2025 at 10:20 AM Alkis Evlogimenos
<[email protected]> wrote:
Thank you Andrew for putting the code in open source so that we can repro
it.
We have run the Rust benchmarks and have also benchmarked four
configurations: our C++ thrift parser, the flatbuf footer with Thrift
conversion, the flatbuf footer without Thrift conversion, and the flatbuf
footer without Thrift conversion and without verification. You can find the
summary of our findings in a separate tab in the proposal document:
https://docs.google.com/document/d/1kZS_DM_J8n6NKff3vDQPD1Y4xyDdRceYFANUE0bOfb0/edit?tab=t.ve65qknb3sq1#heading=h.3uwb5liauf1s
The TLDR is that the flatbuf footer is 5x faster with Thrift conversion than
the optimized Thrift parsing, and it remains faster than the Thrift parser
even when the Thrift parser skips statistics. Furthermore, if Thrift
conversion is skipped the speedup is 50x, and if verification is also skipped
it goes beyond 150x.
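For readers less familiar with flatbuffers, the gap between the last two numbers comes down to whether the buffer is verified before use. A minimal C++ sketch - the FileMetadata root type and VerifyFileMetadataBuffer() are illustrative stand-ins for whatever the proposal's generated schema provides:

    // Minimal sketch of flatbuffer footer access with and without verification.
    // #include "file_metadata_generated.h"   // hypothetical flatc-generated header
    #include "flatbuffers/flatbuffers.h"
    #include <cstddef>

    const FileMetadata* ParseFlatbufFooter(const uint8_t* buf, size_t len, bool verify) {
      if (verify) {
        // Verification walks every offset and vtable to check it stays in bounds.
        flatbuffers::Verifier verifier(buf, len);
        if (!VerifyFileMetadataBuffer(verifier)) return nullptr;
      }
      // Without verification, GetRoot() just reinterprets the bytes in place;
      // fields are read lazily on access, which is where the larger speedups come from.
      return flatbuffers::GetRoot<FileMetadata>(buf);
    }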
On Tue, Sep 30, 2025 at 5:56 PM Andrew Lamb <[email protected]>
wrote:
Hello,
I did some benchmarking for the new parser[2] we are working on in arrow-rs.
This benchmark achieves nearly an order of magnitude improvement (7x) in
parsing Parquet metadata with no changes to the Parquet format, simply by
writing a more efficient thrift decoder (which can also skip statistics).
While we have not implemented a similar decoder in other languages such as
C/C++ or Java, given the similarities in the existing thrift libraries and
their usage, we expect similar improvements are possible in those languages
as well.
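To give a flavour of what skipping statistics means in practice - this is only an illustrative sketch, not the arrow-rs decoder, and it omits maps, bool-in-list elements and all bounds checks - a hand-written thrift compact-protocol reader can consume an unwanted field without materializing anything:

    // Skip one thrift compact-protocol value in place.
    #include <cstdint>

    static uint64_t ConsumeVarint(const uint8_t*& p) {
      uint64_t v = 0;
      int shift = 0;
      while (*p & 0x80) { v |= uint64_t(*p++ & 0x7f) << shift; shift += 7; }
      v |= uint64_t(*p++) << shift;
      return v;
    }

    // Compact-protocol type codes (subset).
    enum CType : uint8_t { STOP = 0, CT_TRUE = 1, CT_FALSE = 2, I8 = 3, I16 = 4,
                           I32 = 5, I64 = 6, DOUBLE = 7, BINARY = 8, LIST = 9,
                           SET = 10, STRUCT = 12 };

    static void SkipValue(const uint8_t*& p, uint8_t type) {
      switch (type) {
        case CT_TRUE: case CT_FALSE: break;           // bool value lives in the type nibble
        case I8:      ++p; break;
        case I16: case I32: case I64: ConsumeVarint(p); break;  // zigzag varints
        case DOUBLE:  p += 8; break;
        case BINARY: { uint64_t n = ConsumeVarint(p); p += n; break; }  // length-prefixed
        case LIST: case SET: {
          uint8_t header = *p++;                      // (size << 4) | element type
          uint64_t n = header >> 4;
          if (n == 0xF) n = ConsumeVarint(p);         // long form for sizes >= 15
          for (uint64_t i = 0; i < n; ++i) SkipValue(p, header & 0x0F);
          break;
        }
        case STRUCT: {                                // skip nested fields until STOP
          for (;;) {
            uint8_t field = *p++;
            if (field == STOP) break;
            if ((field >> 4) == 0) ConsumeVarint(p);  // explicit field id (zigzag varint)
            SkipValue(p, field & 0x0F);
          }
          break;
        }
        default: break;                               // MAP etc. omitted here
      }
    }

A decoder for ColumnMetaData can call SkipValue() when it reaches the statistics field instead of deserializing it into an owned structure.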
You can find full details here [1]
Andrew
[1]: https://github.com/alamb/parquet_footer_parsing
[2]: https://github.com/apache/arrow-rs/issues/5854
On Wed, Sep 24, 2025 at 5:59 PM Ed Seidl <[email protected]> wrote:
Concerning Thrift optimization, while a 2-3x improvement might be achievable,
Flatbuffers are currently demonstrating a 10x improvement.
Andrew, do you have a more precise estimate for the speedup we could expect in C++?
Given my past experience on cuDF, I'd estimate about 2X there as well.
cuDF has its own metadata parser that I once benchmarked against the
thrift-generated parser.
And I'd point out that beyond the initial 2X improvement, rolling your
own parser frees you from having to parse out every structure in the
metadata.