Apologies if this has already been discussed, but have we considered simply storing these flatbuffers as separate files alongside the existing parquet files? I think this would have a number of quite compelling advantages:
- no breaking format changes: all readers can continue to read all parquet files
- people can generate these "index" files for existing datasets without having to rewrite all their files
- older and newer readers can coexist: no stop-the-world migrations
- multiple flatbuffers can potentially be combined into a single file for better IO when scanning collections of files; this is potentially very valuable for object stores, and would also help people on HDFS and other systems that struggle with small files
- these flatbuffers could potentially even be inlined into catalogs like Iceberg
- we can continue to iterate at a faster rate, without the constraint of needing to move in lockstep with parquet versioning
- potentially less confusing for users: parquet files are still the same, they can just be accelerated by these new index files

This would mean some data duplication, but that seems a small price to pay, and it would be strictly opt-in for users with use cases that justify it.

Kind Regards,
Raphael

On 20 October 2025 11:08:59 BST, Alkis Evlogimenos <[email protected]> wrote:
>>
>> Thank you, these are interesting. Can you share instructions on how to
>> reproduce the reported numbers? I am interested to review the code used to
>> generate these results (esp the C++ thrift code)
>
>
> The numbers are based on internal code (Photon). They are not very far off
> from https://github.com/apache/arrow/pull/43793. I will update that PR in
> the coming weeks so that we can repro the same benchmarks with open source
> code too.
>
> On Fri, Oct 17, 2025 at 5:52 PM Andrew Lamb <[email protected]> wrote:
>
>> Thanks Alkis, that is interesting data.
>>
>> > We found that the reported numbers were not reproducible on AWS instances
>>
>> I just updated the benchmark results[1] to include results from an
>> AWS m6id.8xlarge instance (and they are indeed about 2x slower than when
>> run on my 2023 Mac laptop)
>>
>> > You can find the summary of our findings in a separate tab in the
>> > proposal document:
>>
>> Thank you, these are interesting. Can you share instructions on how to
>> reproduce the reported numbers? I am interested to review the code used to
>> generate these results (esp the C++ thrift code)
>>
>> Thanks
>> Andrew
>>
>> [1]: https://github.com/alamb/parquet_footer_parsing?tab=readme-ov-file#results-on-linux
>>
>> On Fri, Oct 17, 2025 at 10:20 AM Alkis Evlogimenos <[email protected]> wrote:
>>
>> > Thank you Andrew for putting the code in open source so that we can
>> > repro it.
>> >
>> > We have run the rust benchmarks and also run the flatbuf proposal with
>> > our C++ thrift parser, the flatbuf footer with Thrift conversion, the
>> > flatbuf footer without Thrift conversion, and the flatbuf footer
>> > without Thrift conversion and without verification. You can find the
>> > summary of our findings in a separate tab in the proposal document:
>> >
>> > https://docs.google.com/document/d/1kZS_DM_J8n6NKff3vDQPD1Y4xyDdRceYFANUE0bOfb0/edit?tab=t.ve65qknb3sq1#heading=h.3uwb5liauf1s
>> >
>> > The TLDR is that flatbuf is 5x faster with the Thrift conversion vs the
>> > optimized Thrift parsing. It also remains faster than the Thrift parser
>> > even if the Thrift parser skips statistics. Furthermore, if Thrift
>> > conversion is skipped, the speedup is 50x, and if verification is
>> > skipped it goes beyond 150x.
>> >
>> > On Tue, Sep 30, 2025 at 5:56 PM Andrew Lamb <[email protected]> wrote:
>> >
>> > > Hello,
>> > >
>> > > I did some benchmarking for the new parser[2] we are working on in
>> > > arrow-rs.
>> > >
>> > > This benchmark achieves nearly an order of magnitude improvement (7x)
>> > > parsing Parquet metadata with no changes to the Parquet format, by
>> > > simply writing a more efficient thrift decoder (which can also skip
>> > > statistics).
>> > >
>> > > While we have not implemented a similar decoder in other languages
>> > > such as C/C++ or Java, given the similarities in the existing thrift
>> > > libraries and usage, we expect similar improvements are possible in
>> > > those languages as well.
>> > >
>> > > Here are some inline images:
>> > > [image: image.png]
>> > > [image: image.png]
>> > >
>> > > You can find full details here [1]
>> > >
>> > > Andrew
>> > >
>> > > [1]: https://github.com/alamb/parquet_footer_parsing
>> > > [2]: https://github.com/apache/arrow-rs/issues/5854
>> > >
>> > > On Wed, Sep 24, 2025 at 5:59 PM Ed Seidl <[email protected]> wrote:
>> > >
>> > >> > Concerning Thrift optimization, while a 2-3x improvement might be
>> > >> > achievable, Flatbuffers are currently demonstrating a 10x
>> > >> > improvement. Andrew, do you have a more precise estimate for the
>> > >> > speedup we could expect in C++?
>> > >>
>> > >> Given my past experience on cuDF, I'd estimate about 2X there as well.
>> > >> cuDF has its own metadata parser that I once benchmarked against the
>> > >> thrift generated parser.
>> > >>
>> > >> And I'd point out that beyond the initial 2X improvement, rolling your
>> > >> own parser frees you of having to parse out every structure in the
>> > >> metadata.
>> > >>
>> > >
>> >
>>
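P.S. To make the sidecar compatibility story concrete: a reader that understands the proposed index files only needs a trivial look-up-and-fall-back step, while older readers never look for the sidecar at all. Here is a minimal sketch in Python; the `.pqidx` extension and the `resolve_metadata_source` helper are purely illustrative assumptions, since no naming convention has actually been agreed:

```python
import os

# Hypothetical naming convention for the proposed sidecar index files:
# "data.parquet" -> "data.parquet.pqidx" (the extension is illustrative only).
INDEX_SUFFIX = ".pqidx"

def resolve_metadata_source(parquet_path: str) -> str:
    """Return the path to read footer metadata from.

    Prefer the flatbuffer sidecar when present; otherwise fall back to the
    Thrift footer embedded in the Parquet file itself. Readers that know
    nothing about sidecars simply never look for them, so old and new
    readers coexist without any format change.
    """
    sidecar = parquet_path + INDEX_SUFFIX
    if os.path.exists(sidecar):
        return sidecar   # fast path: flatbuffer index file
    return parquet_path  # fallback: embedded Thrift footer
```

The same probe could equally be a single extra `HEAD`/`GET` against an object store, or be skipped entirely when a catalog (e.g. Iceberg) records the sidecar location directly.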
