I don't think it's a "small price to pay". Parquet files are widely used to share or transfer data of all kinds (in a way, they replace CSV with much better characteristics). Sharing a single file is easy; sharing two related files while keeping their relationship intact is an order of magnitude more difficult.
Regards

Antoine.


On Mon, 20 Oct 2025 12:23:20 +0100
Personal <[email protected]> wrote:

> Apologies if this has already been discussed, but have we considered simply
> storing these flatbuffers as separate files alongside existing parquet files?
> I think this would have a number of quite compelling advantages:
>
> - no breaking format changes, all readers can continue to read all
> parquet files
> - people can generate these "index" files for existing datasets without
> having to rewrite all their files
> - older and newer readers can coexist - no stop-the-world migrations
> - can potentially combine multiple flatbuffers into a single file for better
> IO when scanning collections of files - potentially very valuable for object
> stores, and would also help people on HDFS and other systems that
> struggle with small files
> - could potentially even inline these flatbuffers into catalogs like Iceberg
> - can continue to iterate at a faster rate, without the constraints of
> needing to move in lockstep with parquet versioning
> - potentially less confusing for users: parquet files are still the same,
> they can just be accelerated by these new index files
>
> This would mean some data duplication, but that seems a small price to pay,
> and would be strictly opt-in for users with use cases that justify it.
>
> Kind Regards,
>
> Raphael
>
> On 20 October 2025 11:08:59 BST, Alkis Evlogimenos
> <[email protected]> wrote:
> >>
> >> Thank you, these are interesting. Can you share instructions on how to
> >> reproduce the reported numbers? I am interested to review the code used to
> >> generate these results (esp the C++ thrift code)
> >
> >
> > The numbers are based on internal code (Photon). They are not very far off
> > from https://github.com/apache/arrow/pull/43793. I will update that PR in
> > the coming weeks so that we can repro the same benchmarks with open source
> > code too.
> >
> > On Fri, Oct 17, 2025 at 5:52 PM Andrew Lamb <[email protected]> wrote:
> >
> >> Thanks Alkis, that is interesting data.
> >>
> >> > We found that the reported numbers were not reproducible on AWS
> >> > instances
> >>
> >> I just updated the benchmark results[1] to include results from an
> >> AWS m6id.8xlarge instance (and they are indeed about 2x slower than when
> >> run on my 2023 Mac laptop)
> >>
> >> > You can find the summary of our findings in a separate tab in the
> >> > proposal document:
> >>
> >> Thank you, these are interesting. Can you share instructions on how to
> >> reproduce the reported numbers? I am interested to review the code used to
> >> generate these results (esp the C++ thrift code)
> >>
> >> Thanks
> >> Andrew
> >>
> >> [1]:
> >> https://github.com/alamb/parquet_footer_parsing?tab=readme-ov-file#results-on-linux
> >>
> >>
> >> On Fri, Oct 17, 2025 at 10:20 AM Alkis Evlogimenos
> >> <[email protected]> wrote:
> >>
> >> > Thank you Andrew for putting the code in open source so that we can repro
> >> > it.
> >> >
> >> > We have run the Rust benchmarks and also run the flatbuf proposal with our
> >> > C++ thrift parser, the flatbuf footer with Thrift conversion, the
> >> > flatbuf footer without Thrift conversion, and the flatbuf footer
> >> > without Thrift conversion and without verification. You can find the
> >> > summary of our findings in a separate tab in the proposal document:
> >> >
> >> > https://docs.google.com/document/d/1kZS_DM_J8n6NKff3vDQPD1Y4xyDdRceYFANUE0bOfb0/edit?tab=t.ve65qknb3sq1#heading=h.3uwb5liauf1s
> >> >
> >> > The TL;DR is that flatbuf is 5x faster with the Thrift conversion vs the
> >> > optimized Thrift parsing. It also remains faster than the Thrift parser
> >> > even if the Thrift parser skips statistics. Furthermore, if the Thrift
> >> > conversion is skipped, the speedup is 50x, and if verification is skipped
> >> > it goes beyond 150x.
> >> >
> >> > On Tue, Sep 30, 2025 at 5:56 PM Andrew Lamb <[email protected]>
> >> > wrote:
> >> >
> >> > > Hello,
> >> > >
> >> > > I did some benchmarking for the new parser[2] we are working on in
> >> > > arrow-rs.
> >> > >
> >> > > This benchmark achieves nearly an order of magnitude improvement (7x)
> >> > > parsing Parquet metadata with no changes to the Parquet format, by simply
> >> > > writing a more efficient thrift decoder (which can also skip statistics).
> >> > >
> >> > > While we have not implemented a similar decoder in other languages such as
> >> > > C/C++ or Java, given the similarities in the existing thrift libraries and
> >> > > usage, we expect similar improvements are possible in those languages as
> >> > > well.
> >> > >
> >> > > Here are some inline images:
> >> > > [image: image.png]
> >> > > [image: image.png]
> >> > >
> >> > > You can find full details here [1]
> >> > >
> >> > > Andrew
> >> > >
> >> > > [1]: https://github.com/alamb/parquet_footer_parsing
> >> > > [2]: https://github.com/apache/arrow-rs/issues/5854
> >> > >
> >> > >
> >> > > On Wed, Sep 24, 2025 at 5:59 PM Ed Seidl <[email protected]> wrote:
> >> > >
> >> > >> > Concerning Thrift optimization, while a 2-3x improvement might be
> >> > >> > achievable, Flatbuffers are currently demonstrating a 10x improvement.
> >> > >> > Andrew, do you have a more precise estimate for the speedup we could
> >> > >> > expect in C++?
> >> > >>
> >> > >> Given my past experience on cuDF, I'd estimate about 2x there as well.
> >> > >> cuDF has its own metadata parser that I once benchmarked against the
> >> > >> thrift-generated parser.
> >> > >>
> >> > >> And I'd point out that beyond the initial 2x improvement, rolling your
> >> > >> own parser frees you from having to parse out every structure in the
> >> > >> metadata.
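Raphael's sidecar proposal above amounts to a purely additive lookup rule on the reader side. A minimal sketch of what that could look like, assuming a hypothetical `<name>.parquet.idx` naming convention (the thread does not settle on any suffix or layout, and the helper names here are invented for illustration):

```python
from pathlib import Path

def sidecar_index_path(parquet_path: Path) -> Path:
    # Hypothetical convention: the flatbuffer index lives next to the
    # data file as "<name>.parquet.idx". Illustrative only; nothing in
    # the thread fixes a naming scheme.
    return parquet_path.with_name(parquet_path.name + ".idx")

def open_with_optional_index(parquet_path: Path):
    """Return the data path plus the index path if one exists.

    Older readers never look for the sidecar; newer readers use it
    when present, so both coexist without a format change.
    """
    idx = sidecar_index_path(parquet_path)
    return parquet_path, (idx if idx.exists() else None)
```

The opt-in nature falls out directly: a reader that never probes for the sidecar behaves exactly as today, and a missing index simply degrades to the current footer-parsing path.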
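On the hand-rolled-parser point at the end of the thread: Thrift's compact protocol (which Parquet footers use) is simple at the byte level. Integers are LEB128-style varints, and signed values are zigzag-encoded, which is part of what makes custom decoders like the arrow-rs one feasible. A minimal sketch of just those two wire rules (illustrative only, not any of the parsers benchmarked above):

```python
def read_uvarint(buf: bytes, pos: int):
    """Decode one unsigned LEB128-style varint starting at pos.

    Each byte contributes its low 7 bits; the high bit signals
    continuation. Returns (value, new_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return result, pos
        shift += 7

def zigzag_to_int(n: int) -> int:
    # Undo zigzag encoding: 0 -> 0, 1 -> -1, 2 -> 1, 3 -> -2, ...
    return (n >> 1) ^ -(n & 1)

# The bytes 0xAC 0x02 decode to the varint 300; interpreted as a
# zigzag-encoded signed integer, that is the value 150.
value, end = read_uvarint(bytes([0xAC, 0x02]), 0)
assert value == 300 and zigzag_to_int(value) == 150
```

A real footer parser layers field headers and struct/list nesting on top of these primitives, which is where the ability to skip fields (such as statistics, as the arrow-rs decoder does) pays off.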
