If we embed both a flatbuffer footer and a thrift footer, will readers be able to completely skip the thrift footer and read only the flatbuffer footer? Or will they have to download / read both? If they have to download the bytes for both, I'm not sure how big the win will be; on object storage, slow IO can be what dominates.
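To make the IO question concrete: a Parquet reader finds the metadata via the fixed trailer at the end of the file (the serialized FileMetaData, followed by its 4-byte little-endian length and the "PAR1" magic). If the flatbuffer is embedded inside that FileMetaData payload, the same suffix range fetch covers both footers, so the expected saving is parse time rather than bytes downloaded. A minimal sketch of the trailer parse (standard Parquet layout, independent of either proposal):

```python
import struct

def read_footer(data: bytes) -> bytes:
    """Return the serialized FileMetaData bytes from a Parquet file.

    Trailer layout: [FileMetaData bytes][4-byte LE length]["PAR1"].
    Any flatbuffer embedded within FileMetaData sits inside the
    returned range, i.e. it is fetched together with the thrift footer.
    """
    assert data[-4:] == b"PAR1", "not a Parquet file"
    (meta_len,) = struct.unpack("<I", data[-8:-4])
    return data[-8 - meta_len : -8]
```

On object storage a reader typically guesses a suffix range, reads the last 8 bytes to learn `meta_len`, and fetches more only if the guess was too small; embedding one footer inside the other does not change that access pattern.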
> On Oct 20, 2025, at 9:49 AM, Raphael Taylor-Davies <[email protected]> wrote:
>
> I don't disagree that two files is much harder than one file, but is that the
> use-case that the flatbuffer format is solving for, or is that adequately
> serviced by the existing thrift-based footer? I had interpreted the
> flatbuffer more as a way to accelerate larger datasets consisting of many
> files, and of less utility for the single-file use-case.
>
> That being said, I misread the proposal: I thought it was proposing replacing
> the thrift-based footer with a flatbuffer, which would be very disruptive.
> However, it looks like instead the (new?) proposal is to create a duplicate
> flatbuffer footer embedded within the thrift footer, which can simply be
> ignored by readers. The proposal is a bit vague about whether all information
> would be duplicated, or whether some information would only appear in the
> flatbuffer payload, but presuming it is a true duplicate, many of my points
> don't apply.
>
> Kind Regards,
>
> Raphael
>
> On 20/10/2025 15:28, Antoine Pitrou wrote:
>> I don't think it's a "small price to pay". Parquet files are widely
>> used to share or transfer data of all kinds (in a way, they replace CSV
>> with much better characteristics). Sharing a single file is easy;
>> sharing two related files while keeping their relationship intact is an
>> order of magnitude more difficult.
>>
>> Regards
>>
>> Antoine.
>>
>> On Mon, 20 Oct 2025 12:23:20 +0100
>> Personal <[email protected]> wrote:
>>> Apologies if this has already been discussed, but have we considered simply
>>> storing these flatbuffers as separate files alongside existing parquet
>>> files?
>>> I think this would have a number of quite compelling advantages:
>>>
>>> - no breaking format changes: all readers can continue to read all
>>>   parquet files
>>> - people can generate these "index" files for existing datasets without
>>>   having to rewrite all their files
>>> - older and newer readers can coexist - no stop-the-world migrations
>>> - can potentially combine multiple flatbuffers into a single file for
>>>   better IO when scanning collections of files - potentially very valuable
>>>   for object stores, and would also help people on HDFS and other systems
>>>   that struggle with small files
>>> - could potentially even inline these flatbuffers into catalogs like Iceberg
>>> - can continue to iterate at a faster rate, without the constraint of
>>>   needing to move in lockstep with parquet versioning
>>> - potentially less confusing for users: parquet files stay the same,
>>>   they just can be accelerated by these new index files
>>>
>>> This would mean some data duplication, but that seems a small price to pay,
>>> and would be strictly opt-in for users with use-cases that justify it?
>>>
>>> Kind Regards,
>>>
>>> Raphael
>>>
>>> On 20 October 2025 11:08:59 BST, Alkis Evlogimenos <[email protected]> wrote:
>>>>> Thank you, these are interesting. Can you share instructions on how to
>>>>> reproduce the reported numbers? I am interested to review the code used to
>>>>> generate these results (esp the C++ thrift code)
>>>>
>>>> The numbers are based on internal code (Photon). They are not very far off
>>>> from https://github.com/apache/arrow/pull/43793. I will update that PR in
>>>> the coming weeks so that we can repro the same benchmarks with open source
>>>> code too.
>>>>
>>>> On Fri, Oct 17, 2025 at 5:52 PM Andrew Lamb <[email protected]> wrote:
>>>>
>>>>> Thanks Alkis, that is interesting data.
>>>>>
>>>>>> We found that the reported numbers were not reproducible on AWS instances
>>>>>
>>>>> I just updated the benchmark results [1] to include results from an
>>>>> AWS m6id.8xlarge instance (and they are indeed about 2x slower than when
>>>>> run on my 2023 Mac laptop).
>>>>>
>>>>>> You can find the summary of our findings in a separate tab in the
>>>>>> proposal document:
>>>>>
>>>>> Thank you, these are interesting. Can you share instructions on how to
>>>>> reproduce the reported numbers? I am interested to review the code used to
>>>>> generate these results (esp the C++ thrift code).
>>>>>
>>>>> Thanks
>>>>> Andrew
>>>>>
>>>>> [1]: https://github.com/alamb/parquet_footer_parsing?tab=readme-ov-file#results-on-linux
>>>>>
>>>>> On Fri, Oct 17, 2025 at 10:20 AM Alkis Evlogimenos <[email protected]> wrote:
>>>>>
>>>>>> Thank you Andrew for putting the code in open source so that we can
>>>>>> repro it.
>>>>>>
>>>>>> We have run the rust benchmarks, and also run the flatbuf proposal with
>>>>>> our C++ thrift parser, the flatbuf footer with Thrift conversion, the
>>>>>> flatbuf footer without Thrift conversion, and the flatbuf footer
>>>>>> without Thrift conversion and without verification. You can find the
>>>>>> summary of our findings in a separate tab in the proposal document:
>>>>>>
>>>>>> https://docs.google.com/document/d/1kZS_DM_J8n6NKff3vDQPD1Y4xyDdRceYFANUE0bOfb0/edit?tab=t.ve65qknb3sq1#heading=h.3uwb5liauf1s
>>>>>>
>>>>>> The TLDR is that flatbuf is 5x faster with the Thrift conversion vs the
>>>>>> optimized Thrift parsing. It also remains faster than the Thrift parser
>>>>>> even if the Thrift parser skips statistics. Furthermore, if Thrift
>>>>>> conversion is skipped, the speedup is 50x, and if verification is also
>>>>>> skipped it goes beyond 150x.
>>>>>>
>>>>>> On Tue, Sep 30, 2025 at 5:56 PM Andrew Lamb <[email protected]> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> I did some benchmarking for the new parser [2] we are working on in
>>>>>>> arrow-rs.
>>>>>>>
>>>>>>> This benchmark achieves nearly an order of magnitude improvement (7x)
>>>>>>> parsing Parquet metadata with no changes to the Parquet format, by
>>>>>>> simply writing a more efficient thrift decoder (which can also skip
>>>>>>> statistics). While we have not implemented a similar decoder in other
>>>>>>> languages such as C/C++ or Java, given the similarities in the existing
>>>>>>> thrift libraries and their usage, we expect similar improvements are
>>>>>>> possible in those languages as well.
>>>>>>>
>>>>>>> Here are some inline images:
>>>>>>> [image: image.png]
>>>>>>> [image: image.png]
>>>>>>>
>>>>>>> You can find full details here [1].
>>>>>>>
>>>>>>> Andrew
>>>>>>>
>>>>>>> [1]: https://github.com/alamb/parquet_footer_parsing
>>>>>>> [2]: https://github.com/apache/arrow-rs/issues/5854
>>>>>>>
>>>>>>> On Wed, Sep 24, 2025 at 5:59 PM Ed Seidl <[email protected]> wrote:
>>>>>>>
>>>>>>>>> Concerning Thrift optimization, while a 2-3x improvement might be
>>>>>>>>> achievable, Flatbuffers are currently demonstrating a 10x improvement.
>>>>>>>>> Andrew, do you have a more precise estimate for the speedup we could
>>>>>>>>> expect in C++?
>>>>>>>>
>>>>>>>> Given my past experience on cuDF, I'd estimate about 2x there as well.
>>>>>>>> cuDF has its own metadata parser that I once benchmarked against the
>>>>>>>> thrift-generated parser.
>>>>>>>>
>>>>>>>> And I'd point out that beyond the initial 2x improvement, rolling your
>>>>>>>> own parser frees you from having to parse out every structure in the
>>>>>>>> metadata.
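For reference, the "skip statistics" trick the quoted messages describe comes down to the Thrift compact protocol letting a parser hop over a field body without decoding it: a binary field is just a varint length followed by that many raw bytes. A minimal, hypothetical sketch of that idea (illustrative helper names; not taken from arrow-rs or cuDF):

```python
# Sketch of skipping a Thrift compact-protocol binary field (e.g. an
# encoded min/max statistics value) without materializing it.

def read_varint(buf: bytes, pos: int):
    """Decode a ULEB128 varint; return (value, position after it)."""
    result = shift = 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not (b & 0x80):  # high bit clear: last byte of the varint
            return result, pos
        shift += 7

def skip_binary_field(buf: bytes, pos: int) -> int:
    """Skip a binary/string field body: length varint, then raw bytes."""
    length, pos = read_varint(buf, pos)
    return pos + length  # advance past the payload without copying it
```

A hand-rolled parser applies the same idea recursively to structs, lists, and maps, which is what frees it from materializing every statistics object the generated Thrift code would otherwise allocate.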
