> Andrew, do you have a more precise estimate for the speedup we could expect in C++?
I do not yet, but I will try to find out. I have filed an issue[1] to
track the question and will try to enlist some help. It will be fun to
benchmaxx our new parser.

Andrew

[1]: https://github.com/apache/arrow-rs/issues/8441

On Wed, Sep 24, 2025 at 6:38 AM Alkis Evlogimenos <[email protected]> wrote:

> Thank you all for taking the time to go through the doc and for your
> feedback. I'd like to address some of the key points raised:
>
> Regarding nested Flatbuffers, there's no parsing benefit to using them.
> In the current prototype, approximately two-thirds of the decoding cost
> comes from converting the Flatbuffer to `FileMetadata` (the Thrift
> object) to simplify the rollout process. Even with this conversion,
> we're observing a greater than 10x improvement in footer decoding time
> for footers that perform poorly with Thrift (at the p999 percentile).
> Removing the `FileMetadata` translation should easily provide another
> 2x speedup.
>
> Concerning Thrift optimization, while a 2-3x improvement might be
> achievable, Flatbuffers are currently demonstrating a 10x improvement.
> Andrew, do you have a more precise estimate for the speedup we could
> expect in C++? It's also important to note that Thrift's format does
> not allow random access, meaning we will always have to parse the
> entire footer, regardless of which columns are requested.
>
> I will work on getting numbers for LZ4-compressed versus raw footers,
> but please be aware that this will take some time.
>
> Finally, the 32-bit narrowing of row group sizes appears to be the most
> contentious aspect of the design. I suggest we discuss this live during
> our next Parquet sync. For the record, shrinking the offsets is the
> second most significant optimization for Flatbuffer footer size, with
> statistics being the first.
>
> See you all in the next sync.
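[Editor's note: for readers less familiar with the random-access point
Alkis raises above, the contrast can be shown with a toy sketch. This is
illustrative only — hypothetical helper names and a made-up byte layout,
not the actual Thrift, Flatbuffers, or Parquet wire formats:]

```python
# Toy model of the random-access distinction (not real Thrift/Flatbuffers):
# a stream of length-prefixed records forces a sequential walk, while an
# up-front offset table lets a reader jump straight to any record.
import struct

def encode_sequential(records: list[bytes]) -> bytes:
    """Thrift-style: length-prefixed records, one after another."""
    return b"".join(struct.pack("<I", len(r)) + r for r in records)

def read_sequential(buf: bytes, i: int) -> bytes:
    """Reaching record i requires skipping over records 0..i-1."""
    pos = 0
    for _ in range(i):
        (n,) = struct.unpack_from("<I", buf, pos)
        pos += 4 + n
    (n,) = struct.unpack_from("<I", buf, pos)
    return buf[pos + 4 : pos + 4 + n]

def encode_indexed(records: list[bytes]) -> bytes:
    """Flatbuffers-style: count + absolute offsets up front, payload after."""
    header = 4 + 4 * len(records)
    offsets, pos = [], header
    for r in records:
        offsets.append(pos)
        pos += len(r)
    return (struct.pack("<I", len(records))
            + b"".join(struct.pack("<I", o) for o in offsets)
            + b"".join(records))

def read_indexed(buf: bytes, i: int) -> bytes:
    """Record i is reachable in O(1), without touching the others."""
    (count,) = struct.unpack_from("<I", buf, 0)
    (start,) = struct.unpack_from("<I", buf, 4 + 4 * i)
    end = (struct.unpack_from("<I", buf, 4 + 4 * (i + 1))[0]
           if i + 1 < count else len(buf))
    return buf[start:end]

cols = [b"col-a-meta", b"col-b-meta", b"col-c-meta"]
assert read_sequential(encode_sequential(cols), 2) == b"col-c-meta"
assert read_indexed(encode_indexed(cols), 1) == b"col-b-meta"
```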
> On Wed, Sep 17, 2025 at 10:02 AM Antoine Pitrou <[email protected]> wrote:
>
> > Hi Andrew,
> >
> > I haven't heard of anything like this for C++, but it is an intriguing
> > idea.
> >
> > Regards
> >
> > Antoine.
> >
> > On Tue, 16 Sep 2025 16:44:14 -0400
> > Andrew Lamb <[email protected]> wrote:
> > > Has anyone spent time optimizing the thrift decoder (e.g. not just
> > > use whatever a general purpose thrift compiler generates, but
> > > custom code a parser just for Parquet metadata)?
> > >
> > > Ed is in the process of implementing just such a decoder in
> > > arrow-rs[1] and has seen a 2-3x performance improvement (with no
> > > change to the format) in early benchmark results. This is in line
> > > with our earlier work on the topic[2], where we estimated there is
> > > a 2-4x performance improvement from implementation improvements
> > > alone.
> > >
> > > Andrew
> > >
> > > [1]: https://github.com/apache/arrow-rs/issues/5854
> > > [2]: https://www.influxdata.com/blog/how-good-parquet-wide-tables/
> > >
> > > On Tue, Sep 16, 2025 at 4:12 AM Antoine Pitrou <[email protected]> wrote:
> > > >
> > > > Hi again,
> > > >
> > > > Ok, a quick summary of my current feedback on this:
> > > >
> > > > - decoding speed measurements are given, but not footer size
> > > >   measurements; it would be interesting to have both
> > > >
> > > > - it's not obvious whether the stated numbers are for reading all
> > > >   columns or a subset of them
> > > >
> > > > - optional LZ4 compression is mentioned, but no numbers are given
> > > >   for it; it would be nice if numbers were available for both
> > > >   uncompressed and compressed footers
> > > >
> > > > - the numbers seem quite underwhelming currently, I think most of
> > > >   us were expecting massive speed improvements given past
> > > >   discussions
> > > >
> > > > - I'm firmly against narrowing sizes to 32 bits; making the footer
> > > >   more compact is useful, but not to the point of reducing
> > > >   usefulness or generality
> > > >
> > > > A more general proposal: given the slightly underwhelming perf
> > > > numbers, has nested Flatbuffers been considered as an alternative?
> > > >
> > > > For example, the RowGroup table could become:
> > > > ```
> > > > table ColumnChunk {
> > > >   file_path: string;
> > > >   meta_data: ColumnMetadata;
> > > >   // etc.
> > > > }
> > > >
> > > > table EncodedColumnChunk {
> > > >   // Flatbuffers-encoded ColumnChunk, to be decoded/validated
> > > >   // individually
> > > >   column: [ubyte];
> > > > }
> > > >
> > > > table RowGroup {
> > > >   columns: [EncodedColumnChunk];
> > > >   total_byte_size: int;
> > > >   num_rows: int;
> > > >   sorting_columns: [SortingColumn];
> > > >   file_offset: long;
> > > >   total_compressed_size: int;
> > > >   ordinal: short = null;
> > > > }
> > > > ```
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >
> > > > On Thu, 11 Sep 2025 08:41:34 +0200
> > > > Alkis Evlogimenos <[email protected]> wrote:
> > > > > Hi all. I am sharing as a separate thread the proposal for the
> > > > > footer change we have been working on:
> > > > > https://docs.google.com/document/d/1kZS_DM_J8n6NKff3vDQPD1Y4xyDdRceYFANUE0bOfb0/edit
> > > > >
> > > > > The proposal outlines the technical aspects of the design and
> > > > > the experimental results of shadow testing this in production
> > > > > workloads. I would like to discuss the proposal's most salient
> > > > > points in the next sync:
> > > > > 1. the use of flatbuffers as footer serialization format
> > > > > 2. the additional limitations imposed on parquet files (row
> > > > >    group size limit, row group max num row limit)
> > > > >
> > > > > I would prefer comments on the google doc to facilitate async
> > > > > discussion.
> > > > >
> > > > > Thank you,
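[Editor's note: as a footnote to the custom-decoder subthread, Thrift's
compact protocol stores integers as zigzag-encoded LEB128-style varints,
so a hand-written parser largely reduces to a tight varint loop. A
minimal sketch of just that primitive — illustrative only, not the
arrow-rs implementation:]

```python
def read_uvarint(buf: bytes, pos: int) -> tuple[int, int]:
    """Decode one unsigned LEB128-style varint; return (value, next pos)."""
    shift = result = 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not b & 0x80:          # high bit clear => last byte of the varint
            return result, pos
        shift += 7

def zigzag_decode(n: int) -> int:
    """Map a zigzag-encoded unsigned value back to a signed integer."""
    return (n >> 1) ^ -(n & 1)

# -3 zigzag-encodes to 5, which fits in a single varint byte.
value, _ = read_uvarint(b"\x05", 0)
assert zigzag_decode(value) == -3
# 300 zigzag-encodes to 600, i.e. the two varint bytes 0xD8 0x04.
value, _ = read_uvarint(bytes([0xD8, 0x04]), 0)
assert zigzag_decode(value) == 300
```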
