Breaking this off into its own thread. In case anyone is interested, I just
published a blog post[1] about the new metadata decoder we released for the
Rust implementation of Parquet; it explains the background, the results we
achieved, and how the decoder works.
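For anyone curious what "hand-rolling" the decoder means in practice: Parquet
metadata is encoded with the Thrift compact protocol, so the inner loop of any
decoder is varint and field-header decoding. Below is a minimal, illustrative
Rust sketch of those primitives -- this is not the arrow-rs code, just the
general shape of the technique:

```
/// Read an unsigned LEB128 varint, returning (value, bytes consumed).
fn read_varint(buf: &[u8]) -> Option<(u64, usize)> {
    let mut result: u64 = 0;
    for (i, &byte) in buf.iter().enumerate().take(10) {
        result |= u64::from(byte & 0x7f) << (7 * i);
        if byte & 0x80 == 0 {
            return Some((result, i + 1));
        }
    }
    None // truncated or overlong varint
}

/// Undo zigzag encoding (used for Thrift i16/i32/i64 values).
fn zigzag_decode(v: u64) -> i64 {
    ((v >> 1) as i64) ^ -((v & 1) as i64)
}

/// Decode a compact-protocol field header. The high nibble is a delta from
/// the previous field id (0 means the id follows as a zigzag varint); the
/// low nibble is the wire type. Returns (field id, wire type, bytes used),
/// or (0, 0, 1) for the STOP byte that terminates a struct.
fn read_field_header(buf: &[u8], last_field_id: i16) -> Option<(i16, u8, usize)> {
    let first = *buf.first()?;
    if first == 0 {
        return Some((0, 0, 1)); // STOP
    }
    let wire_type = first & 0x0f;
    let delta = (first >> 4) as i16;
    if delta != 0 {
        Some((last_field_id + delta, wire_type, 1))
    } else {
        let (raw, n) = read_varint(buf.get(1..)?)?;
        Some((zigzag_decode(raw) as i16, wire_type, 1 + n))
    }
}

fn main() {
    // 0x15 = field-id delta 1, wire type 5 (i32); 0x54 = zigzag varint for 42
    let buf = [0x15u8, 0x54];
    let (id, wtype, used) = read_field_header(&buf, 0).unwrap();
    let (raw, _) = read_varint(&buf[used..]).unwrap();
    println!("field {id}, type {wtype}, value {}", zigzag_decode(raw));
}
```

A specialized decoder wins by skipping the generated-code indirection around
these primitives and decoding straight into the structs the reader needs.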
Andrew

[1]: https://arrow.apache.org/blog/2025/10/23/rust-parquet-metadata/

On Wed, Sep 17, 2025 at 4:02 AM Antoine Pitrou <[email protected]> wrote:
>
> Hi Andrew,
>
> I haven't heard of anything like this for C++, but it is an intriguing
> idea.
>
> Regards
>
> Antoine.
>
>
> On Tue, 16 Sep 2025 16:44:14 -0400
> Andrew Lamb <[email protected]> wrote:
> > Has anyone spent time optimizing the Thrift decoder (e.g., not just
> > using whatever a general-purpose Thrift compiler generates, but
> > hand-writing a parser just for Parquet metadata)?
> >
> > Ed is in the process of implementing just such a decoder in
> > arrow-rs [1] and has seen a 2-3x performance improvement (with no
> > change to the format) in early benchmark results. This is in line
> > with our earlier work on the topic [2], where we estimated a 2-4x
> > performance improvement from implementation improvements alone.
> >
> > Andrew
> >
> > [1]: https://github.com/apache/arrow-rs/issues/5854
> > [2]: https://www.influxdata.com/blog/how-good-parquet-wide-tables/
> >
> > On Tue, Sep 16, 2025 at 4:12 AM Antoine Pitrou <[email protected]> wrote:
> > >
> > > Hi again,
> > >
> > > OK, a quick summary of my current feedback on this:
> > >
> > > - decoding speed measurements are given, but not footer size
> > >   measurements; it would be interesting to have both
> > >
> > > - it's not obvious whether the stated numbers are for reading all
> > >   columns or a subset of them
> > >
> > > - optional LZ4 compression is mentioned, but no numbers are given
> > >   for it; it would be nice if numbers were available for both
> > >   uncompressed and compressed footers
> > >
> > > - the numbers seem quite underwhelming currently; I think most of us
> > >   were expecting massive speed improvements given past discussions
> > >
> > > - I'm firmly against narrowing sizes to 32 bits; making the footer
> > >   more compact is useful, but not to the point of reducing
> > >   usefulness or generality
> > >
> > >
> > > A more general proposal: given the slightly underwhelming perf
> > > numbers, has nested Flatbuffers been considered as an alternative?
> > >
> > > For example, the RowGroup table could become:
> > > ```
> > > table ColumnChunk {
> > >   file_path: string;
> > >   meta_data: ColumnMetadata;
> > >   // etc.
> > > }
> > >
> > > table EncodedColumnChunk {
> > >   // Flatbuffers-encoded ColumnChunk, to be decoded/validated
> > >   // individually
> > >   column: [ubyte];
> > > }
> > >
> > > table RowGroup {
> > >   columns: [EncodedColumnChunk];
> > >   total_byte_size: int;
> > >   num_rows: int;
> > >   sorting_columns: [SortingColumn];
> > >   file_offset: long;
> > >   total_compressed_size: int;
> > >   ordinal: short = null;
> > > }
> > > ```
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > > On Thu, 11 Sep 2025 08:41:34 +0200
> > > Alkis Evlogimenos <[email protected]> wrote:
> > > > Hi all. I am sharing as a separate thread the proposal for the
> > > > footer change we have been working on:
> > > >
> > > > https://docs.google.com/document/d/1kZS_DM_J8n6NKff3vDQPD1Y4xyDdRceYFANUE0bOfb0/edit
> > > >
> > > > The proposal outlines the technical aspects of the design and the
> > > > experimental results of shadow-testing it in production workloads.
> > > > I would like to discuss the proposal's most salient points in the
> > > > next sync:
> > > > 1. the use of Flatbuffers as the footer serialization format
> > > > 2. the additional limitations imposed on Parquet files (row group
> > > >    size limit, row group max num rows limit)
> > > >
> > > > I would prefer comments on the Google doc to facilitate async
> > > > discussion.
> > > >
> > > > Thank you,
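
To make the intent of Antoine's nested-Flatbuffers sketch above concrete: each
column chunk's metadata becomes an opaque byte vector, so a reader projecting
a subset of columns only decodes (and validates) those buffers. A rough Rust
illustration of that access pattern follows; `ColumnChunkMeta` and
`decode_column_chunk` are hypothetical stand-ins for whatever the generated
Flatbuffers accessors would actually provide:

```
/// Placeholder for the decoded per-column metadata (hypothetical).
struct ColumnChunkMeta {
    ordinal: usize,
}

/// Hypothetical stand-in for decoding one Flatbuffers-encoded
/// EncodedColumnChunk; a real reader would also verify the buffer here.
fn decode_column_chunk(ordinal: usize, _bytes: &[u8]) -> ColumnChunkMeta {
    ColumnChunkMeta { ordinal }
}

/// Decode metadata only for the projected column ordinals; all other
/// per-column buffers stay untouched as raw bytes.
fn project_columns(
    encoded_chunks: &[Vec<u8>], // the `columns: [EncodedColumnChunk]` vector
    projection: &[usize],       // column ordinals the query actually reads
) -> Vec<ColumnChunkMeta> {
    projection
        .iter()
        .filter_map(|&i| encoded_chunks.get(i).map(|b| decode_column_chunk(i, b)))
        .collect()
}

fn main() {
    // A footer with four columns; the query reads only columns 0 and 2.
    let encoded: Vec<Vec<u8>> = vec![vec![0u8; 8]; 4];
    let metas = project_columns(&encoded, &[0, 2]);
    println!("decoded {} of {} column chunks", metas.len(), encoded.len());
    assert_eq!(metas[1].ordinal, 2);
}
```

This would address the "all columns or a subset" question above directly: the
per-column decode cost scales with the projection rather than the schema width.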
