Hi Andrew,
I haven't heard of anything like this for C++, but it is an intriguing idea.

Regards

Antoine.

On Tue, 16 Sep 2025 16:44:14 -0400
Andrew Lamb <[email protected]> wrote:
> Has anyone spent time optimizing the thrift decoder (e.g. not just using
> whatever a general-purpose thrift compiler generates, but custom-coding a
> parser just for Parquet metadata)?
>
> Ed is in the process of implementing just such a decoder in arrow-rs[1] and
> has seen a 2-3x performance improvement (with no change to the format) in
> early benchmark results. This is in line with our earlier work on the
> topic[2], where we estimated there is a 2-4x performance improvement with
> implementation improvements alone.
>
> Andrew
>
> [1]: https://github.com/apache/arrow-rs/issues/5854
> [2]: https://www.influxdata.com/blog/how-good-parquet-wide-tables/
>
> On Tue, Sep 16, 2025 at 4:12 AM Antoine Pitrou
> <[email protected]> wrote:
>
> >
> > Hi again,
> >
> > OK, a quick summary of my current feedback on this:
> >
> > - decoding speed measurements are given, but not footer size
> >   measurements; it would be interesting to have both
> >
> > - it's not obvious whether the stated numbers are for reading all
> >   columns or a subset of them
> >
> > - optional LZ4 compression is mentioned, but no numbers are given for
> >   it; it would be nice if numbers were available for both uncompressed
> >   and compressed footers
> >
> > - the numbers seem quite underwhelming currently; I think most of us
> >   were expecting massive speed improvements given past discussions
> >
> > - I'm firmly against narrowing sizes to 32 bits; making the footer more
> >   compact is useful, but not to the point of reducing usefulness or
> >   generality
> >
> >
> > A more general proposal: given the slightly underwhelming perf
> > numbers, has nested Flatbuffers been considered as an alternative?
> >
> > For example, the RowGroup table could become:
> > ```
> > table ColumnChunk {
> >   file_path: string;
> >   meta_data: ColumnMetadata;
> >   // etc.
> > }
> >
> > struct EncodedColumnChunk {
> >   // Flatbuffers-encoded ColumnChunk, to be decoded/validated individually
> >   column: [ubyte];
> > }
> >
> > table RowGroup {
> >   columns: [EncodedColumnChunk];
> >   total_byte_size: int;
> >   num_rows: int;
> >   sorting_columns: [SortingColumn];
> >   file_offset: long;
> >   total_compressed_size: int;
> >   ordinal: short = null;
> > }
> > ```
> >
> > Regards
> >
> > Antoine.
> >
> >
> >
> > On Thu, 11 Sep 2025 08:41:34 +0200
> > Alkis Evlogimenos <[email protected]> wrote:
> > > Hi all. I am sharing as a separate thread the proposal for the footer
> > > change we have been working on:
> > > https://docs.google.com/document/d/1kZS_DM_J8n6NKff3vDQPD1Y4xyDdRceYFANUE0bOfb0/edit
> > >
> > > The proposal outlines the technical aspects of the design and the
> > > experimental results of shadow-testing this in production workloads. I
> > > would like to discuss the proposal's most salient points in the next sync:
> > > 1. the use of Flatbuffers as the footer serialization format
> > > 2. the additional limitations imposed on parquet files (row group size
> > >    limit, row group max num rows limit)
> > >
> > > I would prefer comments on the google doc to facilitate async discussion.
> > >
> > > Thank you,
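[Editor's note: for readers unfamiliar with what "custom-coding a thrift parser" involves, the core of the Thrift compact protocol that Parquet footers use is ULEB128 varints and zigzag-encoded integers. The sketch below shows just those two primitives in Rust; the helper names are hypothetical and this is not the arrow-rs implementation referenced above.]

```rust
// Minimal sketch of the two integer primitives a hand-rolled Thrift
// compact-protocol decoder needs. Hypothetical helper names, for
// illustration only; not the arrow-rs decoder discussed in the thread.

/// Read a ULEB128 varint from `buf`, advancing `pos` past the bytes read.
fn read_varint(buf: &[u8], pos: &mut usize) -> u64 {
    let mut result: u64 = 0;
    let mut shift = 0;
    loop {
        let byte = buf[*pos];
        *pos += 1;
        // Low 7 bits carry payload; high bit signals continuation.
        result |= u64::from(byte & 0x7f) << shift;
        if byte & 0x80 == 0 {
            return result;
        }
        shift += 7;
    }
}

/// Undo zigzag encoding: 0 -> 0, 1 -> -1, 2 -> 1, 3 -> -2, ...
fn zigzag_decode(n: u64) -> i64 {
    ((n >> 1) as i64) ^ -((n & 1) as i64)
}

fn main() {
    // 300 encodes as [0xAC, 0x02] in ULEB128.
    let buf = [0xacu8, 0x02];
    let mut pos = 0;
    let v = read_varint(&buf, &mut pos);
    assert_eq!(v, 300);
    assert_eq!(pos, 2);
    assert_eq!(zigzag_decode(3), -2);
    println!("varint={} zigzag(3)={}", v, zigzag_decode(3));
}
```

A custom decoder built on these primitives can skip fields it does not need (e.g. statistics for unprojected columns) instead of materializing every struct, which is one source of the speedups reported above.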
