Breaking this off into its own thread.

In case anyone is interested, I just published a blog post[1] about the new
metadata decoder we released for the Rust implementation of Parquet. It
explains the background, the results we achieved, and how the decoder works.

Andrew

[1]: https://arrow.apache.org/blog/2025/10/23/rust-parquet-metadata/



On Wed, Sep 17, 2025 at 4:02 AM Antoine Pitrou <[email protected]> wrote:

>
> Hi Andrew,
>
> I haven't heard of anything like this for C++, but it is an intriguing
> idea.
>
> Regards
>
> Antoine.
>
>
> On Tue, 16 Sep 2025 16:44:14 -0400
> Andrew Lamb <[email protected]>
> wrote:
> > Has anyone spent time optimizing the Thrift decoder (e.g. not just using
> > whatever a general-purpose Thrift compiler generates, but custom-coding a
> > parser just for Parquet metadata)?
> >
> > Ed is in the process of implementing just such a decoder in arrow-rs[1]
> > and has seen a 2-3x performance improvement (with no change to the
> > format) in early benchmark results. This is in line with our earlier work
> > on the topic[2], where we estimated a 2-4x performance improvement from
> > implementation improvements alone.
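> >
> > To make concrete what custom-coding means here, below is a minimal
> > sketch (hypothetical, not the actual arrow-rs code) of decoding Thrift
> > compact-protocol primitives directly from a byte slice; loops like
> > these are the building blocks a hand-written Parquet metadata parser
> > is made of:
> >
> > ```
> > /// Read an unsigned LEB128 varint, the basic unit of the Thrift
> > /// compact protocol. Returns None on truncated input.
> > fn read_varint(buf: &[u8], pos: &mut usize) -> Option<u64> {
> >     let mut result = 0u64;
> >     let mut shift = 0;
> >     loop {
> >         let byte = *buf.get(*pos)?;
> >         *pos += 1;
> >         result |= u64::from(byte & 0x7F) << shift;
> >         if byte & 0x80 == 0 {
> >             return Some(result);
> >         }
> >         shift += 7;
> >     }
> > }
> >
> > /// Zigzag-decode a varint into a signed integer (compact-protocol
> > /// i16/i32/i64 values are all encoded this way).
> > fn read_zigzag(buf: &[u8], pos: &mut usize) -> Option<i64> {
> >     let v = read_varint(buf, pos)?;
> >     Some(((v >> 1) as i64) ^ -((v & 1) as i64))
> > }
> >
> > /// Read a compact-protocol field header: the high nibble is a delta
> > /// from the previous field id, the low nibble is the wire type, and
> > /// a zero byte marks the end of the enclosing struct.
> > fn read_field_header(
> >     buf: &[u8],
> >     pos: &mut usize,
> >     last_id: i16,
> > ) -> Option<(i16, u8)> {
> >     let byte = *buf.get(*pos)?;
> >     *pos += 1;
> >     if byte == 0 {
> >         return None; // STOP byte: end of struct
> >     }
> >     let delta = (byte >> 4) as i16;
> >     let wire_type = byte & 0x0F;
> >     let id = if delta != 0 {
> >         last_id + delta
> >     } else {
> >         // long form: the field id follows as a zigzag varint
> >         read_zigzag(buf, pos)? as i16
> >     };
> >     Some((id, wire_type))
> > }
> > ```
> >
> > A parser built this way knows the Parquet schema up front, so it can
> > match on field ids directly and cheaply skip fields it does not need,
> > rather than going through the generic dispatch a Thrift compiler
> > generates.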
> >
> > Andrew
> >
> > [1]: https://github.com/apache/arrow-rs/issues/5854
> > [2]: https://www.influxdata.com/blog/how-good-parquet-wide-tables/
> >
> > On Tue, Sep 16, 2025 at 4:12 AM Antoine Pitrou <[email protected]>
> > wrote:
> >
> > >
> > > Hi again,
> > >
> > > Ok, a quick summary of my current feedback on this:
> > >
> > > - decoding speed measurements are given, but not footer size
> > >   measurements; it would be interesting to have both
> > >
> > > - it's not obvious whether the stated numbers are for reading all
> > >   columns or a subset of them
> > >
> > > - optional LZ4 compression is mentioned, but no numbers are given for
> > >   it; it would be nice if numbers were available for both uncompressed
> > >   and compressed footers
> > >
> > > - the numbers seem quite underwhelming currently; I think most of us
> > >   were expecting massive speed improvements given past discussions
> > >
> > > - I'm firmly against narrowing sizes to 32 bits; making the footer more
> > >   compact is useful, but not to the point of reducing usefulness or
> > >   generality
> > >
> > >
> > > A more general proposal: given the slightly underwhelming perf
> > > numbers, has nested Flatbuffers been considered as an alternative?
> > >
> > > For example, the RowGroup table could become:
> > > ```
> > > table ColumnChunk {
> > >   file_path: string;
> > >   meta_data: ColumnMetadata;
> > >   // etc.
> > > }
> > >
> > > // Note: this must be a table rather than a struct, since
> > > // Flatbuffers structs cannot contain vector fields.
> > > table EncodedColumnChunk {
> > >   // Flatbuffers-encoded ColumnChunk, to be decoded/validated
> > >   // individually
> > >   column: [ubyte];
> > > }
> > >
> > > table RowGroup {
> > >   columns: [EncodedColumnChunk];
> > >   total_byte_size: long;
> > >   num_rows: long;
> > >   sorting_columns: [SortingColumn];
> > >   file_offset: long;
> > >   total_compressed_size: long;
> > >   ordinal: short = null;
> > > }
> > > ```
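> > >
> > > The point of nesting raw EncodedColumnChunk bytes is that a reader
> > > can verify and decode only the column chunks it actually projects.
> > > As a hedged sketch of the reader side (the RowGroup and ColumnChunk
> > > types stand in for whatever flatc would generate from the schema
> > > above; flatbuffers::root and InvalidFlatbuffer are the real Rust
> > > crate APIs):
> > >
> > > ```
> > > use flatbuffers::{root, InvalidFlatbuffer};
> > >
> > > // RowGroup<'a> / ColumnChunk<'a> are hypothetical flatc-generated
> > > // accessors for the tables sketched above.
> > > fn project_columns<'a>(
> > >     row_group: RowGroup<'a>,
> > >     wanted: &[usize],
> > > ) -> Result<Vec<ColumnChunk<'a>>, InvalidFlatbuffer> {
> > >     let columns = row_group.columns().expect("columns vector");
> > >     wanted
> > >         .iter()
> > >         .map(|&i| {
> > >             // Only the selected chunks are verified and decoded;
> > >             // the bytes of unprojected columns are never touched.
> > >             let bytes = columns.get(i).column().expect("column bytes");
> > >             root::<ColumnChunk>(bytes.bytes())
> > >         })
> > >         .collect()
> > > }
> > > ```
> > >
> > > The trade-off is an extra indirection and a small per-chunk buffer
> > > header, in exchange for paying validation cost only for the columns
> > > read instead of for the whole footer.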
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > >
> > > On Thu, 11 Sep 2025 08:41:34 +0200
> > > Alkis Evlogimenos
> > > <[email protected]>
> > > wrote:
> > > > Hi all. I am sharing as a separate thread the proposal for the footer
> > > > change we have been working on:
> > > > https://docs.google.com/document/d/1kZS_DM_J8n6NKff3vDQPD1Y4xyDdRceYFANUE0bOfb0/edit
> > > >
> > > > The proposal outlines the technical aspects of the design and the
> > > > experimental results of shadow testing this in production workloads.
> > > > I would like to discuss the proposal's most salient points in the
> > > > next sync:
> > > > 1. the use of flatbuffers as the footer serialization format
> > > > 2. the additional limitations imposed on parquet files (row group
> > > >    size limit, row group max num rows limit)
> > > >
> > > > I would prefer comments on the Google doc to facilitate async
> > > > discussion.
> > > >
> > > > Thank you,
> > > >
