Of course there's nothing to preclude adding just such an index to the current 
format.
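
For illustration only, here is a minimal sketch (in Rust, and entirely
hypothetical; every name below is made up and nothing like it exists in the
format or in any proposal) of one possible shape for such an index: record,
for each row group, the byte range of its serialized RowGroup struct inside
the existing Thrift footer, so that a reader can decode only the row groups
it actually needs.

    // Hypothetical side index over the existing Thrift footer.
    struct RowGroupIndexEntry {
        ordinal: u32,        // row group number
        footer_offset: u64,  // byte offset of the serialized RowGroup struct
        footer_length: u32,  // length of that struct in bytes
        num_rows: i64,       // enough to plan a scan without a full decode
    }

    struct RowGroupIndex {
        entries: Vec<RowGroupIndexEntry>,
    }

    impl RowGroupIndex {
        /// Byte range of the metadata for one row group, so a reader can
        /// decode just that RowGroup struct instead of the whole footer.
        fn byte_range(&self, ordinal: usize) -> Option<(u64, u32)> {
            self.entries
                .get(ordinal)
                .map(|e| (e.footer_offset, e.footer_length))
        }
    }

Stored as a separate file, this is roughly the kind of index PalletJack
(mentioned below) already provides; stored inside the file, it would be the
internal random-access index discussed below.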

On 2025/10/17 22:10:36 Corwin Joy wrote:
> For us, the exciting thing about the flatbuf footer approach is the
> potential for fast random access. For wide tables, the metadata becomes
> huge, and there is a lot of overhead to access a particular rowgroup. (See
> previous discussions, e.g., https://github.com/apache/arrow/issues/38149).
> Even if we can get a faster thrift parser, this is still limited, because
> you have to parse the entire metadata, which is inherently slow. Pulling
> information for a selected rowgroup is a lot faster.
> Right now, we have a workaround: we create an external index to get fast
> random access (https://github.com/G-Research/PalletJack). But having a
> fast internal random access index like the proposed flatbuf footer would be
> a big step forward.
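> 
> To make the access pattern concrete, here is a rough sketch assuming the
> Rust parquet crate's SerializedFileReader (the file name is just an
> example); other implementations behave similarly:
> 
>     use parquet::file::reader::{FileReader, SerializedFileReader};
>     use std::fs::File;
> 
>     fn main() -> Result<(), Box<dyn std::error::Error>> {
>         let file = File::open("wide_table.parquet")?;
>         // The whole Thrift footer is decoded right here, covering every
>         // row group and every column chunk in the file...
>         let reader = SerializedFileReader::new(file)?;
>         // ...even though we only look at a single row group afterwards.
>         let rg = reader.metadata().row_group(0);
>         println!("row group 0: {} rows, {} columns",
>                  rg.num_rows(), rg.num_columns());
>         Ok(())
>     }
> 
> An index, internal or external, lets a reader decode only the metadata for
> the row groups it actually touches.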
> 
> On Fri, Oct 17, 2025 at 8:50 AM Andrew Lamb <[email protected]> wrote:
> 
> > Thanks Alkis, that is interesting data.
> >
> > > We found that the reported numbers were not reproducible on AWS instances
> >
> > I just updated the benchmark results[1] to include results from an
> > AWS m6id.8xlarge instance (and they are indeed about 2x slower than when
> > run on my 2023 Mac laptop).
> >
> > > You can find the summary of our findings in a separate tab in the
> > > proposal document:
> >
> > Thank you, these are interesting. Can you share instructions on how to
> > reproduce the reported numbers? I am interested in reviewing the code used
> > to generate these results (especially the C++ thrift code).
> >
> > Thanks
> > Andrew
> >
> >
> > [1]:
> >
> > https://github.com/alamb/parquet_footer_parsing?tab=readme-ov-file#results-on-linux
> >
> >
> > On Fri, Oct 17, 2025 at 10:20 AM Alkis Evlogimenos
> > <[email protected]> wrote:
> >
> > > Thank you, Andrew, for open-sourcing the code so that we can reproduce
> > > it.
> > >
> > > We have run the Rust benchmarks, and we have also benchmarked our
> > > C++ thrift parser, the flatbuf footer with Thrift conversion, the
> > > flatbuf footer without Thrift conversion, and the flatbuf footer
> > > without Thrift conversion and without verification. You can find the
> > > summary of our findings in a separate tab in the proposal document:
> > >
> > >
> > https://docs.google.com/document/d/1kZS_DM_J8n6NKff3vDQPD1Y4xyDdRceYFANUE0bOfb0/edit?tab=t.ve65qknb3sq1#heading=h.3uwb5liauf1s
> > >
> > > The TL;DR is that the flatbuf footer is 5x faster, with the Thrift
> > > conversion included, than the optimized Thrift parsing. It also remains
> > > faster than the Thrift parser even when the Thrift parser skips
> > > statistics. Furthermore, if the Thrift conversion is skipped the speedup
> > > is 50x, and if verification is also skipped it goes beyond 150x.
> > >
> > >
> > > On Tue, Sep 30, 2025 at 5:56 PM Andrew Lamb <[email protected]>
> > > wrote:
> > >
> > > > Hello,
> > > >
> > > > I did some benchmarking for the new parser[2] we are working on in
> > > > arrow-rs.
> > > >
> > > > This benchmark achieves nearly an order of magnitude improvement (7x)
> > > > parsing Parquet metadata with no changes to the Parquet format, by
> > > > simply writing a more efficient thrift decoder (which can also skip
> > > > statistics).
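> > > >
> > > > To make that concrete: in the Thrift compact protocol the size of every
> > > > field can be recovered from its type tag alone, so a hand-rolled decoder
> > > > can jump over fields it does not want (such as Statistics) without
> > > > materializing them. A rough sketch of the general technique (not the
> > > > arrow-rs code; maps and error handling omitted):
> > > >
> > > >     /// Read an unsigned LEB128 varint, returning (value, new position).
> > > >     fn read_varint(buf: &[u8], mut pos: usize) -> (u64, usize) {
> > > >         let (mut val, mut shift) = (0u64, 0);
> > > >         loop {
> > > >             let b = buf[pos];
> > > >             pos += 1;
> > > >             val |= u64::from(b & 0x7f) << shift;
> > > >             if b & 0x80 == 0 {
> > > >                 return (val, pos);
> > > >             }
> > > >             shift += 7;
> > > >         }
> > > >     }
> > > >
> > > >     /// Skip one value of the given compact-protocol type and return
> > > >     /// the position just past it.
> > > >     fn skip_value(buf: &[u8], pos: usize, ctype: u8) -> usize {
> > > >         match ctype {
> > > >             1 | 2 => pos,                     // bool: value is in the type nibble
> > > >             3 => pos + 1,                     // i8: one byte
> > > >             4..=6 => read_varint(buf, pos).1, // i16/i32/i64: zigzag varint
> > > >             7 => pos + 8,                     // double: fixed 8 bytes
> > > >             8 => {                            // binary/string: varint length + bytes
> > > >                 let (len, p) = read_varint(buf, pos);
> > > >                 p + len as usize
> > > >             }
> > > >             9 | 10 => {                       // list/set: header byte, then elements
> > > >                 let header = buf[pos];
> > > >                 let elem = header & 0x0f;
> > > >                 let (size, mut p) = if header >> 4 == 0xf {
> > > >                     read_varint(buf, pos + 1) // long form: size as varint
> > > >                 } else {
> > > >                     (u64::from(header >> 4), pos + 1)
> > > >                 };
> > > >                 for _ in 0..size {
> > > >                     // inside a list, bools take one byte each
> > > >                     p = if elem == 1 || elem == 2 {
> > > >                         p + 1
> > > >                     } else {
> > > >                         skip_value(buf, p, elem)
> > > >                     };
> > > >                 }
> > > >                 p
> > > >             }
> > > >             12 => {                           // struct: fields until STOP
> > > >                 let mut p = pos;
> > > >                 loop {
> > > >                     let field = buf[p];
> > > >                     p += 1;
> > > >                     if field == 0 {
> > > >                         return p;             // STOP byte
> > > >                     }
> > > >                     if field >> 4 == 0 {
> > > >                         p = read_varint(buf, p).1; // long-form field id
> > > >                     }
> > > >                     p = skip_value(buf, p, field & 0x0f);
> > > >                 }
> > > >             }
> > > >             other => panic!("type {other} not handled in this sketch"),
> > > >         }
> > > >     }
> > > >
> > > > Skipping an entire nested struct this way is what makes "parse the footer
> > > > but ignore the statistics" cheap.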
> > > >
> > > > While we have not implemented a similar decoder in other languages such
> > > > as C/C++ or Java, given the similarities in the existing thrift libraries
> > > > and their usage, we expect similar improvements are possible in those
> > > > languages as well.
> > > >
> > > > [Inline benchmark result images omitted.]
> > > >
> > > > You can find full details here [1]
> > > >
> > > > Andrew
> > > >
> > > >
> > > > [1]: https://github.com/alamb/parquet_footer_parsing
> > > > [2]: https://github.com/apache/arrow-rs/issues/5854
> > > >
> > > >
> > > > On Wed, Sep 24, 2025 at 5:59 PM Ed Seidl <[email protected]> wrote:
> > > >
> > > >> > Concerning Thrift optimization, while a 2-3x improvement might be
> > > >> > achievable, Flatbuffers are currently demonstrating a 10x improvement.
> > > >> > Andrew, do you have a more precise estimate for the speedup we could
> > > >> > expect in C++?
> > > >>
> > > >> Given my past experience on cuDF, I'd estimate about 2X there as well.
> > > >> cuDF has its own metadata parser that I once benchmarked against the
> > > >> thrift-generated parser.
> > > >>
> > > >> And I'd point out that beyond the initial 2X improvement, rolling your
> > > >> own parser frees you from having to parse out every structure in the
> > > >> metadata.
> > > >>
> > > >
> > >
> >
> 
