We have to deal with very wide files (up to a million columns). The approach we
took is very similar to flatbuffer metadata + skiplist. Seeing this happening in
Parquet opens up interesting possibilities.

This is explained in this blog post:

Husky: Efficient compaction at Datadog scale | Datadog
https://www.datadoghq.com/blog/engineering/husky-storage-compaction/
https://imgix.datadoghq.com/img/blog/engineering/husky-storage-compaction/compaction_static_diagram_2_rev.png
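A minimal sketch of that idea, with hypothetical types and layout (a sorted
offset table stands in for the skiplist; this is not the Husky or Parquet
format): the footer records where each column's metadata block lives, so a
reader that touches a handful of a million columns seeks to and decodes only
those entries instead of the whole footer.

/// Byte ranges of serialized per-column metadata, sorted by column name.
struct ColumnIndex {
    entries: Vec<(String, u64, u32)>, // (column name, byte offset, byte length)
}

impl ColumnIndex {
    /// Locate one column's metadata block without scanning the rest.
    fn locate(&self, column: &str) -> Option<(u64, u32)> {
        self.entries
            .binary_search_by(|(name, _, _)| name.as_str().cmp(column))
            .ok()
            .map(|i| (self.entries[i].1, self.entries[i].2))
    }
}

fn main() {
    let index = ColumnIndex {
        entries: vec![
            ("a".into(), 0, 128),
            ("b".into(), 128, 96),
            ("c".into(), 224, 160),
        ],
    };
    // Only the bytes for column "b" would then be fetched and decoded.
    assert_eq!(index.locate("b"), Some((128, 96)));
}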
On Sat, Oct 18, 2025, 9:58 PM Ed Seidl <[email protected]> wrote:

> Of course there's nothing to preclude adding just such an index to the
> current format.
>
> On 2025/10/17 22:10:36 Corwin Joy wrote:
> > For us, the exciting thing about the flatbuf footer approach is the
> > potential for fast random access. For wide tables, the metadata becomes
> > huge, and there is a lot of overhead to access a particular row group
> > (see previous discussions, e.g.,
> > https://github.com/apache/arrow/issues/38149). Even if we can get a
> > faster thrift parser, this is still limited, because you have to parse
> > the entire metadata, which is inherently slow. Pulling information for
> > a selected row group is a lot faster.
> >
> > Right now, we have a workaround: we create an external index to get
> > fast random access (https://github.com/G-Research/PalletJack). But
> > having a fast internal random-access index like the proposed flatbuf
> > footer would be a big step forward.
> >
> > On Fri, Oct 17, 2025 at 8:50 AM Andrew Lamb <[email protected]> wrote:
> >
> > > Thanks Alkis, that is interesting data.
> > >
> > > > We found that the reported numbers were not reproducible on AWS
> > > > instances
> > >
> > > I just updated the benchmark results[1] to include results from an
> > > AWS m6id.8xlarge instance (and they are indeed about 2x slower than
> > > when run on my 2023 Mac laptop).
> > >
> > > > You can find the summary of our findings in a separate tab in the
> > > > proposal document:
> > >
> > > Thank you, these are interesting. Can you share instructions on how
> > > to reproduce the reported numbers? I am interested to review the code
> > > used to generate these results (especially the C++ thrift code).
> > >
> > > Thanks
> > > Andrew
> > >
> > > [1]:
> > > https://github.com/alamb/parquet_footer_parsing?tab=readme-ov-file#results-on-linux
> > >
> > > On Fri, Oct 17, 2025 at 10:20 AM Alkis Evlogimenos
> > > <[email protected]> wrote:
> > >
> > > > Thank you Andrew for putting the code in open source so that we can
> > > > reproduce it.
> > > >
> > > > We have run the Rust benchmarks, and also run the flatbuf proposal
> > > > with our C++ thrift parser, the flatbuf footer with Thrift
> > > > conversion, the flatbuf footer without Thrift conversion, and the
> > > > flatbuf footer without Thrift conversion and without verification.
> > > > You can find the summary of our findings in a separate tab in the
> > > > proposal document:
> > > >
> > > > https://docs.google.com/document/d/1kZS_DM_J8n6NKff3vDQPD1Y4xyDdRceYFANUE0bOfb0/edit?tab=t.ve65qknb3sq1#heading=h.3uwb5liauf1s
> > > >
> > > > The TL;DR is that flatbuf with the Thrift conversion is 5x faster
> > > > than the optimized Thrift parsing. It also remains faster than the
> > > > Thrift parser even if the Thrift parser skips statistics.
> > > > Furthermore, if the Thrift conversion is skipped, the speedup is
> > > > 50x, and if verification is skipped as well, it goes beyond 150x.
> > > >
> > > > On Tue, Sep 30, 2025 at 5:56 PM Andrew Lamb <[email protected]>
> > > > wrote:
> > > >
> > > > > Hello,
> > > > >
> > > > > I did some benchmarking for the new parser[2] we are working on
> > > > > in arrow-rs.
> > > > > This benchmark achieves nearly an order of magnitude improvement
> > > > > (7x) parsing Parquet metadata with no changes to the Parquet
> > > > > format, simply by writing a more efficient thrift decoder (which
> > > > > can also skip statistics).
> > > > >
> > > > > While we have not implemented a similar decoder in other
> > > > > languages such as C/C++ or Java, given the similarities in the
> > > > > existing thrift libraries and usage, we expect similar
> > > > > improvements are possible in those languages as well.
> > > > >
> > > > > Here are some inline images:
> > > > > [image: image.png]
> > > > > [image: image.png]
> > > > >
> > > > > You can find full details here [1].
> > > > >
> > > > > Andrew
> > > > >
> > > > > [1]: https://github.com/alamb/parquet_footer_parsing
> > > > > [2]: https://github.com/apache/arrow-rs/issues/5854
> > > > >
> > > > > On Wed, Sep 24, 2025 at 5:59 PM Ed Seidl <[email protected]> wrote:
> > > > >
> > > > >> > Concerning Thrift optimization, while a 2-3x improvement might
> > > > >> > be achievable, Flatbuffers are currently demonstrating a 10x
> > > > >> > improvement. Andrew, do you have a more precise estimate for
> > > > >> > the speedup we could expect in C++?
> > > > >>
> > > > >> Given my past experience on cuDF, I'd estimate about 2x there as
> > > > >> well. cuDF has its own metadata parser that I once benchmarked
> > > > >> against the thrift-generated parser.
> > > > >>
> > > > >> And I'd point out that beyond the initial 2x improvement,
> > > > >> rolling your own parser frees you from having to parse out every
> > > > >> structure in the metadata.
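To make the "hand-rolled decoder that can skip statistics" idea from the
messages above concrete, here is a minimal Rust sketch of skip logic for
Thrift's compact protocol. It is illustrative only, not the arrow-rs or cuDF
code; the Cursor type and the panics are assumptions of the sketch, while the
byte layouts follow the Thrift compact protocol.

// Sketch of Thrift compact-protocol "skip" logic; Cursor is a hypothetical helper.
struct Cursor<'a> {
    buf: &'a [u8],
    pos: usize,
}

impl<'a> Cursor<'a> {
    fn byte(&mut self) -> u8 {
        let b = self.buf[self.pos];
        self.pos += 1;
        b
    }

    // Unsigned LEB128 varint, used by the compact protocol for lengths, sizes,
    // and zigzag-encoded integers. For skipping, only the byte count matters.
    fn varint(&mut self) -> u64 {
        let (mut v, mut shift) = (0u64, 0);
        loop {
            let b = self.byte();
            v |= ((b & 0x7f) as u64) << shift;
            if b & 0x80 == 0 {
                return v;
            }
            shift += 7;
        }
    }

    fn skip_bytes(&mut self, n: usize) {
        self.pos += n;
    }

    // Skip one field value of the given compact-protocol type id.
    fn skip_value(&mut self, ty: u8) {
        match ty {
            1 | 2 => {}                     // bool: value lives in the field header
            3 => { self.byte(); }           // i8
            4 | 5 | 6 => { self.varint(); } // i16 / i32 / i64 (zigzag varints)
            7 => self.skip_bytes(8),        // double
            8 => {                          // binary / string: varint length + bytes
                let len = self.varint() as usize;
                self.skip_bytes(len);
            }
            9 | 10 => {                     // list / set
                let header = self.byte();
                let elem_ty = header & 0x0f;
                let mut size = (header >> 4) as u64;
                if size == 15 {
                    size = self.varint();
                }
                for _ in 0..size {
                    self.skip_elem(elem_ty);
                }
            }
            11 => {                         // map: varint size, then key/value types
                let size = self.varint();
                if size > 0 {
                    let kv = self.byte();
                    for _ in 0..size {
                        self.skip_elem(kv >> 4);
                        self.skip_elem(kv & 0x0f);
                    }
                }
            }
            12 => self.skip_struct(),       // nested struct
            other => panic!("unexpected compact type {other}"),
        }
    }

    // Collection elements: bools occupy one byte each; everything else is
    // encoded the same way as at field level.
    fn skip_elem(&mut self, ty: u8) {
        if ty == 1 || ty == 2 { self.byte(); } else { self.skip_value(ty); }
    }

    // Skip an entire struct: read field headers until the STOP byte (0x00).
    fn skip_struct(&mut self) {
        loop {
            let header = self.byte();
            if header == 0 {
                return; // STOP
            }
            if header >> 4 == 0 {
                self.varint(); // long-form (zigzag) field id
            }
            self.skip_value(header & 0x0f);
        }
    }
}

fn main() {
    // A tiny hand-built struct: field 1 (i32 = 42), then STOP.
    let bytes = [0x15, 0x54, 0x00];
    let mut c = Cursor { buf: &bytes, pos: 0 };
    c.skip_struct();
    assert_eq!(c.pos, bytes.len());
}

A ColumnMetaData decode loop built on something like this can call
skip_struct() when it reaches the statistics field, rather than materializing
a Statistics value, which is the kind of per-field work a hand-rolled decoder
can avoid.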
