We have to deal with very wide files (up to a million columns).

The approach we took is very similar to flatbuffer metadata + skiplist

Seeing this happen in Parquet opens up interesting possibilities.

This is explained in this blog:

Husky: Efficient compaction at Datadog scale | Datadog
https://www.datadoghq.com/blog/engineering/husky-storage-compaction/

https://imgix.datadoghq.com/img/blog/engineering/husky-storage-compaction/compaction_static_diagram_2_rev.png
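
To give a rough idea of what such an index buys you, here is a minimal sketch (the struct, names, and layout below are made up for illustration; this is not the actual Husky layout or the proposed Parquet flatbuf footer): a fixed-width per-row-group offset table lets a reader jump straight to the metadata of one row group and decode only that slice, instead of parsing the whole footer.

    // Minimal sketch only: a fixed-width index entry per row group, so that the
    // metadata for row group i can be located with O(1) arithmetic and decoded
    // on its own. Names and layout are hypothetical.
    #[derive(Clone, Copy, Debug)]
    struct IndexEntry {
        offset: u64, // byte offset of the encoded row-group metadata in the footer
        length: u32, // encoded length in bytes
    }

    /// Return the raw metadata bytes for `row_group` without touching the rest
    /// of the footer.
    fn row_group_metadata<'a>(
        footer: &'a [u8],
        index: &[IndexEntry],
        row_group: usize,
    ) -> Option<&'a [u8]> {
        let entry = index.get(row_group)?;
        let start = entry.offset as usize;
        footer.get(start..start + entry.length as usize)
    }

    fn main() {
        // Toy footer: two "encoded" row-group blobs back to back.
        let footer = b"RG0-METADATA....RG1-METADATA........";
        let index = [
            IndexEntry { offset: 0, length: 16 },
            IndexEntry { offset: 16, length: 20 },
        ];
        // Jump straight to row group 1; row group 0 is never decoded.
        let rg1 = row_group_metadata(footer, &index, 1).unwrap();
        println!("{}", String::from_utf8_lossy(rg1));
    }

With a million columns the payoff is that per-column work stays proportional to the columns you actually touch, not to the width of the file.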


On Sat, Oct 18, 2025, 9:58 PM Ed Seidl <[email protected]> wrote:

> Of course there's nothing to preclude adding just such an index to the
> current format.
>
> On 2025/10/17 22:10:36 Corwin Joy wrote:
> > For us, the exciting thing about the flatbuf footer approach is the
> > potential for fast random access. For wide tables, the metadata becomes
> > huge, and there is a lot of overhead to access a particular rowgroup. (See
> > previous discussions, e.g., https://github.com/apache/arrow/issues/38149).
> > Even if we can get a faster thrift parser, this is still limited, because
> > you have to parse the entire metadata, which is inherently slow. Pulling
> > information for a selected rowgroup is a lot faster.
> > Right now, we have a workaround: we create an external index to get fast
> > random access. (https://github.com/G-Research/PalletJack). But, having a
> > fast internal random access index like the proposed flatbuf footer would be
> > a big step forward.
> >
> > On Fri, Oct 17, 2025 at 8:50 AM Andrew Lamb <[email protected]> wrote:
> >
> > > Thanks Alkis, that is interesting data.
> > >
> > > > We found that the reported numbers were not reproducible on AWS instances
> > >
> > > I just updated the benchmark results[1] to include results from an
> > > AWS m6id.8xlarge instance (and they are indeed about 2x slower than when
> > > run on my 2023 Mac laptop)
> > >
> > > > You can find the summary of our findings in a separate tab in the
> > > > proposal document:
> > >
> > > Thank you, these are interesting. Can you share instructions on how to
> > > reproduce the reported numbers? I am interested in reviewing the code used
> > > to generate these results (esp. the C++ thrift code)
> > >
> > > Thanks
> > > Andrew
> > >
> > >
> > > [1]: https://github.com/alamb/parquet_footer_parsing?tab=readme-ov-file#results-on-linux
> > >
> > >
> > > On Fri, Oct 17, 2025 at 10:20 AM Alkis Evlogimenos
> > > <[email protected]> wrote:
> > >
> > > > Thank you Andrew for putting the code in open source so that we can
> > > > repro it.
> > > >
> > > > We have run the rust benchmarks and also run the flatbuf proposal with our
> > > > C++ thrift parser, the flatbuf footer with Thrift conversion, the
> > > > flatbuf footer without Thrift conversion, and the flatbuf footer
> > > > without Thrift conversion and without verification. You can find the
> > > > summary of our findings in a separate tab in the proposal document:
> > > >
> > > >
> > > > https://docs.google.com/document/d/1kZS_DM_J8n6NKff3vDQPD1Y4xyDdRceYFANUE0bOfb0/edit?tab=t.ve65qknb3sq1#heading=h.3uwb5liauf1s
> > > >
> > > > The TLDR is that flatbuf is 5x faster with the Thrift conversion vs the
> > > > optimized Thrift parsing. It also remains faster than the Thrift parser
> > > > even if the Thrift parser skips statistics. Furthermore, if Thrift
> > > > conversion is skipped, the speedup is 50x, and if verification is skipped
> > > > it goes beyond 150x.
> > > >
> > > >
> > > > On Tue, Sep 30, 2025 at 5:56 PM Andrew Lamb <[email protected]>
> > > > wrote:
> > > >
> > > > > Hello,
> > > > >
> > > > > I did some benchmarking for the new parser[2] we are working on in
> > > > > arrow-rs.
> > > > >
> > > > > This benchmark achieves nearly an order of magnitude improvement (7x)
> > > > > parsing Parquet metadata with no changes to the Parquet format, by
> > > > > simply writing a more efficient thrift decoder (which can also skip
> > > > > statistics).
> > > > >
> > > > > While we have not implemented a similar decoder in other languages
> > > > > such as C/C++ or Java, given the similarities in the existing thrift
> > > > > libraries and usage, we expect similar improvements are possible in
> > > > > those languages as well.
> > > > >
> > > > >
> > > > >
> > > > > You can find full details here [1]
> > > > >
> > > > > Andrew
> > > > >
> > > > >
> > > > > [1]: https://github.com/alamb/parquet_footer_parsing
> > > > > [2]: https://github.com/apache/arrow-rs/issues/5854
> > > > >
> > > > >
> > > > > On Wed, Sep 24, 2025 at 5:59 PM Ed Seidl <[email protected]> wrote:
> > > > >
> > > > >> > Concerning Thrift optimization, while a 2-3x improvement might be
> > > > >> > achievable, Flatbuffers are currently demonstrating a 10x improvement.
> > > > >> > Andrew, do you have a more precise estimate for the speedup we could
> > > > >> > expect in C++?
> > > > >>
> > > > >> Given my past experience on cuDF, I'd estimate about 2X there as well.
> > > > >> cuDF has its own metadata parser that I once benchmarked against the
> > > > >> thrift-generated parser.
> > > > >>
> > > > >> And I'd point out that beyond the initial 2X improvement, rolling your
> > > > >> own parser frees you from having to parse out every structure in the
> > > > >> metadata.
> > > > >>
> > > > >
> > > >
> > >
> >
>
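
To make the "skip what you don't need" point in the quoted thread a bit more concrete, here is a very rough sketch of why a hand-rolled decoder can be so much cheaper than a generated one. The field ids and the length-prefixed encoding below are made up for illustration; this is not the Thrift compact protocol and not the arrow-rs decoder. The point is only the control flow: wanted fields are decoded, everything else (statistics included) is hopped over by advancing a cursor.

    // Hypothetical field ids, purely for the sketch.
    const FIELD_NUM_ROWS: u16 = 1;
    const FIELD_STATISTICS: u16 = 2;

    /// Each field is encoded as: field id (u16 LE) | payload length (u32 LE) | payload.
    /// Wanted fields are decoded; everything else (e.g. statistics) is skipped by
    /// advancing the cursor, with nothing allocated or materialized.
    fn decode_row_group(mut buf: &[u8], want_statistics: bool) -> Option<i64> {
        let mut num_rows = None;
        while buf.len() >= 6 {
            let field_id = u16::from_le_bytes([buf[0], buf[1]]);
            let len = u32::from_le_bytes([buf[2], buf[3], buf[4], buf[5]]) as usize;
            let payload = buf.get(6..6 + len)?;
            match field_id {
                FIELD_NUM_ROWS if payload.len() == 8 => {
                    let mut raw = [0u8; 8];
                    raw.copy_from_slice(payload);
                    num_rows = Some(i64::from_le_bytes(raw));
                }
                FIELD_STATISTICS if !want_statistics => {
                    // The whole point: a hand-rolled decoder can hop over large
                    // fields it does not care about instead of deserializing them.
                }
                _ => { /* unknown or unwanted field: also skipped */ }
            }
            buf = &buf[6 + len..];
        }
        num_rows
    }

    fn main() {
        // num_rows = 42, followed by a large (fake) statistics blob we never decode.
        let mut encoded = Vec::new();
        encoded.extend_from_slice(&FIELD_NUM_ROWS.to_le_bytes());
        encoded.extend_from_slice(&8u32.to_le_bytes());
        encoded.extend_from_slice(&42i64.to_le_bytes());
        encoded.extend_from_slice(&FIELD_STATISTICS.to_le_bytes());
        encoded.extend_from_slice(&1024u32.to_le_bytes());
        encoded.extend_from_slice(&[0u8; 1024]);
        assert_eq!(decode_row_group(&encoded, false), Some(42));
    }

A generated parser has to materialize every struct it encounters; a decoder shaped like the above only pays for the fields the reader asked for, which is where most of the reported speedups come from.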
