I don't disagree that two files are much harder to manage than one, but is
that the use-case the flatbuffer format is solving for, or is that
adequately serviced by the existing thrift-based footer? I had
interpreted the flatbuffer more as a way to accelerate larger datasets
consisting of many files, and as being of less utility for the single-file use-case.
That being said, I misread the proposal: I thought it was proposing to
replace the thrift-based footer with a flatbuffer, which would be very
disruptive. However, it looks like the (new?) proposal is instead to
create a duplicate flatbuffer footer embedded within the thrift footer,
which readers can simply ignore. The proposal is a bit vague about
whether all information would be duplicated, or whether some information
would only be present in the flatbuffer payload, but presuming it is a
true duplicate, many of my points don't apply.
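For concreteness, here is a minimal sketch (mine, not the proposal's) of the standard footer layout - serialized thrift FileMetaData, then a 4-byte little-endian footer length, then the "PAR1" magic. The point is that a flatbuffer payload carried inside the thrift structure (for example in key_value_metadata; where exactly it would live is an assumption on my part) is just opaque bytes to a reader that stops at this layout:

    // Minimal C++ sketch: locate the serialized thrift FileMetaData at the end of
    // a Parquet file. Any flatbuffer payload embedded *inside* that thrift structure
    // is invisible to a reader that only does this and then hands the bytes to its
    // thrift parser.
    #include <cstdint>
    #include <cstring>
    #include <utility>

    // Returns {offset, length} of the thrift footer, or {0, 0} if the file is malformed.
    std::pair<uint64_t, uint64_t> LocateThriftFooter(const uint8_t* file, uint64_t file_len) {
      if (file_len < 12) return {0, 0};                   // header magic + length + footer magic
      if (std::memcmp(file + file_len - 4, "PAR1", 4) != 0) return {0, 0};
      uint32_t footer_len = 0;
      std::memcpy(&footer_len, file + file_len - 8, 4);   // little-endian; fine on x86/ARM
      if (uint64_t(footer_len) + 12 > file_len) return {0, 0};
      return {file_len - 8 - footer_len, footer_len};
    }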
Kind Regards,
Raphael
On 20/10/2025 15:28, Antoine Pitrou wrote:
I don't think it's a "small price to pay". Parquet files are widely
used to share or transfer data of all kinds (in a way, they replace CSV
with much better characteristics). Sharing a single file is easy,
sharing two related files while keeping their relationship intact is an
order of magnitude more difficult.
Regards
Antoine.
On Mon, 20 Oct 2025 12:23:20 +0100, Personal <[email protected]> wrote:
Apologies if this has already been discussed, but have we considered simply
storing these flatbuffers as separate files alongside existing parquet files? I
think this would have a number of quite compelling advantages:
- no breaking format changes, all readers can continue to read all parquet files
- people can generate these "index" files for existing datasets without having to rewrite all their files
- older and newer readers can coexist - no stop-the-world migrations
- multiple flatbuffers can potentially be combined into a single file for better IO when scanning collections of files - potentially very valuable for object stores, and it would also help people on HDFS and other systems that struggle with small files
- these flatbuffers could potentially even be inlined into catalogs like Iceberg
- we can continue to iterate at a faster rate, without the constraint of needing to move in lockstep with parquet versioning
- potentially less confusing for users: parquet files are still the same, they can simply be accelerated by these new index files
This would mean some data duplication, but that seems a small price to pay, and
would be strictly opt-in for users with use-cases that justify it?
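To illustrate the opt-in aspect, here is a rough sketch of how a newer reader might discover such a sidecar file - the ".index" suffix and lookup convention are purely hypothetical, not something any proposal has settled on:

    // Hypothetical C++ sketch: look for an optional index file next to the Parquet
    // file. Older readers never look for it; newer readers fall back to the normal
    // thrift footer when it is missing.
    #include <filesystem>
    #include <optional>

    std::optional<std::filesystem::path> FindSidecarIndex(
        const std::filesystem::path& parquet_file) {
      std::filesystem::path candidate = parquet_file;
      candidate += ".index";               // e.g. "data.parquet" -> "data.parquet.index"
      if (std::filesystem::exists(candidate)) return candidate;
      return std::nullopt;                 // no index present: read the file as before
    }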
Kind Regards,
Raphael
On 20 October 2025 11:08:59 BST, Alkis Evlogimenos
<[email protected]> wrote:
Thank you, these are interesting. Can you share instructions on how to
reproduce the reported numbers? I am interested to review the code used to
generate these results (esp the C++ thrift code)
The numbers are based on internal code (Photon). They are not very far off
from https://github.com/apache/arrow/pull/43793. I will update that PR in
the coming weeks so that we can repro the same benchmarks with open source
code too.
On Fri, Oct 17, 2025 at 5:52 PM Andrew Lamb <[email protected]> wrote:
Thanks Alkis, that is interesting data.
We found that the reported numbers were not reproducible on AWS instances
I just updated the benchmark results[1] to include results from an
AWS m6id.8xlarge instance (and they are indeed about 2x slower than when
run on my 2023 Mac laptop).
You can find the summary of our findings in a separate tab in the
proposal document:
Thank you, these are interesting. Can you share instructions on how to
reproduce the reported numbers? I am interested to review the code used to
generate these results (esp the C++ thrift code)
Thanks
Andrew
[1]:
https://github.com/alamb/parquet_footer_parsing?tab=readme-ov-file#results-on-linux
On Fri, Oct 17, 2025 at 10:20 AM Alkis Evlogimenos
<[email protected]> wrote:
Thank you Andrew for putting the code in open source so that we can repro
it.
We have run the Rust benchmarks and have also benchmarked four
configurations: our C++ thrift parser, the flatbuf footer with Thrift
conversion, the flatbuf footer without Thrift conversion, and the flatbuf
footer without Thrift conversion and without verification. You can find the
summary of our findings in a separate tab in the proposal document:
https://docs.google.com/document/d/1kZS_DM_J8n6NKff3vDQPD1Y4xyDdRceYFANUE0bOfb0/edit?tab=t.ve65qknb3sq1#heading=h.3uwb5liauf1s
The TLDR is that the flatbuf footer is 5x faster with Thrift conversion than
the optimized Thrift parsing, and it remains faster than the Thrift parser
even when the Thrift parser skips statistics. Furthermore, if Thrift
conversion is skipped the speedup is 50x, and if verification is also skipped
it goes beyond 150x.
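For readers less familiar with flatbuffers, the gap between the last two numbers comes down to whether the buffer is verified before use. A minimal C++ sketch - the FileMetadata root type and VerifyFileMetadataBuffer() are illustrative stand-ins for whatever the proposal's generated schema provides:

    // Minimal sketch of flatbuffer footer access with and without verification.
    // #include "file_metadata_generated.h"   // hypothetical flatc-generated header
    #include "flatbuffers/flatbuffers.h"
    #include <cstddef>

    const FileMetadata* ParseFlatbufFooter(const uint8_t* buf, size_t len, bool verify) {
      if (verify) {
        // Verification walks every offset and vtable to check it stays in bounds.
        flatbuffers::Verifier verifier(buf, len);
        if (!VerifyFileMetadataBuffer(verifier)) return nullptr;
      }
      // Without verification, GetRoot() just reinterprets the bytes in place;
      // fields are read lazily on access, which is where the larger speedups come from.
      return flatbuffers::GetRoot<FileMetadata>(buf);
    }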
On Tue, Sep 30, 2025 at 5:56 PM Andrew Lamb <[email protected]>
wrote:
Hello,
I did some benchmarking for the new parser[2] we are working on in arrow-rs.
This benchmark achieves nearly an order of magnitude improvement (7x) in
parsing Parquet metadata with no changes to the Parquet format, simply by
writing a more efficient thrift decoder (which can also skip statistics).
While we have not implemented a similar decoder in other languages such as
C/C++ or Java, given the similarities in the existing thrift libraries and
their usage, we expect similar improvements are possible in those languages
as well.
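To give a flavour of what skipping statistics means in practice - this is only an illustrative sketch, not the arrow-rs decoder, and it omits maps, bool-in-list elements and all bounds checks - a hand-written thrift compact-protocol reader can consume an unwanted field without materializing anything:

    // Skip one thrift compact-protocol value in place.
    #include <cstdint>

    static uint64_t ConsumeVarint(const uint8_t*& p) {
      uint64_t v = 0;
      int shift = 0;
      while (*p & 0x80) { v |= uint64_t(*p++ & 0x7f) << shift; shift += 7; }
      v |= uint64_t(*p++) << shift;
      return v;
    }

    // Compact-protocol type codes (subset).
    enum CType : uint8_t { STOP = 0, CT_TRUE = 1, CT_FALSE = 2, I8 = 3, I16 = 4,
                           I32 = 5, I64 = 6, DOUBLE = 7, BINARY = 8, LIST = 9,
                           SET = 10, STRUCT = 12 };

    static void SkipValue(const uint8_t*& p, uint8_t type) {
      switch (type) {
        case CT_TRUE: case CT_FALSE: break;           // bool value lives in the type nibble
        case I8:      ++p; break;
        case I16: case I32: case I64: ConsumeVarint(p); break;  // zigzag varints
        case DOUBLE:  p += 8; break;
        case BINARY: { uint64_t n = ConsumeVarint(p); p += n; break; }  // length-prefixed
        case LIST: case SET: {
          uint8_t header = *p++;                      // (size << 4) | element type
          uint64_t n = header >> 4;
          if (n == 0xF) n = ConsumeVarint(p);         // long form for sizes >= 15
          for (uint64_t i = 0; i < n; ++i) SkipValue(p, header & 0x0F);
          break;
        }
        case STRUCT: {                                // skip nested fields until STOP
          for (;;) {
            uint8_t field = *p++;
            if (field == STOP) break;
            if ((field >> 4) == 0) ConsumeVarint(p);  // explicit field id (zigzag varint)
            SkipValue(p, field & 0x0F);
          }
          break;
        }
        default: break;                               // MAP etc. omitted here
      }
    }

A decoder for ColumnMetaData can call SkipValue() when it reaches the statistics field instead of deserializing it into an owned structure.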
You can find full details here [1]
Andrew
[1]: https://github.com/alamb/parquet_footer_parsing
[2]: https://github.com/apache/arrow-rs/issues/5854
On Wed, Sep 24, 2025 at 5:59 PM Ed Seidl <[email protected]> wrote:
Concerning Thrift optimization, while a 2-3x improvement might be achievable,
Flatbuffers are currently demonstrating a 10x improvement.
Andrew, do you have a more precise estimate for the speedup we could expect in C++?
Given my past experience on cuDF, I'd estimate about 2X there as well.
cuDF has its own metadata parser that I once benchmarked against the
thrift-generated parser.
And I'd point out that beyond the initial 2X improvement, rolling your
own parser frees you from having to parse out every structure in the
metadata.