On Wed, 24 Sep 2025 12:37:13 +0200
Alkis Evlogimenos <[email protected]> wrote:
> Thank you all for taking the time to go through the doc and your feedback.
> I'd like to address some of the key points raised:
> 
> Regarding nested Flatbuffers, there's no parsing benefit to using them. In
> the current prototype, approximately two-thirds of the decoding cost comes
> from converting the Flatbuffer to `FileMetadata` (the Thrift object) to
> simplify the rollout process. Even with this conversion, we're observing a
> greater than 10x improvement in footer decoding time for footers that
> perform poorly with Thrift (at the p999 percentile). Removing the
> `FileMetadata` translation should easily provide another 2x speedup.

1. Your own numbers show p50 performance at around 1x, not
10x. It's nice that p999 (!!) performance is so good, but
that probably doesn't paint a representative picture of overall
performance.

2. It would be useful to have p05 and p01 performance results, by
the way. For now we know only about the best results, not the worst,
which is a bit surprising.

3. As you said in one of the comments: "even without Thrift, we still
have to verify the flatbuf which means we still have to walk all the
bytes". Nested Flatbuffers would avoid verifying the flatbuf data for
unused columns or indices, for example.
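
To make points 1 and 2 concrete, here is a minimal sketch of how the
speedup could be reported across the whole distribution rather than at
a single percentile. The function names and the timings are
hypothetical placeholders, not measurements from the prototype, and
taking the ratio of per-percentile decode times is only one possible
way to define the speedup.

```python
import math


def percentile(samples, q):
    """Nearest-rank percentile of samples, with q in (0, 100]."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def speedup_by_percentile(thrift_us, flatbuf_us, qs):
    """Ratio of Thrift to Flatbuffer decode time at each percentile.

    A value above 1.0 means the Flatbuffer footer decodes faster at
    that percentile; a value below 1.0 means it decodes slower.
    """
    return {
        q: percentile(thrift_us, q) / percentile(flatbuf_us, q)
        for q in qs
    }


# Placeholder decode times in microseconds (invented for illustration).
thrift_us = [10, 20, 30, 400]
flatbuf_us = [10, 20, 30, 40]

report = speedup_by_percentile(thrift_us, flatbuf_us, qs=(25, 50, 100))
for q, ratio in report.items():
    print(f"p{q}: {ratio:.1f}x")
```

With these placeholder inputs only the slowest footer shows a 10x gain,
while the lower percentiles sit at 1x, which is the kind of spread a
single tail percentile hides.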

> Finally, the 32-bit narrowing of row group sizes appears to be the most
> contentious aspect of the design. I suggest we discuss this live during our
> next Parquet sync.

Well, not everyone can regularly make it to the Parquet syncs. Important
spec discussions should be accessible to anyone regardless of their
personal/professional schedules.

> For the record, shrinking the offsets is the second most
> significant optimization for Flatbuffer footer size, with statistics being
> the first.

I'm curious whether LZ4 would make the optimization less significant.

Regards

Antoine.

