We've analyzed a large footer from our production environment to understand
byte distribution across its fields. The detailed analysis is available in
the proposal document:
https://docs.google.com/document/d/1kZS_DM_J8n6NKff3vDQPD1Y4xyDdRceYFANUE0bOfb0/edit?tab=t.o2lsuuyi8rw6#heading=h.26i914tjp4fk

To illustrate the impact of 64-bit fields, we conducted an experiment where
all proposed 32-bit fields in the Flatbuf footer were changed to 64-bit.
This resulted in a *40% increase* in footer size.

That said, LZ4 manages to compress this overhead away. We will do some more
testing with 64-bit offsets/numvals/sizes and report back. If it all goes
well, we can resolve this by going 64-bit.
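
For anyone who wants to play with this locally, here is a minimal sketch of
the kind of comparison we ran. It uses synthetic monotonically increasing
offsets and the python-lz4 package rather than the real Flatbuf footer
schema, so the absolute numbers are illustrative only:

    # Rough sketch: synthetic offsets, not the actual Flatbuf footer layout.
    import random
    import struct

    import lz4.frame  # pip install lz4

    # Simulate column-chunk offsets within a ~1 GiB row group.
    random.seed(42)
    offsets = sorted(random.randrange(1 << 30) for _ in range(10_000))

    # Encode the same offsets as 32-bit and as 64-bit little-endian integers.
    buf32 = struct.pack(f"<{len(offsets)}I", *offsets)
    buf64 = struct.pack(f"<{len(offsets)}Q", *offsets)

    for name, buf in (("32-bit", buf32), ("64-bit", buf64)):
        compressed = lz4.frame.compress(buf)
        print(f"{name}: raw={len(buf)} bytes, lz4={len(compressed)} bytes")

The measurements on the real footer are in the proposal document linked
above.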


On Wed, Oct 15, 2025 at 12:49 PM Jan Finis <[email protected]> wrote:

> Hi Alkis,
>
> one more very simple argument for why you want these offsets to be i64:
> What if you want to store a single value larger than 4GB? I know this
> sounds absurd at first, but some use cases might want to store data that
> can sometimes be very large (e.g. blob data, or insanely complex geo
> data). And it would be a shame if that meant they could not use Parquet
> at all.
>
> Thus, my opinion here is that we can limit to i32 all fields that the file
> writer has under control, e.g., the number of rows within a row group, but
> we shouldn't limit any values that a file writer doesn't have under
> control, as they fully depend on the input data.
>
> Note though that this means that the number of values in a column chunk
> could also exceed i32, if a user has nested data with more than 4 billion
> entries. With such data, the file writer again couldn't do anything to
> avoid writing a row group with more than i32 values, as a single row may
> not span multiple row groups. That
> being said, I think that nested data with more than 4 billion entries is
> less likely than a single large blob of 4 billion bytes.
>
> I know that smaller row groups are what most, if not all, engines prefer,
> but we have to make sure the format also works for edge cases.
>
> Cheers,
> Jan
>
> On Wed, Oct 15, 2025 at 5:05 AM Adam Reeve <[email protected]> wrote:
>
> > Hi Alkis
> >
> > Thanks for all your work on this proposal.
> >
> > I'd be in favour of keeping the offsets as i64 and not reducing the
> > maximum row group size, even if this results in slightly larger footers.
> > I've heard from some of our users within G-Research that they do have
> > files with row groups > 2 GiB. This is often when they use lower-level
> > APIs to write Parquet that don't automatically split data into row
> > groups, and they either write a single row group for simplicity or have
> > some logical partitioning of data into row groups. They might also have
> > wide tables with many columns, or wide array/tensor valued columns that
> > lead to large row groups.
> >
> > In many workflows we don't read Parquet with a query engine that
> > supports filters and skipping row groups, but just read all rows, or
> > directly specify the row groups to read if there is some known logical
> > partitioning into row groups. I'm sure we could work around a 2 or 4 GiB
> > row group size limitation if we had to, but it's a new constraint that
> > reduces the flexibility of the format and makes more work for users who
> > now need to ensure they don't hit this limit.
> >
> > Do you have any measurements of how much of a difference 4-byte offsets
> > make to footer sizes in your data, with and without the optional LZ4
> > compression?
> >
> > Thanks,
> > Adam
> >
> > On Tue, 14 Oct 2025 at 21:02, Alkis Evlogimenos
> > <[email protected]> wrote:
> >
> > > Hi all,
> > >
> > > From the comments on the [EXTERNAL] Parquet metadata
> > > <https://docs.google.com/document/d/1kZS_DM_J8n6NKff3vDQPD1Y4xyDdRceYFANUE0bOfb0/edit?tab=t.0>
> > > document, it appears there's a general consensus on most aspects, with
> > > the exception of the relative 32-bit offsets for column chunks.
> > >
> > > I'm starting this thread to discuss this topic further and work towards
> > > a resolution. Adam Reeve suggested raising the limitation to 2^32, and
> > > he confirmed that Java does not have any issues with this. I am open to
> > > this change as it increases the limit without introducing any drawbacks.
> > >
> > > However, some still feel that a 2^32-byte limit for a row group is too
> > > restrictive. I'd like to understand these specific use cases better.
> > > From my perspective, for most engines, the row group is the primary
> > > unit of skipping, making very large row groups less desirable. In our
> > > fleet's workloads, it's rare to see row groups larger than 100MB, as
> > > anything larger tends to make statistics-based skipping ineffective.
> > >
> > > Cheers,
> > >
> >
>
