I had an idea about this topic.
What if we say the offset is always a multiple of 16? (I'm saying 16, but
it works with 8 or 32 or any other power of 2.)
Then we store in the footer the offset divided by 16.
That means you need to pad each row group by up to 15 bytes.
But now the max size of the file is 32GB.

Personally, I still don't like having arbitrary limits, but 32GB seems a
lot less restrictive than 2GB.
If we want to get fancy, we add the multiplier to the footer as metadata
and the writer gets to pick whether offsets are multiplied by 32, 64, or
128, in case ten years from now we start having much bigger files.
The padding overhead is negligible relative to the size of the file.
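
To make the arithmetic concrete, here is a rough sketch (mine, purely
illustrative, with made-up names) of what a writer and a reader would do,
assuming a 16-byte alignment and a signed 32-bit stored offset:

// Rough sketch of aligned, scaled row group offsets; illustrative only.
public final class AlignedOffsets {
    // Hypothetical alignment; any power of two (8, 16, 32, ...) works.
    static final int ALIGNMENT = 16;
    static final int SHIFT = Integer.numberOfTrailingZeros(ALIGNMENT); // 4

    // Writer: round the current file position up to the next multiple of
    // ALIGNMENT before starting a row group (the 0-15 bytes of padding).
    static long alignUp(long filePos) {
        return (filePos + ALIGNMENT - 1) & ~((long) ALIGNMENT - 1);
    }

    // Writer: store offset / 16 in the footer. With a signed 32-bit field
    // this caps the file at (2^31 - 1) * 16, roughly 32GB.
    static int encodeOffset(long alignedFilePos) {
        long scaled = alignedFilePos >>> SHIFT;
        if (scaled > Integer.MAX_VALUE) {
            throw new IllegalStateException("offset exceeds the ~32GB limit");
        }
        return (int) scaled;
    }

    // Reader: multiply the stored value back by 16 to get the byte offset.
    static long decodeOffset(int storedOffset) {
        return ((long) storedOffset) << SHIFT;
    }
}

The per-file multiplier variant just means SHIFT comes from a footer field
instead of being a constant, so the reader shifts by whatever the writer
declared.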

Thoughts?


On Tue, Oct 21, 2025 at 6:19 AM Alkis Evlogimenos
<[email protected]> wrote:

> We've analyzed a large footer from our production environment to understand
> byte distribution across its fields. The detailed analysis is available in
> the proposal document here:
>
> https://docs.google.com/document/d/1kZS_DM_J8n6NKff3vDQPD1Y4xyDdRceYFANUE0bOfb0/edit?tab=t.o2lsuuyi8rw6#heading=h.26i914tjp4fk
> .
>
> To illustrate the impact of 64-bit fields, we conducted an experiment where
> all proposed 32-bit fields in the Flatbuf footer were changed to 64-bit.
> This resulted in a *40% increase* in footer size.
>
> That said, LZ4 manages to compress this away. We will do some more testing
> with 64-bit offsets/numvals/sizes and report back. If it all goes well, we
> can resolve this by going 64-bit.
>
>
> On Wed, Oct 15, 2025 at 12:49 PM Jan Finis <[email protected]> wrote:
>
> > Hi Alkis,
> >
> > one more very simple argument why you want these offsets to be i64:
> > What if you want to store a single value larger than 4GB? I know this
> > sounds absurd at first, but some use cases might want to store data that
> > can sometimes be very large (e.g. blob data, or insanely complex geo data).
> > And it would be a shame if that would mean that they cannot use Parquet at all.
> >
> > Thus, my opinion here is that we can limit to i32 all fields that the file
> > writer has under control, e.g., the number of rows within a row group, but
> > we shouldn't limit any values that a file writer doesn't have under
> > control, as they fully depend on the input data.
> >
> > Note though that this means that the number of values in a column chunk
> > could also exceed i32, if a user has nested data with more than 4 billion
> > entries. With such data, the file writer again couldn't do anything to
> > avoid writing a row group with more
> > than i32 values, as a single row may not span multiple row groups. That
> > being said, I think that nested data with more than 4 billion entries is
> > less likely than a single large blob of 4 billion bytes.
> >
> > I know that smaller row groups are what most / all engines prefer, but we
> > have to make sure the format also works for edge cases.
> >
> > Cheers,
> > Jan
> >
> > On Wed, Oct 15, 2025 at 5:05 AM Adam Reeve <[email protected]> wrote:
> >
> > > Hi Alkis
> > >
> > > Thanks for all your work on this proposal.
> > >
> > > I'd be in favour of keeping the offsets as i64 and not reducing the maximum
> > > row group size, even if this results in slightly larger footers. I've heard
> > > from some of our users within G-Research that they do have files with row
> > > groups > 2 GiB. This is often when they use lower-level APIs to write
> > > Parquet that don't automatically split data into row groups, and they
> > > either write a single row group for simplicity or have some logical
> > > partitioning of data into row groups. They might also have wide tables with
> > > many columns, or wide array/tensor valued columns that lead to large row
> > > groups.
> > >
> > > In many workflows we don't read Parquet with a query engine that supports
> > > filters and skipping row groups, but just read all rows, or directly
> > > specify the row groups to read if there is some known logical partitioning
> > > into row groups. I'm sure we could work around a 2 or 4 GiB row group size
> > > limitation if we had to, but it's a new constraint that reduces the
> > > flexibility of the format and makes more work for users who now need to
> > > ensure they don't hit this limit.
> > >
> > > Do you have any measurements of how much of a difference 4 byte offsets
> > > make to footer sizes in your data, with and without the optional LZ4
> > > compression?
> > >
> > > Thanks,
> > > Adam
> > >
> > > On Tue, 14 Oct 2025 at 21:02, Alkis Evlogimenos
> > > <[email protected]> wrote:
> > >
> > > > Hi all,
> > > >
> > > > From the comments on the [EXTERNAL] Parquet metadata
> > > > <https://docs.google.com/document/d/1kZS_DM_J8n6NKff3vDQPD1Y4xyDdRceYFANUE0bOfb0/edit?tab=t.0>
> > > > document, it appears there's a general consensus on most aspects, with the
> > > > exception of the relative 32-bit offsets for column chunks.
> > > >
> > > > I'm starting this thread to discuss this topic further and work towards a
> > > > resolution. Adam Reeve suggested raising the limitation to 2^32, and he
> > > > confirmed that Java does not have any issues with this. I am open to this
> > > > change as it increases the limit without introducing any drawbacks.
> > > >
> > > > However, some still feel that a 2^32-byte limit for a row group is too
> > > > restrictive. I'd like to understand these specific use cases better. From
> > > > my perspective, for most engines, the row group is the primary unit of
> > > > skipping, making very large row groups less desirable. In our fleet's
> > > > workloads, it's rare to see row groups larger than 100MB, as anything
> > > > larger tends to make statistics-based skipping ineffective.
> > > >
> > > > Cheers,
> > > >
> > >
> >
>
