You are right that even without LZ4, we would still need to fetch the whole
footer from storage in one I/O. And I guess LZ4 decompression is way faster
than Thrift decoding, so Flatbuffers+LZ4 would still be an improvement over
Thrift. If we want truly partial decoding to pay off, we would indeed need
to somehow support reading only part of the footer from storage. In the
end, it's a trade-off: the more flexibility we want w.r.t. partial reads,
the more complexity we have to introduce. Maybe Flatbuffers alone is
already the sweet spot here and we shouldn't introduce additional
complexity. LZ4 compression would, after all, still be optional, right?
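
To make the trade-off a bit more concrete, here is a minimal reader-side
sketch (Python) of the "two sizes at the end of the file" idea from
Antoine's mail below. The concrete tail layout is purely my own assumption
for illustration: two little-endian u32 sizes followed by a 4-byte magic,
with the core footer stored as the last piece of the full footer, directly
in front of that trailer.

import struct

TAIL_LEN = 4 + 4 + 4  # assumed trailer: core_size (u32) | full_size (u32) | magic

def read_footer_bytes(f, file_size, need_column_metadata):
    # One tiny read for the trailer that carries both sizes.
    f.seek(file_size - TAIL_LEN)
    core_size, full_size, _magic = struct.unpack("<II4s", f.read(TAIL_LEN))
    # Strategy choice: schema / row-group skeleton only -> small read of the
    # core footer; per-column details needed -> one larger read of the full
    # footer (which is where the optional per-piece LZ4 would come in).
    size = full_size if need_column_metadata else core_size
    f.seek(file_size - TAIL_LEN - size)
    return f.read(size)

With something like this, a reader that only needs the schema and row-group
skeleton pays one small read, while a reader that wants per-column metadata
still gets everything in a single larger read.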

Someone mentioned that they have files with millions of columns. Maybe
they can comment on how much partial reading their use case would actually
require. I guess the answer will be "the more support for partial
reading/decoding, the better".
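
Just to put a very rough number on that use case (all figures below are
made up for illustration): even with a modest amount of metadata per column
chunk, such footers become enormous, so any query that touches only a
handful of columns pays a large fixed cost without partial reading/decoding.

columns = 1_000_000            # "millions of columns" use case
row_groups = 10                # assumed
bytes_per_column_chunk = 100   # assumed average metadata per column chunk
footer_bytes = columns * row_groups * bytes_per_column_chunk
print(footer_bytes / 2**20)    # ~953 MiB of column-chunk metadata alone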

You could argue that if you have such a wide file, you simply shouldn't use
LZ4, and that's probably a valid argument.

Cheers,
Jan



On Mon, Oct 27, 2025 at 09:28, Antoine Pitrou <
[email protected]> wrote:

>
> Hmmm... does it?
>
> I may be mistaken, but I had the impression that what you call "read
> only the parts of the footer I'm interested in" is actually "*decode*
> only the parts of the footer I'm interested in".
>
> That is, you still read the entire footer, which is a larger IO than
> doing smaller reads, but it's also a single IO rather than several
> smaller ones.
>
> Of course, if we want to make things more flexible, we can have
> individual Flatbuffers metadata pieces for each column, each
> LZ4-compressed. And embed two sizes at the end of the file: the size of
> the "core footer" metadata (without columns) and the size of the "full
> footer" metadata (with columns); so that readers can choose their
> preferred strategy.
>
> Regards
>
> Antoine.
>
>
> On Sat, 25 Oct 2025 14:39:37 +0200
> Jan Finis <[email protected]> wrote:
> > Note that LZ4 compression destroys the whole "I can read only the parts
> > of the footer I'm interested in", so I wouldn't say that LZ4 can be the
> > solution to everything.
> >
> > Cheers,
> > Jan
> >
> > On Sat, Oct 25, 2025, 12:33 Antoine Pitrou <
> [email protected]> wrote:
> >
> > > On Fri, 24 Oct 2025 12:12:02 -0700
> > > Julien Le Dem <[email protected]> wrote:
> > > > I had an idea about this topic.
> > > > What if we say the offset is always a multiple of 16? (I'm saying 16,
> > > > but it works with 8 or 32 or any other power of 2).
> > > > Then we store in the footer the offset divided by 16.
> > > > That means you need to pad each row group by up to 16 bytes.
> > > > But now the max size of the file is 32GB.
> > > >
> > > > Personally, I still don't like having arbitrary limits but 32GB seems
> > > > a lot less like a restricting limit than 2GB.
> > > > If we get crazy, we add this to the footer as metadata and the writer
> > > > gets to pick whether you multiply offsets by 32, 64 or 128 if ten
> > > > years from now we start having much bigger files.
> > > > The size of the padding becomes negligible over the size of the file.
> > > >
> > > > Thoughts?
> > >
> > > That's an interesting suggestion. I would be fine with it personally,
> > > provided the multiplier is either large enough (say, 64) or embedded in
> > > the footer.
> > >
> > > That said, I would first wait for the outcome of the experiment with
> > > LZ4 compression. If it negates the additional cost of 64-bit offsets,
> > > then we should not bother with this multiplier mechanism.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > > >
> > > >
> > > > On Tue, Oct 21, 2025 at 6:19 AM Alkis Evlogimenos
> > > > <[email protected]> wrote:
> > > >
> > > > > We've analyzed a large footer from our production environment to
> > > > > understand byte distribution across its fields. The detailed
> > > > > analysis is available in the proposal document here:
> > > > >
> > > > > https://docs.google.com/document/d/1kZS_DM_J8n6NKff3vDQPD1Y4xyDdRceYFANUE0bOfb0/edit?tab=t.o2lsuuyi8rw6#heading=h.26i914tjp4fk
> > > > >
> > > > > To illustrate the impact of 64-bit fields, we conducted an
> > > > > experiment where all proposed 32-bit fields in the Flatbuf footer
> > > > > were changed to 64-bit. This resulted in a *40% increase* in footer
> > > > > size.
> > > > >
> > > > > That said, LZ4 manages to compress this away. We will do some more
> > > > > testing with 64 bit offsets/numvals/sizes and report back. If it all
> > > > > goes well we can resolve this by going 64 bit.
> > > > >
> > > > >
> > > > > On Wed, Oct 15, 2025 at 12:49 PM Jan Finis <
> > > [email protected]> wrote:
> > > > >
> > > > > > Hi Alkis,
> > > > > >
> > > > > > one more very simple argument why you want these offsets to be
> > > > > > i64: What if you want to store a single value larger than 4GB? I
> > > > > > know this sounds absurd at first, but some use cases might want to
> > > > > > store data that can sometimes be very large (e.g. blob data, or
> > > > > > insanely complex geo data). And it would be a shame if that would
> > > > > > mean that they cannot use Parquet at all.
> > > > > >
> > > > > > Thus, my opinion here is that we can limit to i32 all fields that
> > > > > > the file writer has under control, e.g., the number of rows within
> > > > > > a row group, but we shouldn't limit any values that a file writer
> > > > > > doesn't have under control, as they fully depend on the input data.
> > > > > >
> > > > > > Note though that this means that the number of values in a column
> > > > > > chunk could also exceed i32, if a user has nested data with more
> > > > > > than 4 billion entries. With such data, the file writer again
> > > > > > couldn't do anything to avoid writing a row group with more than
> > > > > > i32 values, as a single row may not span multiple row groups. That
> > > > > > being said, I think that nested data with more than 4 billion
> > > > > > entries is less likely than a single large blob of 4 billion bytes.
> > > > > >
> > > > > > I know that smaller row groups are what most / all engines prefer,
> > > > > > but we have to make sure the format also works for edge cases.
> > > > > >
> > > > > > Cheers,
> > > > > > Jan
> > > > > >
> > > > > > On Wed, Oct 15, 2025 at 05:05, Adam Reeve
> > > > > > <[email protected]> wrote:
> > > > > >
> > > > > > > Hi Alkis
> > > > > > >
> > > > > > > Thanks for all your work on this proposal.
> > > > > > >
> > > > > > > I'd be in favour of keeping the offsets as i64 and not reducing
> > > > > > > the maximum row group size, even if this results in slightly
> > > > > > > larger footers. I've heard from some of our users within
> > > > > > > G-Research that they do have files with row groups > 2 GiB. This
> > > > > > > is often when they use lower-level APIs to write Parquet that
> > > > > > > don't automatically split data into row groups, and they either
> > > > > > > write a single row group for simplicity or have some logical
> > > > > > > partitioning of data into row groups. They might also have wide
> > > > > > > tables with many columns, or wide array/tensor valued columns
> > > > > > > that lead to large row groups.
> > > > > > >
> > > > > > > In many workflows we don't read Parquet with a query engine that
> > > > > > > supports filters and skipping row groups, but just read all rows,
> > > > > > > or directly specify the row groups to read if there is some known
> > > > > > > logical partitioning into row groups. I'm sure we could work
> > > > > > > around a 2 or 4 GiB row group size limitation if we had to, but
> > > > > > > it's a new constraint that reduces the flexibility of the format
> > > > > > > and makes more work for users who now need to ensure they don't
> > > > > > > hit this limit.
> > > > > > >
> > > > > > > Do you have any measurements of how much of a difference 4 byte
> > > > > > > offsets make to footer sizes in your data, with and without the
> > > > > > > optional LZ4 compression?
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Adam
> > > > > > >
> > > > > > > On Tue, 14 Oct 2025 at 21:02, Alkis Evlogimenos
> > > > > > > <alkis.evlogimenos-z4fuwbjybqlnpcjqcok8iauzikbjl79t-xmd5yjdbdmrexy1tmh2...@public.gmane.org>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi all,
> > > > > > > >
> > > > > > > > From the comments on the Parquet metadata document
> > > > > > > > <https://docs.google.com/document/d/1kZS_DM_J8n6NKff3vDQPD1Y4xyDdRceYFANUE0bOfb0/edit?tab=t.0>,
> > > > > > > > it appears there's a general consensus on most aspects, with
> > > > > > > > the exception of the relative 32-bit offsets for column chunks.
> > > > > > > >
> > > > > > > > I'm starting this thread to discuss this topic further and
> > > > > > > > work towards a resolution. Adam Reeve suggested raising the
> > > > > > > > limitation to 2^32, and he confirmed that Java does not have
> > > > > > > > any issues with this. I am open to this change as it increases
> > > > > > > > the limit without introducing any drawbacks.
> > > > > > > >
> > > > > > > > However, some still feel that a 2^32-byte limit for a row
> > > > > > > > group is too restrictive. I'd like to understand these specific
> > > > > > > > use cases better. From my perspective, for most engines, the
> > > > > > > > row group is the primary unit of skipping, making very large
> > > > > > > > row groups less desirable. In our fleet's workloads, it's rare
> > > > > > > > to see row groups larger than 100MB, as anything larger tends
> > > > > > > > to make statistics-based skipping ineffective.
> > > > > > > >
> > > > > > > > Cheers,
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > >
> > >
> > >
> >
>
>
>
>
