Re: [DISCUSS] flatbuf footer: offsets

Alkis Evlogimenos Mon, 03 Nov 2025 06:33:03 -0800

Assuming LZ4 decompression at 2 GB/s (per core) and network bandwidth at
1 GB/s, and taking as an example the 367 MB thrift footer in the proposal,
the tradeoff is as follows:
T=thrift, F32=flatbuf with 32-bit offsets, F64=flatbuf with 64-bit offsets


T (367 MB): 50ms latency + 370ms transfer --> 420ms (ignoring parse time)
F32 (113 MB raw / 50 MB lz4): 50ms latency + 50ms transfer + 56ms
decompression --> 156ms
F64 (155 MB raw / 52 MB lz4): 50ms latency + 52ms transfer + 78ms
decompression --> 180ms
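
The arithmetic behind these three totals can be reproduced with a small
sketch. It assumes decimal units (1 MB = 10^6 bytes), a fixed 50ms
first-byte latency, and LZ4 throughput measured on the decompressed output,
which is how the figures above appear to be derived. The names Footer and
fetch_ms are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass

MB = 1_000_000        # decimal megabyte (assumption)
GB = 1_000_000_000    # decimal gigabyte (assumption)


@dataclass
class Footer:
    name: str
    raw_bytes: int    # size after decompression (or plain size for thrift)
    wire_bytes: int   # bytes actually transferred over the network
    compressed: bool  # True if the footer is LZ4-compressed on the wire


def fetch_ms(f: Footer, latency_ms: float = 50.0,
             net_bytes_per_s: float = 1 * GB,
             lz4_bytes_per_s: float = 2 * GB) -> float:
    """Latency + transfer + decompression time in milliseconds (no parsing)."""
    transfer_ms = f.wire_bytes / net_bytes_per_s * 1000
    # Decompression cost is charged on the raw (output) size.
    decompress_ms = f.raw_bytes / lz4_bytes_per_s * 1000 if f.compressed else 0.0
    return latency_ms + transfer_ms + decompress_ms


THRIFT = Footer("T", 367 * MB, 367 * MB, compressed=False)
F32 = Footer("F32", 113 * MB, 50 * MB, compressed=True)
F64 = Footer("F64", 155 * MB, 52 * MB, compressed=True)

if __name__ == "__main__":
    for footer in (THRIFT, F32, F64):
        print(f"{footer.name}: {fetch_ms(footer):.1f} ms")
```

Without rounding, this gives 417 ms for T, 156.5 ms for F32 and 179.5 ms for
F64; the figures above round the thrift transfer up to 370ms.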

Going with 64-bit offsets leaves some performance on the table, and it will
make LZ4 compression pretty much required for most footers above 256 KB.
That said, a flatbuf footer with 64-bit offsets is still much faster to
fetch than thrift, even ignoring thrift's horrendous parse times.

For simplicity I am still slightly in favor of 64-bit offsets, but I am open
to arguments for 32-bit relative offsets plus alignment to bring the maximum
row group size to 64 GB.
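
To make the alignment option concrete, here is a minimal sketch of scaled
offsets. It assumes a power-of-two alignment and an offset field that stores
byte_offset / alignment; the function names are hypothetical and nothing
here is taken from the proposal's actual schema.

```python
def pad_to_alignment(position: int, alignment: int = 16) -> int:
    """Round a write position up to the next multiple of `alignment`."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a power of two")
    return (position + alignment - 1) & ~(alignment - 1)


def encode_offset(byte_offset: int, alignment: int = 16, bits: int = 32) -> int:
    """Store an aligned byte offset as a smaller scaled integer."""
    if byte_offset % alignment:
        raise ValueError("offset is not aligned")
    scaled = byte_offset // alignment
    if scaled >= 1 << bits:
        raise OverflowError("offset does not fit in the scaled field")
    return scaled


def decode_offset(scaled: int, alignment: int = 16) -> int:
    """Recover the byte offset from its scaled form."""
    return scaled * alignment


def max_addressable(alignment: int = 16, bits: int = 32) -> int:
    """Number of bytes addressable by a `bits`-wide scaled offset."""
    return (1 << bits) * alignment


if __name__ == "__main__":
    GIB = 1 << 30
    print(max_addressable(16, 32) // GIB, "GiB with unsigned 32-bit offsets")
    print(max_addressable(16, 31) // GIB, "GiB with signed 32-bit offsets")
```

With a 16-byte alignment, an unsigned 32-bit field reaches 64 GiB and a
signed one reaches 32 GiB, at a cost of at most 15 padding bytes per aligned
position.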

Thoughts?


On Tue, Oct 28, 2025 at 10:57 AM Antoine Pitrou <[email protected]> wrote:

>
> Hi,
>
> I expect LZ4 to be optional, but enabled by default by most writers.
> LZ4 decompression is extremely fast, typically several GB/s on a modern
> CPU.
>
> Regards
>
> Antoine.
>
>
> On Mon, 27 Oct 2025 17:06:07 +0100
> Jan Finis <[email protected]> wrote:
> > You are right that even without LZ4, we would still need I/O for the
> > whole footer. And I guess LZ4 is way faster than thrift, so flatbuf+LZ4
> > would be an improvement over thrift. If you want superb partial decoding,
> > we would indeed need to somehow support only reading part of the footer
> > from storage. In the end, it's a trade-off. The more flexibility we want
> > w.r.t. partial reads, the more complexity we have to introduce. Maybe
> > flatbuf alone is already the sweet spot here and we shouldn't introduce
> > additional complexity. LZ4 compression would after all still be optional,
> > right?
> >
> > Someone mentioned that they have footers with millions of columns. Maybe
> > they should comment on how much partial reading would be required for
> > their use case. I guess the answer will be "the more support for partial
> > reading/decoding the better".
> >
> > You could argue that if you have such a wide file, just don't use LZ4
> > then, and that's probably a valid argument.
> >
> > Cheers,
> > Jan
> >
> >
> >
> > On Mon, 27 Oct 2025 at 09:28, Antoine Pitrou <[email protected]> wrote:
> >
> > >
> > > Hmmm... does it?
> > >
> > > I may be mistaken, but I had the impression that what you call "read
> > > only the parts of the footer I'm interested in" is actually "*decode*
> > > only the parts of the footer I'm interested in".
> > >
> > > That is, you still read the entire footer, which is a larger IO than
> > > doing smaller reads, but it's also a single IO rather than several
> > > smaller ones.
> > >
> > > Of course, if we want to make things more flexible, we can have
> > > individual Flatbuffers metadata pieces for each column, each
> > > LZ4-compressed. And embed two sizes at the end of the file: the size of
> > > the "core footer" metadata (without columns) and the size of the "full
> > > footer" metadata (with columns); so that readers can choose their
> > > preferred strategy.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > > On Sat, 25 Oct 2025 14:39:37 +0200
> > > Jan Finis <[email protected]> wrote:
> > > > Note that LZ4 compression destroys the whole "I can read only the
> > > > parts of the footer I'm interested in", so I wouldn't say that LZ4
> > > > can be the solution to everything.
> > > >
> > > > Cheers,
> > > > Jan
> > > >
> > > > On Sat, Oct 25, 2025, 12:33 Antoine Pitrou <[email protected]> wrote:
> > > >
> > > > > On Fri, 24 Oct 2025 12:12:02 -0700
> > > > > Julien Le Dem <[email protected]> wrote:
> > > > > > I had an idea about this topic.
> > > > > > > What if we say the offset is always a multiple of 16? (I'm saying
> > > > > > > 16, but it works with 8 or 32 or any other power of 2.)
> > > > > > > Then we store in the footer the offset divided by 16.
> > > > > > > That means you need to pad each row group by up to 15 bytes.
> > > > > > > But now the max size of the file is 32GB.
> > > > > > >
> > > > > > > Personally, I still don't like having arbitrary limits, but 32GB
> > > > > > > seems a lot less like a restricting limit than 2GB.
> > > > > > > If we get crazy, we add this to the footer as metadata and the
> > > > > > > writer gets to pick whether you multiply offsets by 32, 64 or 128
> > > > > > > if ten years from now we start having much bigger files.
> > > > > > > The size of the padding becomes negligible over the size of the
> > > > > > > file.
> > > > > >
> > > > > > Thoughts?
> > > > >
> > > > > > That's an interesting suggestion. I would be fine with it
> > > > > > personally, provided the multiplier is either large enough (say,
> > > > > > 64) or embedded in the footer.
> > > > > >
> > > > > > That said, I would first wait for the outcome of the experiment
> > > > > > with LZ4 compression. If it negates the additional cost of 64-bit
> > > > > > offsets, then we should not bother with this multiplier mechanism.
> > > > >
> > > > > Regards
> > > > >
> > > > > Antoine.
> > > > >
> > > > >
> > > > > >
> > > > > >
> > > > > > > On Tue, Oct 21, 2025 at 6:19 AM Alkis Evlogimenos wrote:
> > > > > >
> > > > > > > > We've analyzed a large footer from our production environment
> > > > > > > > to understand byte distribution across its fields. The detailed
> > > > > > > > analysis is available in the proposal document here:
> > > > > > > >
> > > > > > > > https://docs.google.com/document/d/1kZS_DM_J8n6NKff3vDQPD1Y4xyDdRceYFANUE0bOfb0/edit?tab=t.o2lsuuyi8rw6#heading=h.26i914tjp4fk
> > > > > > > >
> > > > > > > > To illustrate the impact of 64-bit fields, we conducted an
> > > > > > > > experiment where all proposed 32-bit fields in the Flatbuf
> > > > > > > > footer were changed to 64-bit. This resulted in a *40% increase*
> > > > > > > > in footer size.
> > > > > > > >
> > > > > > > > That said, LZ4 manages to compress this away. We will do some
> > > > > > > > more testing with 64-bit offsets/numvals/sizes and report back.
> > > > > > > > If it all goes well we can resolve this by going 64-bit.
> > > > > > >
> > > > > > >
> > > > > > > > On Wed, Oct 15, 2025 at 12:49 PM Jan Finis wrote:
> > > > > > >
> > > > > > > > Hi Alkis,
> > > > > > > >
> > > > > > > > > one more very simple argument why you want these offsets to
> > > > > > > > > be i64: What if you want to store a single value larger than
> > > > > > > > > 4GB? I know this sounds absurd at first, but some use cases
> > > > > > > > > might want to store data that can sometimes be very large
> > > > > > > > > (e.g. blob data, or insanely complex geo data). And it would
> > > > > > > > > be a shame if that would mean that they cannot use Parquet at
> > > > > > > > > all.
> > > > > > > > >
> > > > > > > > > Thus, my opinion here is that we can limit to i32 all fields
> > > > > > > > > that the file writer has under control, e.g., the number of
> > > > > > > > > rows within a row group, but we shouldn't limit any values
> > > > > > > > > that a file writer doesn't have under control, as they fully
> > > > > > > > > depend on the input data.
> > > > > > > > >
> > > > > > > > > Note though that this means that the number of values in a
> > > > > > > > > column chunk could also exceed i32, if a user has nested data
> > > > > > > > > with more than 4 billion entries. With such data, the file
> > > > > > > > > writer again couldn't do anything to avoid writing a row group
> > > > > > > > > with more than i32 values, as a single row may not span
> > > > > > > > > multiple row groups. That being said, I think that nested data
> > > > > > > > > with more than 4 billion entries is less likely than a single
> > > > > > > > > large blob of 4 billion bytes.
> > > > > > > > >
> > > > > > > > > I know that smaller row groups are what most / all engines
> > > > > > > > > prefer, but we have to make sure the format also works for
> > > > > > > > > edge cases.
> > > > > > > >
> > > > > > > > Cheers,
> > > > > > > > Jan
> > > > > > > >
> > > > > > > > > On Wed, 15 Oct 2025 at 05:05, Adam Reeve wrote:
> > > > > > > >
> > > > > > > > > Hi Alkis
> > > > > > > > >
> > > > > > > > > Thanks for all your work on this proposal.
> > > > > > > > >
> > > > > > > > > > I'd be in favour of keeping the offsets as i64 and not
> > > > > > > > > > reducing the maximum row group size, even if this results in
> > > > > > > > > > slightly larger footers. I've heard from some of our users
> > > > > > > > > > within G-Research that they do have files with row groups
> > > > > > > > > > > 2 GiB. This is often when they use lower-level APIs to
> > > > > > > > > > write Parquet that don't automatically split data into row
> > > > > > > > > > groups, and they either write a single row group for
> > > > > > > > > > simplicity or have some logical partitioning of data into
> > > > > > > > > > row groups. They might also have wide tables with many
> > > > > > > > > > columns, or wide array/tensor valued columns that lead to
> > > > > > > > > > large row groups.
> > > > > > > > > >
> > > > > > > > > > In many workflows we don't read Parquet with a query engine
> > > > > > > > > > that supports filters and skipping row groups, but just read
> > > > > > > > > > all rows, or directly specify the row groups to read if
> > > > > > > > > > there is some known logical partitioning into row groups.
> > > > > > > > > > I'm sure we could work around a 2 or 4 GiB row group size
> > > > > > > > > > limitation if we had to, but it's a new constraint that
> > > > > > > > > > reduces the flexibility of the format and makes more work
> > > > > > > > > > for users who now need to ensure they don't hit this limit.
> > > > > > > > > >
> > > > > > > > > > Do you have any measurements of how much of a difference
> > > > > > > > > > 4 byte offsets make to footer sizes in your data, with and
> > > > > > > > > > without the optional LZ4 compression?
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Adam
> > > > > > > > >
> > > > > > > > > > On Tue, 14 Oct 2025 at 21:02, Alkis Evlogimenos wrote:
> > > > > > > > >
> > > > > > > > > > Hi all,
> > > > > > > > > >
> > > > > > > > > > > From the comments on the [EXTERNAL] Parquet metadata document
> > > > > > > > > > > <https://docs.google.com/document/d/1kZS_DM_J8n6NKff3vDQPD1Y4xyDdRceYFANUE0bOfb0/edit?tab=t.0>,
> > > > > > > > > > > it appears there's a general consensus on most aspects, with
> > > > > > > > > > > the exception of the relative 32-bit offsets for column
> > > > > > > > > > > chunks.
> > > > > > > > > >
> > > > > > > > > > > I'm starting this thread to discuss this topic further and
> > > > > > > > > > > work towards a resolution. Adam Reeve suggested raising the
> > > > > > > > > > > limitation to 2^32, and he confirmed that Java does not have
> > > > > > > > > > > any issues with this. I am open to this change as it
> > > > > > > > > > > increases the limit without introducing any drawbacks.
> > > > > > > > > > >
> > > > > > > > > > > However, some still feel that a 2^32-byte limit for a row
> > > > > > > > > > > group is too restrictive. I'd like to understand these
> > > > > > > > > > > specific use cases better. From my perspective, for most
> > > > > > > > > > > engines, the row group is the primary unit of skipping,
> > > > > > > > > > > making very large row groups less desirable. In our fleet's
> > > > > > > > > > > workloads, it's rare to see row groups larger than 100MB, as
> > > > > > > > > > > anything larger tends to make statistics-based skipping
> > > > > > > > > > > ineffective.
> > > > > > > > > >
> > > > > > > > > > Cheers,
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > >
> > >
> > >
> > >
> >
>
>
>
>
