Re: [DISCUSS] flatbuf footer: offsets

Antoine Pitrou Mon, 27 Oct 2025 01:28:08 -0700


Hmmm... does it?


I may be mistaken, but I had the impression that what you call "read
only the parts of the footer I'm interested in" is actually "*decode*
only the parts of the footer I'm interested in".

That is, you still read the entire footer, which is a larger IO than
doing smaller reads, but it's also a single IO rather than several
smaller ones.

Of course, if we want to make things more flexible, we can have
individual Flatbuffers metadata pieces for each column, each
LZ4-compressed. And embed two sizes at the end of the file: the size of
the "core footer" metadata (without columns) and the size of the "full
footer" metadata (with columns); so that readers can choose their
preferred strategy.
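
To make the two-size idea concrete, here is a minimal sketch. It is not part of any agreed format. It assumes, purely for illustration, that the per-column metadata pieces are written immediately before the core footer, and that the file ends with two little-endian 32-bit sizes (core first, then full). The names `TRAILER`, `build_file` and `footer_range` are hypothetical, and LZ4 compression of the individual pieces is left out.

```python
import struct

# Hypothetical trailer: <core_size: u32><full_size: u32>, little-endian.
TRAILER = struct.Struct("<II")


def build_file(data: bytes, column_meta: list[bytes], core_meta: bytes) -> bytes:
    """Lay out: data | per-column metadata | core footer | trailer."""
    columns = b"".join(column_meta)
    full_size = len(columns) + len(core_meta)
    return data + columns + core_meta + TRAILER.pack(len(core_meta), full_size)


def footer_range(file_bytes: bytes, want_columns: bool) -> tuple[int, int]:
    """Return the (start, end) byte range a reader would fetch.

    The reader first reads the fixed-size trailer, then chooses either the
    small core footer or the full footer including column metadata.
    """
    end = len(file_bytes) - TRAILER.size
    core_size, full_size = TRAILER.unpack_from(file_bytes, end)
    size = full_size if want_columns else core_size
    return end - size, end
```

A reader that only needs the schema-level metadata would fetch `footer_range(f, False)`; one that wants everything in a single IO would fetch `footer_range(f, True)`.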

Regards

Antoine.


On Sat, 25 Oct 2025 14:39:37 +0200
Jan Finis <[email protected]> wrote:
> Note that LZ4 compression destroys the whole "I can read only the parts of
> the footer I'm interested in", so I wouldn't say that LZ4 can be the
> solution to everything.
> 
> Cheers,
> Jan
> 
> On Sat, Oct 25, 2025, 12:33 Antoine Pitrou 
> <[email protected]> wrote:
> 
> > On Fri, 24 Oct 2025 12:12:02 -0700
> > Julien Le Dem <[email protected]> wrote:  
> > > I had an idea about this topic.
> > > What if we say the offset is always a multiple of 16? (I'm saying 16, but
> > > it works with 8 or 32 or any other power of 2).
> > > Then we store in the footer the offset divided by 16.
> > > That means you need to pad each row group by up to 16 bytes.
> > > But now the max size of the file is 32GB.
> > >
> > > Personally, I still don't like having arbitrary limits but 32GB seems a
> > > lot less like a restricting limit than 2GB.
> > > If we get crazy, we add this to the footer as metadata and the writer
> > > gets to pick whether you multiply offsets by 32, 64 or 128 if ten years
> > > from now we start having much bigger files.
> > > The size of the padding becomes negligible over the size of the file.
> > >
> > > Thoughts?  
> >
> > That's an interesting suggestion. I would be fine with it personally,
> > provided the multiplier is either large enough (say, 64) or embedded in
> > the footer.
> >
> > That said, I would first wait for the outcome of the experiment with
> > LZ4 compression. If it negates the additional cost of 64-bit offsets,
> > then we should not bother with this multiplier mechanism.
> >
> > Regards
> >
> > Antoine.
> >
> >  
> > >
> > >
> > > On Tue, Oct 21, 2025 at 6:19 AM Alkis Evlogimenos
> > > <[email protected]> wrote:
> > >  
> > > > We've analyzed a large footer from our production environment to
> > > > understand byte distribution across its fields. The detailed analysis is
> > > > available in the proposal document here:
> > > >
> > > > https://docs.google.com/document/d/1kZS_DM_J8n6NKff3vDQPD1Y4xyDdRceYFANUE0bOfb0/edit?tab=t.o2lsuuyi8rw6#heading=h.26i914tjp4fk
> > > >
> > > > To illustrate the impact of 64-bit fields, we conducted an experiment
> > > > where all proposed 32-bit fields in the Flatbuf footer were changed to
> > > > 64-bit. This resulted in a *40% increase* in footer size.
> > > >
> > > > That said, LZ4 manages to compress this away. We will do some more
> > > > testing with 64 bit offsets/numvals/sizes and revert back. If it all goes
> > > > well we can resolve this by going 64 bit.
> > > >
> > > >
> > > > On Wed, Oct 15, 2025 at 12:49 PM Jan Finis <[email protected]> wrote:
> > > >  
> > > > > Hi Alkis,
> > > > >
> > > > > one more very simple argument why you want these offsets to be i64:
> > > > > What if you want to store a single value larger than 4GB? I know this
> > > > > sounds absurd at first, but some use cases might want to store data
> > > > > that can sometimes be very large (e.g. blob data, or insanely complex
> > > > > geo data). And it would be a shame if that would mean that they cannot
> > > > > use Parquet at all.
> > > > >
> > > > > Thus, my opinion here is that we can limit to i32 all fields that the
> > > > > file writer has under control, e.g., the number of rows within a row
> > > > > group, but we shouldn't limit any values that a file writer doesn't
> > > > > have under control, as they fully depend on the input data.
> > > > >
> > > > > Note though that this means that the number of values in a column
> > > > > chunk could also exceed i32, if a user has nested data with more than
> > > > > 4 billion entries. With such data, the file writer again couldn't do
> > > > > anything to avoid writing a row group with more than i32 values, as a
> > > > > single row may not span multiple row groups. That being said, I think
> > > > > that nested data with more than 4 billion entries is less likely than
> > > > > a single large blob of 4 billion bytes.
> > > > >
> > > > > I know that smaller row groups is what most / all engines prefer, but
> > > > > we have to make sure the format also works for edge cases.
> > > > >
> > > > > Cheers,
> > > > > Jan
> > > > >
> > > > > On Wed, 15 Oct 2025 at 05:05, Adam Reeve <[email protected]> wrote:
> > > > >  
> > > > > > Hi Alkis
> > > > > >
> > > > > > Thanks for all your work on this proposal.
> > > > > >
> > > > > > I'd be in favour of keeping the offsets as i64 and not reducing the
> > > > > > maximum row group size, even if this results in slightly larger
> > > > > > footers. I've heard from some of our users within G-Research that
> > > > > > they do have files with row groups > 2 GiB. This is often when they
> > > > > > use lower-level APIs to write Parquet that don't automatically split
> > > > > > data into row groups, and they either write a single row group for
> > > > > > simplicity or have some logical partitioning of data into row groups.
> > > > > > They might also have wide tables with many columns, or wide
> > > > > > array/tensor valued columns that lead to large row groups.
> > > > > >
> > > > > > In many workflows we don't read Parquet with a query engine that
> > > > > > supports filters and skipping row groups, but just read all rows, or
> > > > > > directly specify the row groups to read if there is some known
> > > > > > logical partitioning into row groups. I'm sure we could work around a
> > > > > > 2 or 4 GiB row group size limitation if we had to, but it's a new
> > > > > > constraint that reduces the flexibility of the format and makes more
> > > > > > work for users who now need to ensure they don't hit this limit.
> > > > > >
> > > > > > Do you have any measurements of how much of a difference 4 byte
> > > > > > offsets make to footer sizes in your data, with and without the
> > > > > > optional LZ4 compression?
> > > > > >
> > > > > > Thanks,
> > > > > > Adam
> > > > > >
> > > > > > On Tue, 14 Oct 2025 at 21:02, Alkis Evlogimenos
> > > > > > <alkis.evlogimenos-z4fuwbjybqlnpcjqcok8iauzikbjl79t-xmd5yjdbdmrexy1tmh2...@public.gmane.org>
> > > > > > wrote:
> > > > > >  
> > > > > > > Hi all,
> > > > > > >
> > > > > > > From the comments on the [EXTERNAL] Parquet metadata
> > > > > > > <https://docs.google.com/document/d/1kZS_DM_J8n6NKff3vDQPD1Y4xyDdRceYFANUE0bOfb0/edit?tab=t.0>
> > > > > > > document, it appears there's a general consensus on most aspects,
> > > > > > > with the exception of the relative 32-bit offsets for column
> > > > > > > chunks.
> > > > > > >
> > > > > > > I'm starting this thread to discuss this topic further and work
> > > > > > > towards a resolution. Adam Reeve suggested raising the limitation
> > > > > > > to 2^32, and he confirmed that Java does not have any issues with
> > > > > > > this. I am open to this change as it increases the limit without
> > > > > > > introducing any drawbacks.
> > > > > > >
> > > > > > > However, some still feel that a 2^32-byte limit for a row group is
> > > > > > > too restrictive. I'd like to understand these specific use cases
> > > > > > > better. From my perspective, for most engines, the row group is the
> > > > > > > primary unit of skipping, making very large row groups less
> > > > > > > desirable. In our fleet's workloads, it's rare to see row groups
> > > > > > > larger than 100MB, as anything larger tends to make
> > > > > > > statistics-based skipping ineffective.
> > > > > > >
> > > > > > > Cheers,
> > > > > > >  
> > > > > >  
> > > > >  
> > > >  
> > >  
> >
> >
> >
> >  
>
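
The offset-scaling idea from Julien's quoted message can be checked with a little arithmetic. The sketch below is only an illustration under stated assumptions: offsets are stored as signed 32-bit values (the 2GB limit discussed in the thread), the multiplier is 16, and the helper names `pad_to_alignment`, `encode_offset` and `decode_offset` are hypothetical rather than anything from the proposal.

```python
ALIGNMENT = 16          # assumed multiplier; any power of 2 works the same way
MAX_I32 = 2**31 - 1     # largest value a signed 32-bit footer field can hold


def pad_to_alignment(pos: int, alignment: int = ALIGNMENT) -> int:
    """Number of padding bytes needed so that `pos` becomes a multiple of `alignment`."""
    return (-pos) % alignment


def encode_offset(byte_offset: int, alignment: int = ALIGNMENT) -> int:
    """Value stored in the footer: the byte offset divided by the alignment."""
    if byte_offset % alignment != 0:
        raise ValueError("offset must be aligned before it can be scaled")
    scaled = byte_offset // alignment
    if scaled > MAX_I32:
        raise OverflowError("scaled offset does not fit in a signed 32-bit field")
    return scaled


def decode_offset(scaled: int, alignment: int = ALIGNMENT) -> int:
    """Recover the byte offset from the stored value."""
    return scaled * alignment
```

With these assumptions the largest addressable offset is `decode_offset(MAX_I32)`, just under 32 GiB, and the padding added before any row group is at most `ALIGNMENT - 1` bytes.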
