Hi,
I expect LZ4 to be optional, but enabled by default by most writers.
LZ4 decompression is extremely fast, typically several GB/s on a modern
CPU.

Regards

Antoine.


On Mon, 27 Oct 2025 17:06:07 +0100
Jan Finis <[email protected]> wrote:
> You are right that even without LZ4, we would still need I/O for the
> whole footer. And I guess LZ4 is way faster than Thrift, so
> flatbuf+LZ4 would be an improvement over Thrift. If you want superb
> partial decoding, we would indeed need to somehow support only reading
> part of the footer from storage. In the end, it's a trade-off. The
> more flexibility we want w.r.t. partial reads, the more complexity we
> have to introduce. Maybe flatbuf alone is already the sweet spot here
> and we shouldn't introduce additional complexity. LZ4 compression
> would after all still be optional, right?
>
> Someone mentioned that they have footers with millions of columns.
> Maybe they should comment on how much partial reading would be
> required for their use case. I guess the answer will be "the more
> support for partial reading/decoding the better".
>
> You could argue that if you have such a wide file, just don't use LZ4
> then, and that's probably a valid argument.
>
> Cheers,
> Jan
>
> On Mon, 27 Oct 2025 at 09:28, Antoine Pitrou <
> [email protected]> wrote:
> >
> > Hmmm... does it?
> >
> > I may be mistaken, but I had the impression that what you call "read
> > only the parts of the footer I'm interested in" is actually "*decode*
> > only the parts of the footer I'm interested in".
> >
> > That is, you still read the entire footer, which is a larger IO than
> > doing smaller reads, but it's also a single IO rather than several
> > smaller ones.
> >
> > Of course, if we want to make things more flexible, we can have
> > individual Flatbuffers metadata pieces for each column, each
> > LZ4-compressed. And embed two sizes at the end of the file: the size
> > of the "core footer" metadata (without columns) and the size of the
> > "full footer" metadata (with columns), so that readers can choose
> > their preferred strategy.
> >
> > Regards
> >
> > Antoine.
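For concreteness, here is a minimal reader sketch (Python) of the
two-sizes idea. The tail layout is an assumption for illustration only,
not anything the proposal defines: it supposes the core footer is stored
as the trailing slice of the full footer, followed by two little-endian
u32 sizes and the 4-byte magic.

    import struct

    def read_footer(f, want_columns):
        # Assumed tail layout (illustration only):
        #   ...[column pieces][core footer][core_size:u32][full_size:u32][magic]
        f.seek(0, 2)                      # position at end of file
        end = f.tell()
        f.seek(end - 12)                  # 2 x u32 sizes + 4-byte magic
        core_size, full_size, magic = struct.unpack('<II4s', f.read(12))
        size = full_size if want_columns else core_size
        f.seek(end - 12 - size)           # core footer assumed to sit at
        return f.read(size)               # the tail of the full footer

Either strategy stays a single IO; the only difference is how many bytes
that one read covers, which is the point made above.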
> >
> > On Sat, 25 Oct 2025 14:39:37 +0200
> > Jan Finis <[email protected]> wrote:
> > > Note that LZ4 compression destroys the whole "I can read only the
> > > parts of the footer I'm interested in", so I wouldn't say that LZ4
> > > can be the solution to everything.
> > >
> > > Cheers,
> > > Jan
> > >
> > > On Sat, Oct 25, 2025, 12:33 Antoine Pitrou <
> > > [email protected]> wrote:
> > > >
> > > > On Fri, 24 Oct 2025 12:12:02 -0700
> > > > Julien Le Dem <[email protected]> wrote:
> > > > > I had an idea about this topic.
> > > > > What if we say the offset is always a multiple of 16? (I'm
> > > > > saying 16, but it works with 8 or 32 or any other power of 2.)
> > > > > Then we store in the footer the offset divided by 16.
> > > > > That means you need to pad each row group by up to 16 bytes.
> > > > > But now the max size of the file is 32 GB.
> > > > >
> > > > > Personally, I still don't like having arbitrary limits, but
> > > > > 32 GB seems a lot less like a restricting limit than 2 GB.
> > > > > If we get crazy, we add this to the footer as metadata and the
> > > > > writer gets to pick whether you multiply offsets by 32, 64 or
> > > > > 128 if ten years from now we start having much bigger files.
> > > > > The size of the padding becomes negligible over the size of
> > > > > the file.
> > > > >
> > > > > Thoughts?
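Julien's scheme is compact enough to state precisely. A sketch of the
encoding (the constant and the function names follow his example; none
of this is part of the proposal):

    MULTIPLIER = 16   # Julien's example; 8, 32, or any power of two works

    def encode_offset(offset):
        # The writer pads each row group so it starts on a 16-byte
        # boundary (wasting at most 15 bytes), then stores offset // 16
        # in an i32 field.
        assert offset % MULTIPLIER == 0
        scaled = offset // MULTIPLIER
        assert scaled <= 2**31 - 1        # must still fit the 32-bit field
        return scaled

    def decode_offset(scaled):
        return scaled * MULTIPLIER

    # Largest encodable offset: (2**31 - 1) * 16 bytes, i.e. the 32 GB figure.
    print(decode_offset(2**31 - 1) / 2**30)   # ~32.0 (GiB)

The trade is clear: at most 15 padding bytes per row group buys a 16x
larger addressable range while keeping 32-bit fields.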
> > > > That's an interesting suggestion. I would be fine with it
> > > > personally, provided the multiplier is either large enough
> > > > (say, 64) or embedded in the footer.
> > > >
> > > > That said, I would first wait for the outcome of the experiment
> > > > with LZ4 compression. If it negates the additional cost of 64-bit
> > > > offsets, then we should not bother with this multiplier
> > > > mechanism.
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >
> > > > > On Tue, Oct 21, 2025 at 6:19 AM Alkis Evlogimenos
> > > > > <alkis.evlogimenos-z4fuwbjybqlnpcjqcok8iauzikbjl...@public.gmane.org>
> > > > > wrote:
> > > > > > We've analyzed a large footer from our production environment
> > > > > > to understand byte distribution across its fields. The
> > > > > > detailed analysis is available in the proposal document here:
> > > > > > https://docs.google.com/document/d/1kZS_DM_J8n6NKff3vDQPD1Y4xyDdRceYFANUE0bOfb0/edit?tab=t.o2lsuuyi8rw6#heading=h.26i914tjp4fk
> > > > > >
> > > > > > To illustrate the impact of 64-bit fields, we conducted an
> > > > > > experiment where all proposed 32-bit fields in the Flatbuf
> > > > > > footer were changed to 64-bit. This resulted in a *40%
> > > > > > increase* in footer size.
> > > > > >
> > > > > > That said, LZ4 manages to compress this away. We will do some
> > > > > > more testing with 64-bit offsets/numvals/sizes and report
> > > > > > back. If it all goes well, we can resolve this by going
> > > > > > 64-bit.
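One can get a rough feel for the "LZ4 compresses it away" effect with a
toy experiment. This assumes the `lz4` package from PyPI and fabricated
offsets, so it is only suggestive, not a reproduction of the footer
measurement above:

    import random
    import struct
    import lz4.frame   # assumes the `lz4` package from PyPI

    random.seed(0)
    offsets, pos = [], 0
    for _ in range(100_000):
        offsets.append(pos)
        pos += random.randint(5_000, 15_000)   # fabricated chunk sizes

    for name, fmt in (('32-bit', 'i'), ('64-bit', 'q')):
        raw = struct.pack(f'<{len(offsets)}{fmt}', *offsets)
        print(name, len(raw), '->', len(lz4.frame.compress(raw)))

The high bytes of 64-bit offsets are mostly zeros, which is exactly the
kind of byte-level redundancy LZ4 removes cheaply.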
> > > > > >
> > > > > > On Wed, Oct 15, 2025 at 12:49 PM Jan Finis
> > > > > > <jpfinis-re5jqeeqqe8avxtiumwx3w-xmd5yjdbdmrexy1tmh2...@public.gmane.org>
> > > > > > wrote:
> > > > > > > Hi Alkis,
> > > > > > >
> > > > > > > one more very simple argument why you want these offsets to
> > > > > > > be i64: What if you want to store a single value larger
> > > > > > > than 4 GB? I know this sounds absurd at first, but some use
> > > > > > > cases might want to store data that can sometimes be very
> > > > > > > large (e.g. blob data, or insanely complex geo data). And
> > > > > > > it would be a shame if that meant they could not use
> > > > > > > Parquet at all.
> > > > > > >
> > > > > > > Thus, my opinion here is that we can limit to i32 all
> > > > > > > fields that the file writer has under control, e.g., the
> > > > > > > number of rows within a row group, but we shouldn't limit
> > > > > > > any values that a file writer doesn't have under control,
> > > > > > > as they fully depend on the input data.
> > > > > > >
> > > > > > > Note though that this means that the number of values in a
> > > > > > > column chunk could also exceed i32, if a user has nested
> > > > > > > data with more than 4 billion entries. With such data, the
> > > > > > > file writer again couldn't do anything to avoid writing a
> > > > > > > row group with more than i32 values, as a single row may
> > > > > > > not span multiple row groups. That being said, I think that
> > > > > > > nested data with more than 4 billion entries is less likely
> > > > > > > than a single large blob of 4 billion bytes.
> > > > > > >
> > > > > > > I know that smaller row groups are what most/all engines
> > > > > > > prefer, but we have to make sure the format also works for
> > > > > > > edge cases.
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Jan
> > > > > > >
> > > > > > > On Wed, 15 Oct 2025 at 05:05, Adam Reeve
> > > > > > > <adreeve-Re5JQEeQqe8-XMD5yJDbdMReXY1tMh2IBgC/[email protected]> wrote:
> > > > > > > > Hi Alkis,
> > > > > > > >
> > > > > > > > Thanks for all your work on this proposal.
> > > > > > > >
> > > > > > > > I'd be in favour of keeping the offsets as i64 and not
> > > > > > > > reducing the maximum row group size, even if this results
> > > > > > > > in slightly larger footers. I've heard from some of our
> > > > > > > > users within G-Research that they do have files with row
> > > > > > > > groups > 2 GiB. This is often when they use lower-level
> > > > > > > > APIs to write Parquet that don't automatically split data
> > > > > > > > into row groups, and they either write a single row group
> > > > > > > > for simplicity or have some logical partitioning of data
> > > > > > > > into row groups. They might also have wide tables with
> > > > > > > > many columns, or wide array/tensor-valued columns that
> > > > > > > > lead to large row groups.
> > > > > > > >
> > > > > > > > In many workflows we don't read Parquet with a query
> > > > > > > > engine that supports filters and skipping row groups, but
> > > > > > > > just read all rows, or directly specify the row groups to
> > > > > > > > read if there is some known logical partitioning into row
> > > > > > > > groups. I'm sure we could work around a 2 or 4 GiB row
> > > > > > > > group size limitation if we had to, but it's a new
> > > > > > > > constraint that reduces the flexibility of the format and
> > > > > > > > makes more work for users who now need to ensure they
> > > > > > > > don't hit this limit.
> > > > > > > >
> > > > > > > > Do you have any measurements of how much of a difference
> > > > > > > > 4-byte offsets make to footer sizes in your data, with
> > > > > > > > and without the optional LZ4 compression?
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Adam
> > > > > > > >
> > > > > > > > On Tue, 14 Oct 2025 at 21:02, Alkis Evlogimenos
> > > > > > > > <alkis.evlogimenos-z4fuwbjybqlnpcjqcok8iauzikbjl79t-xmd5yjdbdmrexy1tmh2...@public.gmane.org>
> > > > > > > > wrote:
> > > > > > > > > Hi all,
> > > > > > > > >
> > > > > > > > > From the comments on the Parquet metadata document
> > > > > > > > > <https://docs.google.com/document/d/1kZS_DM_J8n6NKff3vDQPD1Y4xyDdRceYFANUE0bOfb0/edit?tab=t.0>,
> > > > > > > > > it appears there's a general consensus on most aspects,
> > > > > > > > > with the exception of the relative 32-bit offsets for
> > > > > > > > > column chunks.
> > > > > > > > >
> > > > > > > > > I'm starting this thread to discuss this topic further
> > > > > > > > > and work towards a resolution. Adam Reeve suggested
> > > > > > > > > raising the limitation to 2^32, and he confirmed that
> > > > > > > > > Java does not have any issues with this. I am open to
> > > > > > > > > this change as it increases the limit without
> > > > > > > > > introducing any drawbacks.
> > > > > > > > >
> > > > > > > > > However, some still feel that a 2^32-byte limit for a
> > > > > > > > > row group is too restrictive. I'd like to understand
> > > > > > > > > these specific use cases better. From my perspective,
> > > > > > > > > for most engines, the row group is the primary unit of
> > > > > > > > > skipping, making very large row groups less desirable.
> > > > > > > > > In our fleet's workloads, it's rare to see row groups
> > > > > > > > > larger than 100 MB, as anything larger tends to make
> > > > > > > > > statistics-based skipping ineffective.
> > > > > > > > >
> > > > > > > > > Cheers,
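Jan's single-large-value argument further up the thread can be checked
with three lines of arithmetic (plain Python, nothing Parquet-specific):

    blob = 5 * 2**30                  # a single 5 GiB value in one cell
    print(blob > 2**31 - 1)           # True: overflows a signed 32-bit offset
    print(blob > 2**32 - 1)           # True: an unsigned one overflows too
    print(blob // 16 <= 2**31 - 1)    # True: a 16x multiplier would still cope

So a raised 2^32 limit alone does not cover the blob case, while either
i64 offsets or a multiplier scheme does.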
