Hello Alkis,
Do you think you could add your footer proposal to the proposals page?
https://github.com/apache/parquet-format/tree/master/proposals#active-proposals
That way it gets more visibility.
Cheers
Julien

On Tue, Oct 21, 2025 at 11:49 AM Steve Loughran <[email protected]> wrote:

> On Mon, 20 Oct 2025 at 18:24, Ed Seidl <[email protected]> wrote:
>
> > IIUC a flatbuffer aware decoder would read the last 36 bytes or so of the
> > file and look for a known UUID along with size information. With this it
> > could then read only the flatbuffer bytes. I think this would work as well
> > as current systems that prefetch some number of bytes in an attempt to get
> > the whole footer in a single get.
> >
> > Old readers, however, will have to fetch both footers, but won't have any
> > additional decoding work because the new footer is a binary field that can
> > be easily skipped.
> >
>
> It really depends what the readers do with footer prefetching. For the Java
> clients:
>
>    1. s3a classic stream: the backwards seek() switches it to random IO
>    mode; the next read() from the base of the thrift footer will pull in
>    fs.s3a.readahead.range of data. No penalty.
>    2. Google gs://. There's a footer cache option which will need to be set
>    to a larger value.
>    3. Azure abfs://. There's a footer cache option which will need to be set
>    to a larger value.
>    4. s3a + amazon analytics stream. This stream is *parquet aware* and
>    actually parses the footer to know what to predictively prefetch. The AWS
>    developers do know of this work; moving to support the new footer would
>    be the ideal strategy here.
>    5. Iceberg classic input. No idea.
>    6. Iceberg + amazon analytics. Same as S3A, though without some of the
>    tuning we've been doing for vector reads.
>
> I wouldn't worry too much about the impact of that footer size increase.
> Some extra footer prefetch options should compensate, and once apps move to
> a parquet v3 reader they've got a faster parse time. Of course, ironically,
> read time then may dominate even more there, so it'll be important to do
> that read as efficiently as possible (use a readFully() into a buffer, not
> lots of single byte read() calls).
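
To make the readFully() point concrete, here is a minimal Java sketch that contrasts the two read patterns. It is only an illustration under stated assumptions: the class and method names (FooterReadSketch, CountingStream, readByteByByte, readBulk) are hypothetical, an in-memory ByteArrayInputStream stands in for an object-store stream, and the 4096-byte size is arbitrary rather than a real footer size.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

class FooterReadSketch {

    /** Hypothetical helper: counts how many read calls reach the wrapped stream. */
    static final class CountingStream extends FilterInputStream {
        int calls = 0;

        CountingStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            calls++;
            return super.read();
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            calls++;
            return super.read(b, off, len);
        }
    }

    /** Pattern to avoid: one read() call per byte. */
    static byte[] readByteByByte(InputStream in, int len) throws IOException {
        byte[] buf = new byte[len];
        for (int i = 0; i < len; i++) {
            int b = in.read();
            if (b < 0) {
                throw new EOFException("stream ended after " + i + " bytes");
            }
            buf[i] = (byte) b;
        }
        return buf;
    }

    /** Preferred pattern: fill a preallocated buffer with readFully(). */
    static byte[] readBulk(InputStream in, int len) throws IOException {
        byte[] buf = new byte[len];
        // readFully() loops internally until the buffer is full or EOF is hit.
        new DataInputStream(in).readFully(buf);
        return buf;
    }

    public static void main(String[] args) throws IOException {
        byte[] footer = new byte[4096]; // arbitrary stand-in for footer bytes

        CountingStream slow = new CountingStream(new ByteArrayInputStream(footer));
        readByteByByte(slow, footer.length);

        CountingStream fast = new CountingStream(new ByteArrayInputStream(footer));
        readBulk(fast, footer.length);

        System.out.println("single-byte read calls: " + slow.calls);
        System.out.println("bulk read calls: " + fast.calls);
    }
}
```

Against a real object store each underlying call can be far more expensive than it is here, and a stream may return fewer bytes than requested, in which case readFully() simply issues further bulk reads until the buffer is full.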
