On Mon, 20 Oct 2025 at 18:24, Ed Seidl <[email protected]> wrote:

> IIUC a flatbuffer aware decoder would read the last 36 bytes or so of the
> file and look for a known UUID along with size information. With this it
> could then read only the flatbuffer bytes. I think this would work as well
> as current systems that prefetch some number of bytes in an attempt to get
> the whole footer in a single get.
>
> Old readers, however, will have to fetch both footers, but won't have any
> additional decoding work because the new footer is a binary field that can
> be easily skipped.
>

It really depends on what the readers do with footer prefetching. For the
Java clients:

1. s3a classic stream: the backwards seek() switches it to random IO mode;
   the next read() from the base of the thrift footer will pull in
   fs.s3a.readahead.range of data. No penalty.
2. google gs://: there's a footer cache option which will need to be set to
   a larger value.
3. azure abfs://: there's a footer cache option which will need to be set to
   a larger value.
4. s3a + amazon analytics stream: this stream is *parquet aware* and
   actually parses the footer to know what to predictively prefetch. The AWS
   developers do know of this work; moving to support the new footer would
   be the ideal strategy here.
5. Iceberg classic input: no idea.
6. iceberg + amazon analytics: same as S3A, though without some of the
   tuning we've been doing for vector reads.

I wouldn't worry too much about the impact of that footer size increase.
Some extra footer prefetch options should compensate, and once apps move to
a parquet v3 reader they've got a faster parse time.

Of course, ironically, read time may then dominate even more there, so it'll
be important to do that read as efficiently as possible (use a readFully()
into a buffer, not lots of single-byte read() calls).
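
To make the readFully() point concrete, here is a minimal sketch, not taken from any of the readers listed above. It uses java.io.RandomAccessFile on a local file as a stand-in for an object store stream; the class name, method names and the 64 KiB TAIL_PREFETCH value are hypothetical choices for illustration. The only format detail it relies on is the existing Parquet trailer: a 4-byte little-endian footer length followed by the "PAR1" magic. It does not attempt to locate the proposed flatbuffer footer, whose layout is not spelled out in this thread.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

class FooterTailRead {
    // Hypothetical prefetch size; a real reader would make this configurable.
    static final int TAIL_PREFETCH = 64 * 1024;

    /** Reads up to maxBytes from the end of the file with one seek and one readFully(). */
    static byte[] readTail(RandomAccessFile file, int maxBytes) throws IOException {
        long fileLength = file.length();
        int tailLength = (int) Math.min(fileLength, (long) maxBytes);
        byte[] tail = new byte[tailLength];
        file.seek(fileLength - tailLength);
        file.readFully(tail); // one bulk read instead of tailLength single-byte read() calls
        return tail;
    }

    /** Extracts the footer length from the last 8 bytes: 4-byte LE length, then "PAR1". */
    static int footerLength(byte[] tail) {
        if (tail.length < 8) {
            throw new IllegalArgumentException("tail shorter than the 8-byte trailer");
        }
        String magic = new String(tail, tail.length - 4, 4, StandardCharsets.US_ASCII);
        if (!"PAR1".equals(magic)) {
            throw new IllegalArgumentException("unexpected magic: " + magic);
        }
        return ByteBuffer.wrap(tail, tail.length - 8, 4)
                .order(ByteOrder.LITTLE_ENDIAN)
                .getInt();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: FooterTailRead <parquet-file>");
            return;
        }
        try (RandomAccessFile file = new RandomAccessFile(args[0], "r")) {
            byte[] tail = readTail(file, TAIL_PREFETCH);
            int length = footerLength(tail);
            // If the footer plus trailer fits in the tail, no second read is needed.
            boolean oneRead = (long) length + 8 <= tail.length;
            System.out.println("footer length=" + length + ", covered by one read=" + oneRead);
        }
    }
}
```

Against a Hadoop filesystem the equivalent call would presumably be the positioned readFully(position, buffer) on FSDataInputStream, which avoids the explicit seek(); whether that is the best choice for each of the streams above is something I have not verified.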
