On Mon, 20 Oct 2025 at 18:24, Ed Seidl <[email protected]> wrote:

> IIUC a flatbuffer aware decoder would read the last 36 bytes or so of the
> file and look for a known UUID along with size information. With this it
> could then read only the flatbuffer bytes. I think this would work as well
> as current systems that prefetch some number of bytes in an attempt to get
> the whole footer in a single get.
>
> Old readers, however, will have to fetch both footers, but won't have any
> additional decoding work because the new footer is a binary field that can
> be easily skipped.
>

It really depends on what the readers do with footer prefetching. For the
Java clients:


   1. s3a classic stream: the backwards seek() switches it to random IO
   mode; the next read() from the base of the thrift footer will pull in
   fs.s3a.readahead.range of data. No penalty.
   2. Google gs://: there's a footer cache option which will need to be set
   to a larger value.
   3. Azure abfs://: there's a footer cache option which will need to be set
   to a larger value.
   4. s3a + amazon analytics stream: this stream is *parquet aware* and
   actually parses the footer to know what to predictively prefetch. The AWS
   developers do know of this work; moving to support the new footer would be
   the ideal strategy here.
   5. Iceberg classic input: no idea.
   6. Iceberg + amazon analytics: same as S3A, though without some of the
   tuning we've been doing for vector reads.

I wouldn't worry too much about the impact of that footer size increase.
Some extra footer prefetch options should compensate, and once apps move to
a parquet v3 reader they've got a faster parse time. Of course, ironically,
read time may then dominate even more there, so it'll be important to do
that read as efficiently as possible (use a readFully() into a buffer, not
lots of single-byte read() calls).
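
To make the readFully() point concrete, here is a rough sketch against a
local file using java.io.RandomAccessFile rather than any of the object
store streams above. The class and method names (FooterTail, readTail,
footerLength, footerFitsInTail) are made up for illustration, and the only
format detail it relies on is the existing Parquet trailer: a 4-byte
little-endian footer length followed by the "PAR1" magic. It says nothing
about how the new footer would be laid out.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of parquet-java or Hadoop.
final class FooterTail {

    /** Read up to tailSize bytes from the end of the file with one seek and one readFully(). */
    public static byte[] readTail(String path, int tailSize) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            long length = file.length();
            int toRead = (int) Math.min(length, (long) tailSize);
            byte[] tail = new byte[toRead];
            file.seek(length - toRead);
            // One bulk read into the buffer instead of many single-byte read() calls.
            file.readFully(tail);
            return tail;
        }
    }

    /** Decode the thrift footer length from the 8-byte trailer at the end of the tail. */
    public static int footerLength(byte[] tail) {
        if (tail.length < 8) {
            throw new IllegalArgumentException("tail shorter than the 8-byte trailer");
        }
        String magic = new String(tail, tail.length - 4, 4, StandardCharsets.US_ASCII);
        if (!magic.equals("PAR1")) {
            throw new IllegalArgumentException("unexpected magic: " + magic);
        }
        return ByteBuffer.wrap(tail, tail.length - 8, 4)
                .order(ByteOrder.LITTLE_ENDIAN)
                .getInt();
    }

    /** True if the footer plus trailer is already in the tail, so no second read is needed. */
    public static boolean footerFitsInTail(byte[] tail) {
        return (long) footerLength(tail) + 8 <= tail.length;
    }
}
```

If footerFitsInTail() returns false, the reader has to go back for a second
read, which is the case a larger footer prefetch setting is meant to avoid.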
