Hello Alkis,
Do you think you could add your footer proposal to the proposals page?

https://github.com/apache/parquet-format/tree/master/proposals#active-proposals
That way it gets more visibility.
Cheers
Julien

On Tue, Oct 21, 2025 at 11:49 AM Steve Loughran <[email protected]>
wrote:

> On Mon, 20 Oct 2025 at 18:24, Ed Seidl <[email protected]> wrote:
>
> > IIUC a flatbuffer aware decoder would read the last 36 bytes or so of the
> > file and look for a known UUID along with size information. With this it
> > could then read only the flatbuffer bytes. I think this would work as well
> > as current systems that prefetch some number of bytes in an attempt to get
> > the whole footer in a single get.
> >
> > Old readers, however, will have to fetch both footers, but won't have any
> > additional decoding work because the new footer is a binary field that can
> > be easily skipped.
> >
>
> It really depends on what the readers do with footer prefetching. For the
> Java clients:
>
>
>    1. s3a classic stream: the backwards seek() switches it to random IO
>    mode; the next read() from the base of the thrift footer will pull in
>    fs.s3a.readahead.range of data. No penalty.
>    2. Google gs://: there's a footer cache option which will need to be set
>    to a larger value.
>    3. Azure abfs://: there's a footer cache option which will need to be set
>    to a larger value.
>    4. s3a + amazon analytics stream: this stream is *parquet aware* and
>    actually parses the footer to know what to predictively prefetch. The AWS
>    developers do know of this work; moving to support the new footer would be
>    the ideal strategy here.
>    5. Iceberg classic input: no idea.
>    6. Iceberg + amazon analytics: same as S3A, though without some of the
>    tuning we've been doing for vector reads.
>
> I wouldn't worry too much about the impact of that footer size increase.
> Some extra footer prefetch options should compensate, and once apps move to
> a parquet v3 reader they've got a faster parse time. Of course, ironically,
> read time may then dominate even more there, so it'll be important to do that
> read as efficiently as possible (use a readFully() into a buffer, not lots
> of single-byte read() calls).
>
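
For illustration only, here is a minimal Java sketch of the "one seek, one readFully() into a buffer" tail read mentioned above. It assumes a local file opened through java.io.RandomAccessFile rather than any of the object-store streams listed, and it parses only the existing plaintext Parquet trailer (a 4-byte little-endian footer length followed by the "PAR1" magic), not the proposed flatbuffer footer, whose layout is not spelled out in this thread. The class name, method names and the tail size passed in are hypothetical.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of parquet-java or Hadoop.
final class FooterTailRead {

    /** Reads the last min(fileLength, tailGuess) bytes with one seek and one readFully(). */
    static byte[] readTail(RandomAccessFile file, int tailGuess) throws IOException {
        long length = file.length();
        int n = (int) Math.min(length, (long) tailGuess);
        byte[] tail = new byte[n];
        file.seek(length - n);   // single backwards seek
        file.readFully(tail);    // single bulk read, no per-byte read() calls
        return tail;
    }

    /** Extracts the footer length from the 8-byte trailer: 4-byte LE length + "PAR1". */
    static int footerLength(byte[] tail) {
        if (tail.length < 8) {
            throw new IllegalArgumentException("tail shorter than the 8-byte trailer");
        }
        String magic = new String(tail, tail.length - 4, 4, StandardCharsets.US_ASCII);
        if (!magic.equals("PAR1")) {
            throw new IllegalArgumentException("missing PAR1 magic");
        }
        return ByteBuffer.wrap(tail, tail.length - 8, 4)
                .order(ByteOrder.LITTLE_ENDIAN)
                .getInt();
    }

    /** True if the whole footer plus trailer is already in the prefetched tail. */
    static boolean footerFitsInTail(byte[] tail) {
        return (long) footerLength(tail) + 8 <= tail.length;
    }
}
```

If footerFitsInTail() returns false, a reader would need a second read covering the full footer, which is the case a larger prefetch setting is meant to avoid.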
