Re: Parquet Footer Working Group - session 2

Jiayi Wang Thu, 11 Jun 2026 13:38:46 -0700

Hi all,

First, thank you to everyone who joined the meeting and made the effort to
help improve Parquet. Here is a recap of the meeting.


In the meeting, we defined four goals that we want the new footer design to
achieve:

   - *1. Minimize the number of dependent fetches*
   Today, modern Parquet readers usually issue one speculative read of the
   file tail, which also reveals the exact footer size. If the footer is
   larger than that read, we need to issue another fetch. These two fetches
   are sequential, and the second cannot be avoided under the current spec.
   - *2. Minimize footer size*
   Beyond reducing the number of fetches, we also want to minimize the
   total size of the footer, since footer size directly contributes to overall
   footer fetch latency.
   - *3. Make footer work proportional to projection*
   For example, when reading 5 out of 1,000 columns, the metadata work
   should ideally be close to 0.5% of reading the entire footer.
   - *4. Make metadata decoding fast in absolute terms*
   Even with projection-proportional footer work, metadata decoding should
   be fast in absolute terms. Today, parsing is slow due to Thrift overhead
   and the complex structure of the footer, so the new design should also
   reduce decoding overhead.
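
To make goal 1 concrete, here is a minimal sketch of the speculative tail
read as it works today. It relies only on the current file layout (the file
ends with a 4-byte little-endian footer length followed by the "PAR1"
magic). The `fetch(offset, length)` callback, the `read_footer` name, and
the 64 KiB default guess are assumptions made for illustration, not part of
any reader's API, and the sketch does not handle encrypted footers.

```python
import struct

MAGIC = b"PAR1"
TRAILER_LEN = 8  # 4-byte little-endian footer length + 4-byte magic


def read_footer(fetch, file_size, guess=64 * 1024):
    """Return (footer_bytes, number_of_fetches).

    `fetch(offset, length)` is a hypothetical range-read callback,
    e.g. a wrapper around an object-store GET with a Range header.
    """
    # Fetch 1: speculative read of the file tail.
    first_len = min(guess, file_size)
    tail = fetch(file_size - first_len, first_len)
    if len(tail) < TRAILER_LEN or tail[-4:] != MAGIC:
        raise ValueError("missing PAR1 trailer")

    (footer_len,) = struct.unpack("<I", tail[-8:-4])
    needed = footer_len + TRAILER_LEN
    if needed > file_size:
        raise ValueError("footer length exceeds file size")

    # Lucky case: the guess already covered the whole footer.
    if needed <= first_len:
        return tail[-needed:-TRAILER_LEN], 1

    # Fetch 2: depends on the length learned from fetch 1, so it
    # cannot be issued in parallel with it.
    head = fetch(file_size - needed, needed - first_len)
    return (head + tail)[:footer_len], 2
```

The second fetch only becomes possible once the first has returned, which
is the sequential dependency described in goal 1.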

We also discussed a benchmark set for comparing footer performance across
designs:

   1. *Selective queries on ultra-wide tables*
   Reading a small subset of columns should not require parsing metadata
   for every column chunk.
   2. *Many small row groups for point lookups*
   More row groups provide finer-grained statistics for pruning, but they
   also increase footer size because each row group adds more column
   statistics. Checking statistics for one filter column should not require
   parsing metadata for all columns across all row groups.
   3. *Large statistics increasing footer I/O*
   Large statistics, especially string min/max values, should not force
   extra footer reads when the query does not need them.
   4. *Deeply nested schemas*
   More group nodes and longer column paths increase schema complexity.
   Measure the impact of deeply nested schemas on footer decoding and schema
   resolution time.
   5. *Late materialization*
   Late materialization should be able to fetch row group statistics,
   ColumnIndex, and OffsetIndex only for the needed columns, without
   parsing the full footer first or fetching scattered indexes inefficiently.

We also discussed the edge conditions that the new Parquet footer work
should target. Cases beyond these limits are considered out of scope for
the current design discussion:

   1. Number of columns: up to 100K
   2. Number of row groups: up to 100K
   3. Number of pages in a single column chunk: up to 100K
   4. Nested struct depth: up to 100 levels
   5. Highly asymmetric columns: up to a 10,000:1 size ratio between columns

As the next step, I will polish the benchmark set and send a separate email
to the mailing list to ask for community review. Any advice, ideas, or
additional benchmark scenarios are welcome.

The rough agenda for the next session is to start discussing the core
design sketch of the modular footer. Each individual module will need a
more detailed design later. Looking forward to having more people join the
discussion and help shape the solution.

Best,
Jiayi

Jiayi Wang <[email protected]> wrote on Thu, Jun 11, 2026 at 18:09:

> Hi everyone,
>
> The next Parquet footer sync is today, Thursday Jun 11th, at 10am PT - 1pm
> ET - 7pm CET (in ~1h).
>
> To receive the invite, please subscribe to the Parquet calendar:
>
> https://calendar.google.com/calendar/u/0?cid=OTU2YjU0NGRlMGM5MjEwYTZlYzg1OWNhMGNjYTViMTZjNjhlYmRkNmJjYmY0ZmRlNGIzZjFjZTE5MzUwZGU2NkBncm91cC5jYWxlbmRhci5nb29nbGUuY29t
> Video call link: https://meet.google.com/fix-tsgt-ssd
>
> Best,
> Jiayi
>
> Jiayi Wang <[email protected]> wrote on Tue, Jun 9, 2026 at 21:40:
>
>> Hi Antoine,
>>
>> Thank you for pointing this out. I agree with your concern.
>>
>> My intention was not to make decisions during sync calls and only report
>> them afterward. I understand that for a foundational design such as a new
>> Parquet footer, the mailing list should be the primary place where design
>> discussions, trade-offs, objections, and decisions happen, so that everyone
>> can participate asynchronously.
>>
>> I will make sure that any sync calls are used only for clarification,
>> gathering input, and brainstorming, and that no binding decisions are made
>> there. For anything substantial, I will summarize the discussion on the
>> mailing list, raise the concrete proposal there, and leave enough time for
>> feedback before considering a decision.
>>
>> Thanks for the reminder!
>>
>> To clarify the agenda for the next sync session: last time, we proposed
>> establishing a set of footer examples that perform poorly under the current
>> spec. These examples would serve as evaluation cases for the new footer
>> design. In the upcoming session, we will review the proposed footer
>> examples and brainstorm additional scenarios or ideas. After that, we will
>> bring the summary, examples, and open questions to the official mailing
>> list for review and further input from the community.
>>
>> I hope this addresses your concerns.
>>
>> Best,
>> Jiayi
>>
>> Antoine Pitrou <[email protected]> wrote on Tue, Jun 9, 2026 at 16:23:
>>
>>>
>>> Hello Jiayi,
>>>
>>> On 09/06/2026 at 11:24, Jiayi Wang wrote:
>>> >
>>> > Please note that all decisions will be communicated publicly via the
>>> > mailing list, and I will also keep the mailing list updated on
>>> > progress.
>>>
>>> This is good, but I don't think it's sufficient.
>>>
>>> The "Apache Way" stresses Open Communications:
>>>
>>> """as a virtual organization, the ASF requires all communications
>>> related to code and decision-making to be publicly accessible to ensure
>>> asynchronous collaboration, as necessitated by a globally-distributed
>>> community."""
>>>
>>> (from https://www.apache.org/theapacheway/)
>>>
>>> Making decisions during sync video calls does not allow for asynchronous
>>> collaboration. Designing a new Parquet footer is foundational for the
>>> project; we should not skimp on proper collaboration practices.
>>>
>>> Regards
>>>
>>> Antoine.
>>>
>>>
>>>
