Hi all,

First, thank you to everyone who joined the meeting and made the effort to help improve Parquet. Here is a recap of the meeting.

In the meeting, we defined four goals that we want the new footer design to achieve:

- *1. Minimize the number of dependent fetches*
  Today, modern Parquet readers usually perform one speculative read to determine the exact footer size. If the footer is larger than expected, the reader needs to issue another fetch. These two fetches are sequential and cannot be avoided under the current spec.

- *2. Minimize footer size*
  Beyond reducing the number of fetches, we also want to minimize the total size of the footer, since footer size directly contributes to overall footer fetch latency.

- *3. Make footer work proportional to projection*
  For example, when reading 5 out of 1,000 columns, the metadata work should ideally be close to 0.5% of the work of reading the entire footer.

- *4. Make metadata decoding fast in absolute terms*
  Even with projection-proportional footer work, metadata decoding should be fast in absolute terms. Today, parsing is slow due to Thrift overhead and the complex structure of the footer, so the new design should also reduce decoding overhead.

We also discussed a benchmark set for comparing footer performance across designs:

1. *Selective queries on ultra-wide tables*
   Reading a small subset of columns should not require parsing metadata for every column chunk.

2. *Many small row groups for point lookups*
   More row groups provide finer-grained statistics for pruning, but they also increase footer size because each row group adds more column statistics. Checking statistics for one filter column should not require parsing metadata for all columns across all row groups.

3. *Large statistics increasing footer I/O*
   Large statistics, especially string min/max values, should not force extra footer reads when the query does not need them.

4. *Deeply nested schemas*
   More group nodes and longer column paths increase schema complexity. Measure the impact of deeply nested schemas on footer decoding and schema resolution time.

5. *Late materialization*
   Late materialization should be able to fetch row group statistics, ColumnIndex, and OffsetIndex only for the needed columns, without parsing the full footer first or fetching scattered indexes inefficiently.

We also discussed the edge conditions that the new Parquet footer work should target. Cases beyond these limits are considered out of scope for the current design discussion:

1. Number of columns: up to 100K
2. Number of row groups: up to 100K
3. Number of pages in a single column chunk: up to 100K
4. Nested struct depth: up to 100 levels
5. Highly asymmetric columns: up to a 10,000:1 size ratio between columns

As the next step, I will polish the benchmark set and send a separate email to the mailing list to ask for community review. Any advice, ideas, or additional benchmark scenarios are welcome.

The rough agenda for the next session is to start discussing the core design sketch of the modular footer. Each individual module will need a more detailed design later.

Looking forward to having more people join the discussion and help shape the solution.

Best,
Jiayi

Jiayi Wang <[email protected]> wrote on Thu, Jun 11, 2026 at 18:09:

> Hi everyone,
>
> The next Parquet footer sync is today Thursday Jun 11th at 10am PT - 1pm
> ET - 7pm CET (in ~1h).
>
> To join the invite, please subscribe to the Parquet calendar:
>
> https://calendar.google.com/calendar/u/0?cid=OTU2YjU0NGRlMGM5MjEwYTZlYzg1OWNhMGNjYTViMTZjNjhlYmRkNmJjYmY0ZmRlNGIzZjFjZTE5MzUwZGU2NkBncm91cC5jYWxlbmRhci5nb29nbGUuY29t
> Video call link: https://meet.google.com/fix-tsgt-ssd
>
> Best,
> Jiayi
>
> Jiayi Wang <[email protected]> wrote on Tue, Jun 9, 2026 at 21:40:
>
>> Hi Antoine,
>>
>> Thank you for pointing this out. I agree with your concern.
>>
>> My intention was not to make decisions during sync calls and only report
>> them afterward.
>> I understand that for a foundational design such as a new
>> Parquet footer, the mailing list should be the primary place where design
>> discussions, trade-offs, objections, and decisions happen, so that everyone
>> can participate asynchronously.
>>
>> I will make sure that any sync calls are used only for clarification,
>> gathering input, and brainstorming, and that no binding decisions are made
>> there. For anything substantial, I will summarize the discussion on the
>> mailing list, raise the concrete proposal there, and leave enough time for
>> feedback before considering a decision.
>>
>> Thanks for the reminder!
>>
>> To clarify the agenda for the next sync session: last time, we proposed
>> establishing a set of footer examples that perform poorly under the current
>> spec. These examples would serve as evaluation cases for the new footer
>> design. In the upcoming session, we will review the proposed footer
>> examples and brainstorm additional scenarios or ideas. After that, we will
>> bring the summary, examples, and open questions to the official mailing
>> list for review and further input from the community.
>>
>> I hope this addresses your concerns.
>>
>> Best,
>> Jiayi
>>
>> Antoine Pitrou <[email protected]> wrote on Tue, Jun 9, 2026 at 16:23:
>>
>>>
>>> Hello Jiayi,
>>>
>>> On 09/06/2026 at 11:24, Jiayi Wang wrote:
>>> >
>>> > Please note that all decisions will be communicated publicly via the
>>> > mailing list, and I will also keep the mailing list updated on progress.
>>>
>>> This is good, but I don't think it's sufficient.
>>>
>>> The "Apache Way" stresses Open Communications:
>>>
>>> """as a virtual organization, the ASF requires all communications
>>> related to code and decision-making to be publicly accessible to ensure
>>> asynchronous collaboration, as necessitated by a globally-distributed
>>> community."""
>>>
>>> (from https://www.apache.org/theapacheway/)
>>>
>>> Making decisions during sync video calls does not allow for asynchronous
>>> collaboration. Designing a new Parquet footer is foundational for the
>>> project, we should not skimp on proper collaboration practices.
>>>
>>> Regards
>>>
>>> Antoine.
>>>
