Hi Huaxin, Peter,

Happy to collaborate on this as well. I added some comments to the document.
A while ago we had a proposal to have a pluggable interface for file filtering which may be relevant for this discussion:
https://docs.google.com/document/d/11o3T7XQVITY_5F9Vbri9lF9oJjDZKjHIso7K8tEaFfY/edit?tab=t.0#heading=h.uqr5wcfm85p7

Thanks,
Guy

On 2026/01/12 10:53:24 Péter Váry wrote:
> Cool!
> Happy to collaborate on this!
>
> > keep only minimal snapshot references in table metadata and move the
> > richer index definition and lifecycle into catalog‑managed index metadata
> > exposed via the REST APIs.
>
> In my second iteration, I moved the snapshot references into the index
> metadata [1]. This allows the query engine to fetch indexes in parallel
> with the table metadata using *catalog.listIndexes*, where each returned
> *BaseIndex* already includes the available table snapshots.
> With that information, the engine can immediately determine whether a given
> index is applicable for the query by checking the index type, index
> columns, and the associated table snapshots.
> If the engine decides to use a particular index, it can then retrieve the
> corresponding DetailedIndex, which contains all additional details required
> by the engine.
> For Bloom filter indexes specifically, the *IndexSnapshots* could store the
> correct Puffin file path for each table snapshot in their snapshot
> properties.
>
> [1] - Iceberg indexes / Index Metadata / Snapshot -
> https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?pli=1&tab=t.0#heading=h.r3lv3a6k06hy
>
> huaxin gao <[email protected]> wrote (on Mon, Jan 12, 2026, 2:27):
>
> > Hi Peter,
> >
> > Thanks a lot for sharing the proposal in [1] and for the detailed design.
> > The catalog‑managed index framework there looks like a better long‑term
> > direction than keeping full index definitions in table metadata.
> >
> > The current Bloom‑filter draft describes indexes in table metadata so
> > planners can discover them during planning and map table snapshots to
> > Puffin files with Bloom filters, but that wiring can be changed easily to
> > the catalog‑based model in [1]: keep only minimal snapshot references in
> > table metadata and move the richer index definition and lifecycle into
> > catalog‑managed index metadata exposed via the REST APIs. In that model,
> > the Bloom‑filter file‑skipping index would be one concrete `IndexType`
> > whose data lives in Puffin files, with engines discovering and loading it
> > through the catalog (`listIndexes`, `loadIndex`, etc.).
> >
> > Agree that the Bloom‑filter index would be an excellent candidate and a
> > very good fit as the first index type to implement in this framework, and
> > the proposal will be updated to follow the catalog‑based approach.
> >
> > Best,
> >
> > Huaxin
> >
> > On Fri, Jan 9, 2026 at 11:59 AM Péter Váry <[email protected]>
> > wrote:
> >
> >> Hi Huaxin,
> >>
> >> This is a very interesting topic. We’re also working on an index proposal
> >> [1] that aligns closely with yours in many areas. In an earlier iteration,
> >> I considered adding index metadata directly to the table metadata as well.
> >> After some back-and-forth, we ultimately moved to a different approach,
> >> where the catalog exposes an API to fetch the indexes for a given table.
> >>
> >> This has several advantages: for example, it avoids increasing the size of
> >> the table metadata and is more consistent with existing practices where
> >> UDFs, views, and materialized views each have their own specifications and
> >> metadata.
> >>
> >> After reading your proposal, I think the bloom filter index would be an
> >> excellent candidate and a very good fit as a first index type to implement,
> >> helping us evaluate the viability of the metadata approach.
> >>
> >> Please take a look and let me know what you think.
> >> Thanks,
> >> Peter
> >>
> >> [1] -
> >> https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0
> >>
> >> huaxin gao <[email protected]> wrote (on Thu, Jan 8, 2026, 17:27):
> >>
> >>> Hi Iceberg community,
> >>>
> >>> I’d like to request feedback on a proposal
> >>> <https://docs.google.com/document/d/1x-0KT43aTrt8u6EV7EgSietIFQSkGsocqwnBTHPebRU/edit?tab=t.0>
> >>> to introduce secondary indexes to Apache Iceberg with a narrow, incremental
> >>> scope.
> >>>
> >>> Phase 1 adds file-skipping indexes based on per-column Bloom filters,
> >>> stored in Puffin and referenced from table metadata so query engines can
> >>> use them during planning to prune data files. Indexes are advisory-only and
> >>> snapshot-scoped. The proposal is fully backward compatible: engines that
> >>> don’t understand the new metadata fields ignore them.
> >>>
> >>> I’d appreciate any feedback, questions, or concerns on the overall
> >>> direction and design.
> >>>
> >>> Best,
> >>>
> >>> Huaxin
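
For readers following the thread, the discovery flow Péter describes (list the indexes for a table, check index type, columns, and table snapshot, then read the Puffin path from the snapshot properties) might look roughly like the sketch below. This is only an illustration under stated assumptions: the Python classes, the in-memory catalog, the `"bloom-filter"` type string, the `"puffin-path"` property key, and the example path are all hypothetical placeholders, not the API from either proposal. The step of loading a DetailedIndex is left out.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


# Hypothetical stand-ins for the IndexSnapshots / BaseIndex concepts in the thread.
@dataclass(frozen=True)
class IndexSnapshot:
    table_snapshot_id: int
    properties: Dict[str, str] = field(default_factory=dict)


@dataclass(frozen=True)
class BaseIndex:
    name: str
    index_type: str
    columns: Tuple[str, ...]
    snapshots: Tuple[IndexSnapshot, ...]


class InMemoryCatalog:
    """Toy catalog; a real one would serve this over the REST APIs."""

    def __init__(self, indexes_by_table: Dict[str, List[BaseIndex]]):
        self._indexes_by_table = indexes_by_table

    def list_indexes(self, table: str) -> List[BaseIndex]:
        return list(self._indexes_by_table.get(table, []))


def find_bloom_puffin_path(
    indexes: List[BaseIndex], column: str, table_snapshot_id: int
) -> Optional[str]:
    """Return the Puffin path of an applicable Bloom filter index, if any."""
    for index in indexes:
        # Applicability check: index type, indexed columns, table snapshot.
        if index.index_type != "bloom-filter" or column not in index.columns:
            continue
        for snapshot in index.snapshots:
            if snapshot.table_snapshot_id == table_snapshot_id:
                # "puffin-path" is an assumed property key for this sketch.
                return snapshot.properties.get("puffin-path")
    return None


catalog = InMemoryCatalog({
    "db.events": [
        BaseIndex(
            name="events_user_id_bloom",
            index_type="bloom-filter",
            columns=("user_id",),
            snapshots=(
                IndexSnapshot(42, {"puffin-path": "s3://bucket/idx/snap-42.puffin"}),
            ),
        )
    ]
})
indexes = catalog.list_indexes("db.events")
path = find_bloom_puffin_path(indexes, "user_id", 42)
```

Because the indexes are advisory-only, a `None` result simply means the engine plans the scan without index-based pruning.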
