Hi Peter/Huaxin,

This is a very interesting topic, and thank you for sharing all the documentation. I have a few questions I hope you can clarify:

Does this mean that the three types of indexes (B-Tree, Full-Text, and IVF) can all be addressed through materialized views? Or are there scenarios where dedicated index structures are still necessary? The doc I am referring to is <https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?pli=1&tab=t.0>.

I’m also interested in the current roadmap for secondary indexes. Are there any concrete plans or timelines for their introduction in upcoming releases? Additionally, is there a draft or active pull request for this feature?

I am happy to collaborate on this topic. Thank you in advance for your insights!

Regards,
Vaibhav

On Tue, Jan 13, 2026 at 6:43 AM huaxin gao <[email protected]> wrote:

> Hi Peter,
>
> Thanks for the clarification. I will align the secondary index proposal
> accordingly.
>
> Looking forward to the collaboration!
>
> Best,
> Huaxin
>
> On Mon, Jan 12, 2026 at 2:54 AM Péter Váry <[email protected]>
> wrote:
>
>> Cool!
>> Happy to collaborate on this!
>>
>> > keep only minimal snapshot references in table metadata and move the
>> > richer index definition and lifecycle into catalog-managed index
>> > metadata exposed via the REST APIs.
>>
>> In my second iteration, I moved the snapshot references into the index
>> metadata [1]. This allows the query engine to fetch indexes in parallel
>> with the table metadata using *catalog.listIndexes*, where each returned
>> *BaseIndex* already includes the available table snapshots.
>> With that information, the engine can immediately determine whether a
>> given index is applicable for the query by checking the index type, index
>> columns, and the associated table snapshots.
>> If the engine decides to use a particular index, it can then retrieve the
>> corresponding DetailedIndex, which contains all additional details required
>> by the engine.
>> For Bloom filter indexes specifically, the *IndexSnapshots* could store
>> the correct Puffin file path for each table snapshot in their snapshot
>> properties.
>>
>> [1] - Iceberg indexes / Index Metadata / Snapshot -
>> https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?pli=1&tab=t.0#heading=h.r3lv3a6k06hy
>>
>> On Mon, Jan 12, 2026 at 2:27, huaxin gao <[email protected]> wrote:
>>
>>> Hi Peter,
>>>
>>> Thanks a lot for sharing the proposal in [1] and for the detailed
>>> design. The catalog-managed index framework there looks like a better
>>> long-term direction than keeping full index definitions in table metadata.
>>>
>>> The current Bloom-filter draft describes indexes in table metadata so
>>> planners can discover them during planning and map table snapshots to
>>> Puffin files with Bloom filters, but that wiring can be changed easily to
>>> the catalog-based model in [1]: keep only minimal snapshot references in
>>> table metadata and move the richer index definition and lifecycle into
>>> catalog-managed index metadata exposed via the REST APIs. In that model,
>>> the Bloom-filter file-skipping index would be one concrete `IndexType`
>>> whose data lives in Puffin files, with engines discovering and loading it
>>> through the catalog (`listIndexes`, `loadIndex`, etc.).
>>>
>>> Agree that the Bloom-filter index would be an excellent candidate and a
>>> very good fit as the first index type to implement in this framework, and
>>> the proposal will be updated to follow the catalog-based approach.
>>>
>>> Best,
>>>
>>> Huaxin
>>>
>>> On Fri, Jan 9, 2026 at 11:59 AM Péter Váry <[email protected]>
>>> wrote:
>>>
>>>> Hi Huaxin,
>>>>
>>>> This is a very interesting topic. We’re also working on an index
>>>> proposal [1] that aligns closely with yours in many areas. In an earlier
>>>> iteration, I considered adding index metadata directly to the table
>>>> metadata as well.
>>>> After some back-and-forth, we ultimately moved to a different approach,
>>>> where the catalog exposes an API to fetch the indexes for a given table.
>>>>
>>>> This has several advantages. For example, it avoids increasing the size
>>>> of the table metadata and is more consistent with existing practices where
>>>> UDFs, views, and materialized views each have their own specifications and
>>>> metadata.
>>>>
>>>> After reading your proposal, I think the bloom filter index would be an
>>>> excellent candidate and a very good fit as a first index type to implement,
>>>> helping us evaluate the viability of the metadata approach.
>>>>
>>>> Please take a look and let me know what you think.
>>>> Thanks,
>>>> Peter
>>>>
>>>> [1] -
>>>> https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0
>>>>
>>>> On Thu, Jan 8, 2026 at 17:27, huaxin gao <[email protected]> wrote:
>>>>
>>>>> Hi Iceberg community,
>>>>>
>>>>> I’d like to request feedback on a proposal
>>>>> <https://docs.google.com/document/d/1x-0KT43aTrt8u6EV7EgSietIFQSkGsocqwnBTHPebRU/edit?tab=t.0>
>>>>> to introduce secondary indexes to Apache Iceberg with a narrow,
>>>>> incremental scope.
>>>>>
>>>>> Phase 1 adds file-skipping indexes based on per-column Bloom filters,
>>>>> stored in Puffin and referenced from table metadata so query engines can
>>>>> use them during planning to prune data files. Indexes are advisory-only
>>>>> and snapshot-scoped. The proposal is fully backward compatible: engines
>>>>> that don’t understand the new metadata fields ignore them.
>>>>>
>>>>> I’d appreciate any feedback, questions, or concerns on the overall
>>>>> direction and design.
>>>>>
>>>>> Best,
>>>>>
>>>>> Huaxin
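
As a rough illustration of the engine-side flow Péter describes in the quoted message (list the indexes, filter on index type, index columns, and table snapshot, then read the Puffin path from the snapshot properties), here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `BaseIndex` fields, the `applicable_indexes` and `puffin_path` helpers, and the `puffin-path` property key are hypothetical and are not taken from either proposal or from any existing Iceberg API.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional, Tuple


@dataclass
class BaseIndex:
    """Hypothetical lightweight entry, as a listIndexes-style call might return."""
    name: str
    index_type: str                  # e.g. "bloom-filter" (assumed naming)
    columns: Tuple[str, ...]         # table columns covered by the index
    # Assumed shape: table snapshot id -> snapshot properties for that snapshot
    snapshots: Dict[int, Dict[str, str]] = field(default_factory=dict)


def applicable_indexes(
    indexes: Iterable[BaseIndex],
    index_type: str,
    predicate_columns: Iterable[str],
    snapshot_id: int,
) -> List[BaseIndex]:
    """Keep indexes that match on type, columns, and table snapshot.

    Assumption: an index is usable only if every indexed column appears
    in the query predicates and the index covers the scanned snapshot.
    """
    wanted = set(predicate_columns)
    return [
        idx
        for idx in indexes
        if idx.index_type == index_type
        and bool(idx.columns)
        and set(idx.columns) <= wanted
        and snapshot_id in idx.snapshots
    ]


def puffin_path(index: BaseIndex, snapshot_id: int) -> Optional[str]:
    """Return the Puffin file path recorded for a snapshot, if any.

    "puffin-path" is a made-up property key used only in this sketch.
    """
    return index.snapshots.get(snapshot_id, {}).get("puffin-path")
```

An engine would call `applicable_indexes(...)` with the result of the list call and the snapshot being scanned, and only for the indexes that survive would it go on to fetch the richer `DetailedIndex`. Because indexes are advisory-only, an empty result simply means planning proceeds without them.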
