Re: [DISCUSS] Secondary Indexes (Phase 1): Bloom filter skipping index (Puffin, snapshot-scoped)

Vaibhav Kumar Tue, 13 Jan 2026 03:22:29 -0800

Hi Peter/Huaxin,

This is a very interesting topic—thank you for sharing all the
documentation. I have a few questions I hope you can clarify:


Does this mean that the three types of indexes (B-Tree, Full-Text, and
IVF) can all be addressed through the use of materialized views? Or are
there scenarios where dedicated index structures are still necessary? I am
referring to this doc:
<https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?pli=1&tab=t.0>

I’m also interested in the current roadmap for secondary indexes. Are there
any concrete plans or timelines for their introduction in upcoming
releases? Additionally, is there a draft or active pull request for this
feature? I am happy to collaborate on this topic.

Thank you in advance for your insights!

Regards,
Vaibhav


On Tue, Jan 13, 2026 at 6:43 AM huaxin gao <[email protected]> wrote:

> Hi Peter,
>
> Thanks for the clarification. I will align the secondary index proposal
> accordingly.
>
> Looking forward to the collaboration!
>
> Best,
> Huaxin
>
> On Mon, Jan 12, 2026 at 2:54 AM Péter Váry <[email protected]>
> wrote:
>
>> Cool!
>> Happy to collaborate on this!
>>
>> > keep only minimal snapshot references in table metadata and move the
>> richer index definition and lifecycle into catalog‑managed index metadata
>> exposed via the REST APIs.
>>
>> In my second iteration, I moved the snapshot references into the index
>> metadata [1]. This allows the query engine to fetch indexes in parallel
>> with the table metadata using *catalog.listIndexes*, where each returned
>> *BaseIndex* already includes the available table snapshots.
>> With that information, the engine can immediately determine whether a
>> given index is applicable for the query by checking the index type, index
>> columns, and the associated table snapshots.
>> If the engine decides to use a particular index, it can then retrieve the
>> corresponding DetailedIndex, which contains all additional details required
>> by the engine.
>> For Bloom filter indexes specifically, the *IndexSnapshots* could store
>> the correct Puffin file path for each table snapshot in their snapshot
>> properties.
>>
>> [1] - Iceberg indexes / Index Metadata / Snapshot -
>> https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?pli=1&tab=t.0#heading=h.r3lv3a6k06hy
>>
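As a rough illustration of the applicability check described above (index type, index columns, associated table snapshots), here is a minimal Python sketch. It is not the Iceberg API: `BaseIndexSketch`, its fields, the `"bloom-filter"` label, and `applicable_indexes` are hypothetical names standing in for whatever `catalog.listIndexes` and `BaseIndex` end up exposing.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class BaseIndexSketch:
    """Hypothetical stand-in for a lightweight entry returned by listIndexes."""
    name: str
    index_type: str                     # assumed label, e.g. "bloom-filter"
    columns: Tuple[str, ...]            # columns the index was built on
    table_snapshot_ids: FrozenSet[int]  # table snapshots the index covers


def applicable_indexes(
    indexes: List[BaseIndexSketch],
    index_type: str,
    column: str,
    snapshot_id: int,
) -> List[BaseIndexSketch]:
    """Keep indexes whose type, columns, and covered snapshots match the query."""
    return [
        idx for idx in indexes
        if idx.index_type == index_type
        and column in idx.columns
        and snapshot_id in idx.table_snapshot_ids
    ]


# Pretend this list came back from the catalog alongside the table metadata.
listed = [
    BaseIndexSketch("bf_user_id", "bloom-filter", ("user_id",), frozenset({101, 102})),
    BaseIndexSketch("bf_email", "bloom-filter", ("email",), frozenset({101})),
    BaseIndexSketch("bt_user_id", "b-tree", ("user_id",), frozenset({100})),
]

# Query reads snapshot 102 with an equality predicate on user_id.
chosen = applicable_indexes(listed, "bloom-filter", "user_id", 102)
print([idx.name for idx in chosen])
```

Only the indexes that survive this cheap filter would need the second, more expensive call that loads the detailed index metadata.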
>> huaxin gao <[email protected]> wrote (on Mon, Jan 12, 2026,
>> 2:27):
>>
>>> Hi Peter,
>>>
>>>
>>> Thanks a lot for sharing the proposal in [1] and for the detailed
>>> design. The catalog‑managed index framework there looks like a better
>>> long‑term direction than keeping full index definitions in table metadata.
>>>
>>>
>>> The current Bloom‑filter draft describes indexes in table metadata so
>>> planners can discover them during planning and map table snapshots to
>>> Puffin files with Bloom filters, but that wiring can be changed easily to
>>> the catalog‑based model in [1]: keep only minimal snapshot references in
>>> table metadata and move the richer index definition and lifecycle into
>>> catalog‑managed index metadata exposed via the REST APIs. In that model,
>>> the Bloom‑filter file‑skipping index would be one concrete `IndexType`
>>> whose data lives in Puffin files, with engines discovering and loading it
>>> through the catalog (`listIndexes`, `loadIndex`, etc.).
>>>
>>>
>>> I agree that the Bloom‑filter index would be an excellent candidate and
>>> a very good fit as the first index type to implement in this framework.
>>> I will update the proposal to follow the catalog‑based approach.
>>>
>>>
>>> Best,
>>>
>>> Huaxin
>>>
>>> On Fri, Jan 9, 2026 at 11:59 AM Péter Váry <[email protected]>
>>> wrote:
>>>
>>>> Hi Huaxin,
>>>>
>>>> This is a very interesting topic. We’re also working on an index
>>>> proposal [1] that aligns closely with yours in many areas. In an earlier
>>>> iteration, I considered adding index metadata directly to the table
>>>> metadata as well. After some back-and-forth, we ultimately moved to a
>>>> different approach, where the catalog exposes an API to fetch the indexes
>>>> for a given table.
>>>>
>>>> This has several advantages—for example, it avoids increasing the size
>>>> of the table metadata and is more consistent with existing practices where
>>>> UDFs, views, and materialized views each have their own specifications and
>>>> metadata.
>>>>
>>>> After reading your proposal, I think the bloom filter index would be an
>>>> excellent candidate and a very good fit as a first index type to implement,
>>>> helping us evaluate the viability of the metadata approach.
>>>>
>>>> Please take a look and let me know what you think.
>>>> Thanks,
>>>> Peter
>>>>
>>>> [1] -
>>>> https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0
>>>>
>>>>
>>>> huaxin gao <[email protected]> wrote (on Thu, Jan 8, 2026,
>>>> 17:27):
>>>>
>>>>> Hi Iceberg community,
>>>>>
>>>>> I’d like to request feedback on a proposal
>>>>> <https://docs.google.com/document/d/1x-0KT43aTrt8u6EV7EgSietIFQSkGsocqwnBTHPebRU/edit?tab=t.0>
>>>>> to introduce secondary indexes to Apache Iceberg with a narrow,
>>>>> incremental scope.
>>>>>
>>>>> Phase 1 adds file-skipping indexes based on per-column Bloom filters,
>>>>> stored in Puffin and referenced from table metadata so query engines can
>>>>> use them during planning to prune data files. Indexes are advisory-only
>>>>> and snapshot-scoped. The proposal is fully backward compatible: engines that
>>>>> don’t understand the new metadata fields ignore them.
>>>>>
>>>>> I’d appreciate any feedback, questions, or concerns on the overall
>>>>> direction and design.
>>>>>
>>>>> Best,
>>>>>
>>>>> Huaxin
>>>>>
>>>>
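To make the Phase 1 idea from the original proposal concrete, here is a minimal Python sketch of file skipping with per-file Bloom filters. It is a toy under stated assumptions: the `BloomFilter` class, its SHA-256-based hashing, the sizing defaults, the file paths, and `prune_files` are all hypothetical, and nothing here reflects the Puffin blob layout or the filter parameters in the proposal.

```python
import hashlib
from typing import Dict, Iterable, List


class BloomFilter:
    """Toy Bloom filter: k bit positions per value, derived from SHA-256."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 4) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit set

    def _positions(self, value: str) -> Iterable[int]:
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value: str) -> None:
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


def build_filter(values: Iterable[str]) -> BloomFilter:
    bf = BloomFilter()
    for v in values:
        bf.add(v)
    return bf


def prune_files(all_files: List[str], filters: Dict[str, BloomFilter], value: str) -> List[str]:
    """Return the files that must still be scanned for `column = value`.

    The index is advisory: a file with no filter is always kept.
    """
    kept = []
    for path in all_files:
        bf = filters.get(path)
        if bf is None or bf.might_contain(value):
            kept.append(path)
    return kept


# Hypothetical column values per data file; file-d has no filter built yet.
column_values = {
    "data/file-a.parquet": ["alice", "bob"],
    "data/file-b.parquet": ["carol", "dave"],
    "data/file-c.parquet": ["erin", "alice"],
}
all_files = list(column_values) + ["data/file-d.parquet"]
filters = {path: build_filter(vals) for path, vals in column_values.items()}

to_scan = prune_files(all_files, filters, "alice")
print(to_scan)
```

A Bloom filter can return false positives but never false negatives, so pruning this way can only skip files that certainly lack the value. Files without a filter, or engines that ignore the index entirely, still produce correct results and simply scan more data.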
