RE: Re: [DISCUSS] Secondary Indexes (Phase 1): Bloom filter skipping index (Puffin, snapshot-scoped)

Guy Khazma Mon, 12 Jan 2026 12:10:05 -0800

Hi Huaxin, Peter,

Happy to collaborate on this as well.
I added some comments to the document.
A while ago we proposed a pluggable interface for file filtering, which
may be relevant to this discussion:
https://docs.google.com/document/d/11o3T7XQVITY_5F9Vbri9lF9oJjDZKjHIso7K8tEaFfY/edit?tab=t.0#heading=h.uqr5wcfm85p7



Thanks,
Guy

On 2026/01/12 10:53:24 Péter Váry wrote:
> Cool!
> Happy to collaborate on this!
>
> > keep only minimal snapshot references in table metadata and move the
> > richer index definition and lifecycle into catalog‑managed index
> > metadata exposed via the REST APIs.
>
> In my second iteration, I moved the snapshot references into the index
> metadata [1]. This allows the query engine to fetch indexes in parallel
> with the table metadata using *catalog.listIndexes*, where each returned
> *BaseIndex* already includes the available table snapshots.
> With that information, the engine can immediately determine whether a
> given index is applicable for the query by checking the index type,
> index columns, and the associated table snapshots.
> If the engine decides to use a particular index, it can then retrieve the
> corresponding DetailedIndex, which contains all additional details
> required by the engine.
> For Bloom filter indexes specifically, the *IndexSnapshots* could store
> the correct Puffin file path for each table snapshot in their snapshot
> properties.
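
A minimal sketch of the lookup flow described above, to make the two-step
discovery concrete. Everything here is an assumption for illustration: the
Python class shapes, the `list_indexes` / `load_index` method names, the
"bloom-filter" type string, and the "puffin-path" property key are
hypothetical stand-ins, not the API defined in the proposal or in Iceberg.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class BaseIndex:
    # Lightweight entry, as a hypothetical list_indexes call might return it.
    name: str
    index_type: str
    columns: tuple
    # table snapshot id -> snapshot properties (assumed to hold the Puffin path)
    snapshots: dict = field(default_factory=dict)


@dataclass
class DetailedIndex:
    # Full index details, fetched only once an index has been chosen.
    base: BaseIndex
    properties: dict = field(default_factory=dict)


class InMemoryCatalog:
    """Stand-in for a catalog; it does not model the Iceberg REST API."""

    def __init__(self, indexes):
        self._indexes = {ix.name: ix for ix in indexes}

    def list_indexes(self, table):
        return list(self._indexes.values())

    def load_index(self, table, name):
        return DetailedIndex(self._indexes[name])


def find_bloom_puffin_path(catalog, table, column, snapshot_id) -> Optional[str]:
    """Return the Puffin path of an applicable Bloom filter index, if any."""
    for ix in catalog.list_indexes(table):
        # Applicability check: index type, index columns, table snapshot.
        applicable = (
            ix.index_type == "bloom-filter"
            and column in ix.columns
            and snapshot_id in ix.snapshots
        )
        if applicable:
            detailed = catalog.load_index(table, ix.name)
            # "puffin-path" is a hypothetical snapshot property key.
            return detailed.base.snapshots[snapshot_id].get("puffin-path")
    return None
```

In this sketch the engine only pays for `load_index` after the cheap
applicability check on the listed entries has succeeded.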
>
> [1] - Iceberg indexes / Index Metadata / Snapshot -
> https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?pli=1&tab=t.0#heading=h.r3lv3a6k06hy
>
> huaxin gao <[email protected]> wrote (on Mon, Jan 12, 2026, 2:27):
>
> > Hi Peter,
> >
> >
> > Thanks a lot for sharing the proposal in [1] and for the detailed
> > design. The catalog‑managed index framework there looks like a better
> > long‑term direction than keeping full index definitions in table
> > metadata.
> >
> >
> > The current Bloom‑filter draft describes indexes in table metadata so
> > planners can discover them during planning and map table snapshots to
> > Puffin files with Bloom filters, but that wiring can be changed easily
> > to the catalog‑based model in [1]: keep only minimal snapshot
> > references in table metadata and move the richer index definition and
> > lifecycle into catalog‑managed index metadata exposed via the REST
> > APIs. In that model, the Bloom‑filter file‑skipping index would be one
> > concrete `IndexType` whose data lives in Puffin files, with engines
> > discovering and loading it through the catalog (`listIndexes`,
> > `loadIndex`, etc.).
> >
> >
> > Agree that the Bloom‑filter index would be an excellent candidate and a
> > very good fit as the first index type to implement in this framework,
> > and the proposal will be updated to follow the catalog‑based approach.
> >
> >
> > Best,
> >
> > Huaxin
> >
> >
> >
> >
> >
> > On Fri, Jan 9, 2026 at 11:59 AM Péter Váry <[email protected]>
> > wrote:
> >
> >> Hi Huaxin,
> >>
> >> This is a very interesting topic. We’re also working on an index
> >> proposal [1] that aligns closely with yours in many areas. In an
> >> earlier iteration, I considered adding index metadata directly to the
> >> table metadata as well. After some back-and-forth, we ultimately moved
> >> to a different approach, where the catalog exposes an API to fetch the
> >> indexes for a given table.
> >>
> >> This has several advantages. For example, it avoids increasing the
> >> size of the table metadata and is more consistent with existing
> >> practices where UDFs, views, and materialized views each have their
> >> own specifications and metadata.
> >>
> >> After reading your proposal, I think the bloom filter index would be an
> >> excellent candidate and a very good fit as a first index type to
> >> implement, helping us evaluate the viability of the metadata approach.
> >>
> >> Please take a look and let me know what you think.
> >> Thanks,
> >> Peter
> >>
> >> [1] -
> >> https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0
> >>
> >>
> >> huaxin gao <[email protected]> wrote (on Thu, Jan 8, 2026, 17:27):
> >>
> >>> Hi Iceberg community,
> >>>
> >>> I’d like to request feedback on a proposal
> >>> <https://docs.google.com/document/d/1x-0KT43aTrt8u6EV7EgSietIFQSkGsocqwnBTHPebRU/edit?tab=t.0>
> >>> to introduce secondary indexes to Apache Iceberg with a narrow,
> >>> incremental scope.
> >>>
> >>> Phase 1 adds file-skipping indexes based on per-column Bloom filters,
> >>> stored in Puffin and referenced from table metadata so query engines
> >>> can use them during planning to prune data files. Indexes are
> >>> advisory-only and snapshot-scoped. The proposal is fully backward
> >>> compatible: engines that don’t understand the new metadata fields
> >>> ignore them.
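
To illustrate the file-skipping idea in the paragraph above, here is a small
sketch of planning-time pruning with per-file Bloom filters for an equality
predicate. It is a toy under stated assumptions: the `BloomFilter` class, its
sizing and hashing, and the `prune_files` helper are hypothetical, and the
Puffin blob layout and Iceberg planning APIs are not modeled. The
"advisory-only" property is represented by keeping any file that has no
filter.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter; parameters are illustrative, not from the proposal."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit set

    def _positions(self, value):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


def prune_files(data_files, filters, value):
    """Keep files that may contain `value` for a `column = value` predicate."""
    kept = []
    for path in data_files:
        bf = filters.get(path)
        # Advisory-only: a file without a filter is never skipped.
        if bf is None or bf.might_contain(value):
            kept.append(path)
    return kept
```

Because a Bloom filter has no false negatives, a file is skipped only when
the value is definitely absent; false positives just mean an extra file is
read, so query results are unaffected.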
> >>>
> >>> I’d appreciate any feedback, questions, or concerns on the overall
> >>> direction and design.
> >>>
> >>> Best,
> >>>
> >>> Huaxin
> >>>
> >>
>
