Re: Dedicated sync for Iceberg Index Support

huaxin gao Tue, 03 Mar 2026 10:40:41 -0800

Thanks everyone for the great discussion on bloom filters during the
meeting! Here are the highlights:



   - Bloom filters are most useful for high-cardinality columns not in the
   table's sort/partition layout, where min/max stats are ineffective. This is
   a clear gap that existing metadata cannot address.
   - Bloom filters require careful tuning of sizing and false positive rate
   to be effective. A concrete design with FPR analysis would help demonstrate
   when and how they provide significant benefit.
   - To avoid bottlenecking the driver with bloom filter IO during
   planning, collocating per-file bloom filters into Puffin files aligned with
   manifest boundaries was proposed as a way to enable efficient distributed
   planning.
   - The group discussed how bloom filters should fit into the overall
   architecture, as a secondary index or as enhanced file-level metadata (like
   larger column stats). Storage options discussed include Puffin files
   referenced from manifests or a separate column file associated with
   manifests.
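
As a rough illustration of the sizing/FPR point above, here is a minimal sketch using the textbook formula for an optimally sized Bloom filter (bits per value = -ln(p) / (ln 2)^2). The function names are hypothetical and this is not necessarily how the proposal or the POC sizes its filters; it only shows where a figure like "about 1.2 bytes per value at fpp=0.01" comes from.

```python
import math


def bloom_bits_per_value(fpp: float) -> float:
    """Bits per inserted value for an optimally sized classic Bloom filter."""
    if not 0.0 < fpp < 1.0:
        raise ValueError("fpp must be strictly between 0 and 1")
    return -math.log(fpp) / (math.log(2) ** 2)


def bloom_size_bytes(num_values: int, fpp: float) -> int:
    """Approximate payload size in bytes for num_values distinct values."""
    return math.ceil(num_values * bloom_bits_per_value(fpp) / 8)


def optimal_hash_count(fpp: float) -> int:
    """Number of hash functions that minimizes the false positive rate."""
    return max(1, round(-math.log2(fpp)))


if __name__ == "__main__":
    # Hypothetical inputs: 10,000 distinct values per data file.
    for fpp in (0.05, 0.01, 0.001):
        size = bloom_size_bytes(10_000, fpp)
        print(f"fpp={fpp}: {size} bytes per file, k={optimal_hash_count(fpp)}")
```

Under these assumptions the per-file payload grows only logarithmically as the target false positive rate shrinks, so the total index size is driven mostly by the number of distinct values per file and the number of data files.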

Thanks,

Huaxin

On Tue, Mar 3, 2026 at 9:06 AM Steven Wu <[email protected]> wrote:

> > if a column’s default value changes (a schema/metadata-only update), we
> may still need to refresh the index to ensure it returns correct results.
>
> The initial-default value never changes after the column is added to the
> schema. The write-default can change, but that only applies to new rows. I
> am not sure we have a problem here.
>
> On Tue, Mar 3, 2026 at 5:27 AM Péter Váry <[email protected]>
> wrote:
>
>> Thanks to everyone who participated in the community sync about
>> indexes!
>>
>> Here is the recording:
>> https://www.youtube.com/watch?v=pZFJfAlMHsM&list=PLkifVhhWtccwbfBhHk_DGOogxXNtiKvbF
>> Here is the chat log:
>> https://drive.google.com/file/d/1_N1suxhhdHt4aQuoPuLX24KJz32w3qW0/view
>>
>> Added my highlights about the general index discussion to the doc:
>> https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?pli=1&tab=t.8041k7j2n7y3#heading=h.n0hz359alh52
>>
>> A few takeaways from the general index discussion:
>>
>>>
>>>    - We reviewed the options for synchronous and asynchronous index
>>>    updates. We agreed that asynchronous updates should be our primary focus,
>>>    while we expect that synchronous updates could still be valuable in
>>>    certain scenarios. In those cases, we may be able to rely on the catalog
>>>    REST API to ensure that table updates and index updates occur atomically.
>>>
>>>
>>>    - We also touched on writer requirements. We would like to avoid
>>>    requiring extra work from writers, but in some cases this might be
>>>    necessary. Also, although many tables typically have a single writer,
>>>    table maintenance operations still need to be taken into account. We may
>>>    want to introduce a flag that blocks writes unless the writer is capable
>>>    of updating the index as well. Alternatively, we could define a mechanism
>>>    that ensures the table cannot be updated without updating the index.
>>>
>>>
>>>    - Prashant pointed out that we must also consider values stored
>>>    solely in table metadata when computing indexes. For example, if a
>>>    column’s default value changes (a schema/metadata-only update), we may
>>>    still need to refresh the index to ensure it returns correct results.
>>>
>>>
>> In the next sync, I would like to follow up with the vector indexes and,
>> if we have some time, index maintenance.
>>
>> Thanks,
>> Peter
>>
>>
>> huaxin gao <[email protected]> wrote (on Mon, Mar 2, 2026,
>> 4:24):
>>
>>> Thanks Peter for the reminder and agenda!
>>>
>>> Here are some more details for the Bloom index status:
>>>
>>>
>>>    - When it helps: high-cardinality =/IN predicates where min/max
>>>    stats are not selective and many files remain after normal Iceberg
>>>    pruning (“needle in a haystack”).
>>>    - Why it helps vs Parquet row-group Bloom: row-group Bloom still
>>>    requires opening each candidate data file (footer/Bloom pages). Puffin
>>>    Bloom is consulted during planning, so it can prune files before
>>>    scheduling scan tasks and opening most files.
>>>    - Savings vs cost:
>>>       - Savings: plannedFiles → afterBloom (files avoided)
>>>       - Cost: planner reads statsFiles/statsBytes/bloomPayloadBytes
>>>       (Puffin footer + selective blob slices)
>>>       - Example (POC benchmark): plannedFiles=658, afterBloom=1
>>>       (needle), with index overhead statsFiles=1, statsBytes≈17MB,
>>>       bloomPayloadBytes≈16.8MB. The goal is to show “avoided per-file
>>>       opens/tasks” outweighs “index read”. This benchmark is intentionally
>>>       scoped to the workload the feature targets; it’s not meant to claim
>>>       Bloom skipping helps all queries, which is why the feature is opt-in.
>>>       Users enable this when they see selective point lookups over many
>>>       files and want to reduce file opens/task scheduling.
>>>    - Sizing: for fpp=0.01, an optimally sized Bloom filter needs about 1.2
>>>    bytes (~9.6 bits) per inserted value. Example: ~10,000 values/file →
>>>    ~12 KB Bloom payload per data file (plus small Puffin overhead).
>>>    - Lifecycle/maintenance: incremental shards for new files;
>>>    missing/behind is safe (no pruning); shard compaction + snapshot
>>>    expiration/orphan cleanup to bound artifacts.
>>>    - Writer expectations: async maintenance is primary; inline is
>>>    optional (inline writers may not know the final number of inserted values
>>>    up front, so they can size at file close or use a scalable/growing Bloom
>>>    filter); any error/missing/stale index ⇒ fallback (correctness
>>>    unchanged). Feature is opt-in for the targeted workload.
>>>
>>> Looking forward to the sync!
>>>
>>> Best,
>>>
>>> Huaxin
>>>
>>> On Sat, Feb 28, 2026 at 3:53 AM Péter Váry <[email protected]>
>>> wrote:
>>>
>>>> Please note that the next *Secondary Index Sync* will take place on *March
>>>> 2nd, 9:00-10:00 AM PT*.
>>>>
>>>> *Proposed agenda*:
>>>>
>>>>    - Discussion of potential use‑cases
>>>>       - Primary Key index for Flink equality‑delete resolution
>>>>       - Secondary data layout
>>>>          - Containing index
>>>>          - Alternative query plans
>>>>       - Vector index
>>>>    - Discussion of the two alternative approaches for metadata
>>>>    placement: keeping index metadata inside the table metadata vs.
>>>>    managing it externally through an Index Catalog
>>>>    - Bloom filter index status update
>>>>       - Performance justification: when this helps (high-cardinality =
>>>>       / IN, many data files, high object-store latency) and how it
>>>>       differs from Parquet row-group Bloom filters (which still require
>>>>       opening the data file).
>>>>       - Cost / scalability: rough sizing (Bloom blob size per file,
>>>>       Puffin file size), the planning cost trade-off (driver index reads vs
>>>>       executor file opens), and mitigations via caching.
>>>>       - Lifecycle / maintenance: incremental production as new data
>>>>       files arrive, behavior when the index is missing/behind, and
>>>>       sharding/compaction plus cleanup to avoid accumulating too many small
>>>>       Puffin files over time.
>>>>       - Writer expectations: inline (optional) vs asynchronous
>>>>       (primary) index creation.
>>>>
>>>> Looking forward to diving into this topic together.
>>>>
>>>> See you all there,
>>>> Peter
>>>>
>>>> Péter Váry <[email protected]> wrote (on Wed, Feb 25, 2026,
>>>> 10:04):
>>>>
>>>>> Dan kindly set up a dedicated public Slack channel (*#indexes*) for
>>>>> the Secondary Index discussion.
>>>>> You can find it here:
>>>>> https://apache-iceberg.slack.com/archives/C0AFDSU3EUU
>>>>> Feel free to join if you’d like to participate in the discussion or
>>>>> simply follow along.
>>>>>
>>>>> Thanks,
>>>>> Peter
>>>>>
>>>>> Péter Váry <[email protected]> wrote (on Tue, Feb 24, 2026,
>>>>> 12:52):
>>>>>
>>>>>> We had an extended discussion on Slack with Dan, Steven, and Yufei
>>>>>> about where index metadata should live, in particular whether it should
>>>>>> be stored directly in the table metadata or maintained in a dedicated
>>>>>> index catalog. I tried to capture this discussion in the Layout
>>>>>> <https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?pli=1&tab=t.0#heading=h.4oz3yd6ngr3>
>>>>>> section of the document.
>>>>>>
>>>>>> Once the decision is made, this section can be shortened, but for now
>>>>>> it is intentionally more detailed so that everyone can see the arguments
>>>>>> that were discussed and so that those who could not participate
>>>>>> synchronously can still follow and provide feedback offline.
>>>>>>
>>>>>> In short, we are currently *leaning toward storing index metadata in
>>>>>> its own catalog*, while allowing REST catalogs to expose a composite
>>>>>> endpoint that returns both table and index metadata in a single round
>>>>>> trip. This is similar in spirit to the universal load endpoint discussed
>>>>>> in the context of materialized view loading.
>>>>>>
>>>>>> Thanks,
>>>>>> Peter
>>>>>>
>>>>>> Péter Váry <[email protected]> wrote (on Thu, Feb 19, 2026,
>>>>>> 14:06):
>>>>>>
>>>>>>> Thanks Huaxin for posting the recording and the meeting notes.
>>>>>>>
>>>>>>> I used this time to also address the questions collected during the
>>>>>>> sync:
>>>>>>>
>>>>>>>    - Collected some representative use cases. See the example
>>>>>>>    use-cases
>>>>>>>    <https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?pli=1&tab=t.0#heading=h.i4gt8za99j9d>
>>>>>>>    paragraph. Anyone should feel free to suggest their own.
>>>>>>>    - Collected my thoughts about the writer requirements. See the writer
>>>>>>>    requirements
>>>>>>>    <https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?pli=1&tab=t.0#heading=h.4b1p8r8nmfg1>
>>>>>>>    paragraph.
>>>>>>>    - Centralized the index maintenance related parts. See the index
>>>>>>>    maintenance
>>>>>>>    <https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?pli=1&tab=t.0#heading=h.hw2nt44i0k8q>
>>>>>>>    paragraph.
>>>>>>>
>>>>>>> It might be a bit premature, but I created a PR
>>>>>>> <https://github.com/apache/iceberg/pull/15101> with the
>>>>>>> proposed index-catalog-related changes, so those who are more
>>>>>>> code-oriented can take a look at it too.
>>>>>>>
>>>>>>> huaxin gao <[email protected]> wrote (on Thu, Feb 19, 2026,
>>>>>>> 5:34):
>>>>>>>
>>>>>>>> Hi Everyone,
>>>>>>>>
>>>>>>>> Here are the recording and notes from the Iceberg Index Support
>>>>>>>> Sync on 2/11.
>>>>>>>>
>>>>>>>> Recording: https://www.youtube.com/watch?v=3sFfQ0A50yk
>>>>>>>>
>>>>>>>> Notes:
>>>>>>>> https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?tab=t.8041k7j2n7y3
>>>>>>>>
>>>>>>>> The meeting will move to biweekly, Mondays 9–10am PST, starting
>>>>>>>> March 2.
>>>>>>>>
>>>>>>>> Since the sync, I updated the Bloom skipping index proposal
>>>>>>>> <https://docs.google.com/document/d/1x-0KT43aTrt8u6EV7EgSietIFQSkGsocqwnBTHPebRU/edit?tab=t.0#heading=h.5r5kl6k3fqwu>
>>>>>>>> to address the discussion questions, specifically:
>>>>>>>>
>>>>>>>>
>>>>>>>>    - Performance justification: when this helps (high-cardinality
>>>>>>>>    = / IN, many data files, high object-store latency) and how it
>>>>>>>>    differs from Parquet row-group Bloom filters (which still require
>>>>>>>>    opening the data file).
>>>>>>>>    - Cost / scalability: rough sizing (Bloom blob size per file,
>>>>>>>>    Puffin file size), the planning cost trade-off (driver index reads
>>>>>>>>    vs executor file opens), and mitigations via caching.
>>>>>>>>    - Lifecycle / maintenance: incremental production as new data
>>>>>>>>    files arrive, behavior when the index is missing/behind, and
>>>>>>>>    sharding/compaction plus cleanup to avoid accumulating too many
>>>>>>>>    small Puffin files over time.
>>>>>>>>    - Writer expectations: inline (optional) vs asynchronous
>>>>>>>>    (primary) index creation.
>>>>>>>>
>>>>>>>> I also implemented a Spark 4.1 POC
>>>>>>>> <https://github.com/apache/iceberg/pull/15311> and a local
>>>>>>>> benchmark to quantify both the pruning impact (plannedFiles →
>>>>>>>> afterBloom) and the index read overhead (statsFiles, statsBytes,
>>>>>>>> bloomPayloadBytes) for point predicates on high-cardinality columns.
>>>>>>>> Please take a look and let me know if you have any questions or
>>>>>>>> feedback.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Huaxin
>>>>>>>>
>>>>>>>> On Tue, Feb 10, 2026 at 1:43 PM huaxin gao <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Reminder for tomorrow's sync on Iceberg Index Support.
>>>>>>>>>
>>>>>>>>> Wednesday: Feb. 11 9:00 – 10:00am
>>>>>>>>> Time zone: America/Los_Angeles
>>>>>>>>> Google Meet joining info
>>>>>>>>> Video call link: meet.google.com/nsp-ctyr-khk
>>>>>>>>> Design doc:
>>>>>>>>>
>>>>>>>>> https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?tab=t.0#heading=h.hs6r9d26w1y2
>>>>>>>>>
>>>>>>>>> https://docs.google.com/document/d/1x-0KT43aTrt8u6EV7EgSietIFQSkGsocqwnBTHPebRU/edit?tab=t.0#heading=h.qouk73o4jxx7
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Huaxin
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Feb 3, 2026 at 10:52 PM Péter Váry <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks Huaxin and Steven for organizing this. Looking forward to
>>>>>>>>>> meeting you all next week!
>>>>>>>>>>
>>>>>>>>>> On Wed, Feb 4, 2026, 02:48 Steven Wu <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> We set up the dev calendar event with a new google meet link.
>>>>>>>>>>> Please ignore the link from Huaxin's original email.
>>>>>>>>>>>
>>>>>>>>>>> The dev calendar has the correct info (including the new meeting
>>>>>>>>>>> link)
>>>>>>>>>>>
>>>>>>>>>>> Iceberg Index Support Sync
>>>>>>>>>>> Wednesday, February 11 · 9:00 – 10:00am
>>>>>>>>>>> Time zone: America/Los_Angeles
>>>>>>>>>>> Google Meet joining info
>>>>>>>>>>> Video call link: https://meet.google.com/nsp-ctyr-khk
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Feb 3, 2026 at 5:08 PM huaxin gao <
>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Sorry, I meant PST (not EST) :)
>>>>>>>>>>>> Looking forward to the discussion!
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Feb 3, 2026 at 4:58 PM Shawn Chang <
>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Huaxin,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for starting the sync!
>>>>>>>>>>>>>
>>>>>>>>>>>>> The meeting seems to be 9-10AM PST on the dev events calendar
>>>>>>>>>>>>> <https://calendar.google.com/calendar/u/0?cid=MzkwNWQ0OTJmMWI0NTBiYTA3MTJmMmFlNmFmYTc2ZWI3NTdmMTNkODUyMjBjYzAzYWE0NTI3ODg1YWRjNTYyOUBncm91cC5jYWxlbmRhci5nb29nbGUuY29t>,
>>>>>>>>>>>>> not EST. Maybe it's a typo?
>>>>>>>>>>>>> Otherwise, looking forward to the discussion!
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best,
>>>>>>>>>>>>> Shawn
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Feb 3, 2026 at 9:18 AM huaxin gao <
>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>> I'd like to start a dedicated sync to discuss Iceberg Index
>>>>>>>>>>>>>> support. Here is the existing discussion thread:
>>>>>>>>>>>>>> https://lists.apache.org/thread/fzqk3jjf0xpj5m4cfqb3v4c65p0t04ty
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> To ground the discussion, here are the two proposals:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    - Peter's proposal
>>>>>>>>>>>>>>    <https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?tab=t.0#heading=h.hs6r9d26w1y2>
>>>>>>>>>>>>>>    (overall index support)
>>>>>>>>>>>>>>    - My proposal
>>>>>>>>>>>>>>    <https://docs.google.com/document/d/1x-0KT43aTrt8u6EV7EgSietIFQSkGsocqwnBTHPebRU/edit?tab=t.0#heading=h.qouk73o4jxx7>
>>>>>>>>>>>>>>    (bloom filter skipping index)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Time slot: Every 3 weeks, Wednesdays at 9 AM to 10 AM EST,
>>>>>>>>>>>>>> starting next Wednesday (2/11). After FileFormat sync finishes,
>>>>>>>>>>>>>> we plan to use that slot and switch to every other Monday, 9 AM
>>>>>>>>>>>>>> to 10 AM EST.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Meet link: https://meet.google.com/fjn-tyze-mko
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Huaxin
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
