Thanks everyone for the great discussion on bloom filters during the meeting! Here are the highlights:
- Bloom filters are most useful for high-cardinality columns not in the table's sort/partition layout, where min/max stats are ineffective. This is a clear gap that existing metadata cannot address.
- Bloom filters require careful tuning of sizing and false positive rate to be effective. A concrete design with FPR analysis would help demonstrate when and how they provide significant benefit.
- To avoid bottlenecking the driver with bloom filter IO during planning, collocating per-file bloom filters into Puffin files aligned with manifest boundaries was proposed as a way to enable efficient distributed planning.
- The group discussed how bloom filters should fit into the overall architecture: as a secondary index or as enhanced file-level metadata (like larger column stats). Storage options discussed include Puffin files referenced from manifests or a separate column file associated with manifests.

Thanks,
Huaxin

On Tue, Mar 3, 2026 at 9:06 AM Steven Wu <[email protected]> wrote:

> > if a column’s default value changes (a schema/metadata-only update), we may still need to refresh the index to ensure it returns correct results.
>
> The initial-default value never changes after the column is added to the schema. The write-default can change, but that only applies to new rows. I am not sure if we have a problem here.
>
> On Tue, Mar 3, 2026 at 5:27 AM Péter Váry <[email protected]> wrote:
>
>> Thanks to everyone who participated in the community sync about indexes!
>>
>> Here is the recording:
>> https://www.youtube.com/watch?v=pZFJfAlMHsM&list=PLkifVhhWtccwbfBhHk_DGOogxXNtiKvbF
>> Here is the chat log:
>> https://drive.google.com/file/d/1_N1suxhhdHt4aQuoPuLX24KJz32w3qW0/view
>>
>> Added my highlights about the general index discussion to the doc:
>> https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?pli=1&tab=t.8041k7j2n7y3#heading=h.n0hz359alh52
>>
>> A few takeaways from the general index discussion:
>>
>> - We reviewed the options for synchronous and asynchronous index updates. We agreed that asynchronous updates should be our primary focus, while we expect that synchronous updates could still be valuable in certain scenarios. In those cases, we may be able to rely on the catalog REST API to ensure that table updates and index updates occur atomically.
>> - We also touched on writer requirements. We would like to avoid requiring extra work from writers, but in some cases this might be necessary. Also, although many tables typically have a single writer, table maintenance operations still need to be taken into account. We may want to introduce a flag that blocks writes unless the writer is capable of updating the index as well. Alternatively, we could define a mechanism that ensures the table cannot be updated without updating the index.
>> - Prashant pointed out that we must also consider values stored solely in table metadata when computing indexes. For example, if a column’s default value changes (a schema/metadata-only update), we may still need to refresh the index to ensure it returns correct results.
>>
>> In the next sync, I would like to follow up with the vector indexes and, if we have some time, Index Maintenance.
>>
>> Thanks,
>> Peter
>>
>> On Mon, Mar 2, 2026 at 4:24, huaxin gao <[email protected]> wrote:
>>
>>> Thanks Peter for the reminder and agenda!
>>>
>>> Here are some more details for the Bloom index status:
>>>
>>> - When it helps: high-cardinality =/IN predicates where min/max stats are not selective and many files remain after normal Iceberg pruning (“needle in a haystack”).
>>> - Why it helps vs Parquet row-group Bloom: row-group Bloom still requires opening each candidate data file (footer/Bloom pages). Puffin Bloom is consulted during planning, so it can prune files before scheduling scan tasks and opening most files.
>>> - Savings vs cost:
>>>   - Savings: plannedFiles → afterBloom (files avoided)
>>>   - Cost: planner reads statsFiles/statsBytes/bloomPayloadBytes (Puffin footer + selective blob slices)
>>>   - Example (POC benchmark): plannedFiles=658, afterBloom=1 (needle), with index overhead statsFiles=1, statsBytes≈17MB, bloomPayloadBytes≈16.8MB. The goal is to show that “avoided per-file opens/tasks” outweighs “index read”. This benchmark is intentionally scoped to the workload the feature targets; it’s not meant to claim Bloom skipping helps all queries, which is why the feature is opt-in. Users enable this when they see selective point lookups over many files and want to reduce file opens/task scheduling.
>>> - Sizing: for fpp=0.01, an optimally sized Bloom filter needs about 1.2 bytes (≈9.6 bits) per inserted value. Example: ~10,000 values/file → ~12 KB Bloom payload per data file (plus small Puffin overhead).
>>> - Lifecycle/maintenance: incremental shards for new files; missing/behind is safe (no pruning); shard compaction + snapshot expiration/orphan cleanup to bound artifacts.
>>> - Writer expectations: async maintenance is primary; inline is optional (inline writers may not know the final number of inserted values up front, so they can size at file close or use a scalable/growing Bloom filter); any error/missing/stale index ⇒ fallback (correctness unchanged). Feature is opt-in for the targeted workload.
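
For readers who want to check the sizing figure and the "missing/behind is safe" behaviour quoted above, here is a minimal Python sketch. It is only an illustration under stated assumptions: `bloom_bits_per_value`, `bloom_payload_bytes`, `ToyBloom`, and `prune_files` are hypothetical names, the SHA-256 double hashing is my own choice, and none of this is the Puffin blob layout or the Iceberg/Spark API from the proposal or the POC. The sizing uses the textbook formula for a classic Bloom filter, bits per value = -ln(fpp) / (ln 2)^2, which gives roughly 9.6 bits (about 1.2 bytes) at fpp=0.01.

```python
import hashlib
import math


def bloom_bits_per_value(fpp: float) -> float:
    """Bits per inserted value for an optimally sized classic Bloom filter."""
    if not 0.0 < fpp < 1.0:
        raise ValueError("fpp must be strictly between 0 and 1")
    return -math.log(fpp) / (math.log(2) ** 2)


def bloom_payload_bytes(num_values: int, fpp: float) -> int:
    """Approximate Bloom payload size in bytes (excludes any container overhead)."""
    bits = math.ceil(num_values * bloom_bits_per_value(fpp))
    return math.ceil(bits / 8)


class ToyBloom:
    """Illustrative Bloom filter only; NOT the blob format from the proposal."""

    def __init__(self, expected_values: int, fpp: float):
        self.num_bits = max(8, math.ceil(expected_values * bloom_bits_per_value(fpp)))
        self.num_hashes = max(1, round(-math.log2(fpp)))
        self.bits = bytearray(math.ceil(self.num_bits / 8))

    def _positions(self, value: str):
        # Double hashing: derive all bit positions from two halves of one digest.
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # keep the step odd
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, value: str) -> None:
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(value))


def prune_files(planned_files, blooms, needle):
    """Keep a file unless its Bloom filter proves the value is absent.

    A file without a filter (missing or stale index) is always kept, so
    pruning can only drop files that cannot match the predicate.
    """
    kept = []
    for path in planned_files:
        bloom = blooms.get(path)
        if bloom is None or bloom.might_contain(needle):
            kept.append(path)
    return kept
```

With these definitions, `bloom_payload_bytes(10_000, 0.01)` comes out just under 12,000 bytes, in line with the ~12 KB per data file mentioned above. `prune_files` never drops a file that holds the value, because a Bloom filter has no false negatives, and it keeps every file that has no filter at all.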
>>>
>>> Looking forward to the sync!
>>>
>>> Best,
>>>
>>> Huaxin
>>>
>>> On Sat, Feb 28, 2026 at 3:53 AM Péter Váry <[email protected]> wrote:
>>>
>>>> Please note that the next *Secondary Index Sync* will take place on *March 2nd, 9:00-10:00 AM PT*.
>>>>
>>>> *Proposed agenda*:
>>>>
>>>> - Discussion of potential use‑cases
>>>>   - Primary Key index for Flink equality‑delete resolution
>>>>   - Secondary data layout
>>>>   - Containing index
>>>>   - Alternative query plans
>>>>   - Vector index
>>>> - Discussion of the two alternative approaches for metadata placement: keeping index metadata inside the table metadata vs. managing it externally through an Index Catalog
>>>> - Bloom filter index status update
>>>>   - Performance justification: when this helps (high-cardinality = / IN, many data files, high object-store latency) and how it differs from Parquet row-group Bloom filters (which still require opening the data file).
>>>>   - Cost / scalability: rough sizing (Bloom blob size per file, Puffin file size), the planning cost trade-off (driver index reads vs executor file opens), and mitigations via caching.
>>>>   - Lifecycle / maintenance: incremental production as new data files arrive, behavior when the index is missing/behind, and sharding/compaction plus cleanup to avoid accumulating too many small Puffin files over time.
>>>>   - Writer expectations: inline (optional) vs asynchronous (primary) index creation.
>>>>
>>>> Looking forward to diving into this topic together.
>>>>
>>>> See you all there,
>>>> Peter
>>>>
>>>> On Wed, Feb 25, 2026 at 10:04, Péter Váry <[email protected]> wrote:
>>>>
>>>>> Dan kindly set up a dedicated public Slack channel (*#indexes*) for the Secondary Index discussion.
>>>>> You can find it here:
>>>>> https://apache-iceberg.slack.com/archives/C0AFDSU3EUU
>>>>> Feel free to join if you’d like to participate in the discussion or simply follow along.
>>>>>
>>>>> Thanks,
>>>>> Peter
>>>>>
>>>>> On Tue, Feb 24, 2026 at 12:52, Péter Váry <[email protected]> wrote:
>>>>>
>>>>>> We had an extended discussion on Slack with Dan, Steven, and Yufei about where index metadata should live, in particular whether it should be stored directly in the table metadata or maintained in a dedicated index catalog. I tried to capture this discussion in the Layout <https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?pli=1&tab=t.0#heading=h.4oz3yd6ngr3> section of the document.
>>>>>>
>>>>>> Once the decision is made, this section can be shortened, but for now it is intentionally more detailed so that everyone can see the arguments that were discussed and so that those who could not participate synchronously can still follow and provide feedback offline.
>>>>>>
>>>>>> In short, we are currently *leaning toward storing index metadata in its own catalog*, while allowing REST catalogs to expose a composite endpoint that returns both table and index metadata in a single round trip. This is similar in spirit to the universal load endpoint discussed in the context of materialized view loading.
>>>>>>
>>>>>> Thanks,
>>>>>> Peter
>>>>>>
>>>>>> On Thu, Feb 19, 2026 at 14:06, Péter Váry <[email protected]> wrote:
>>>>>>
>>>>>>> Thanks Huaxin for posting the recording and the meeting notes.
>>>>>>>
>>>>>>> I used this time to also address the questions collected during the sync:
>>>>>>>
>>>>>>> - Collected some representative use cases.
>>>>>>>   See the example use-cases <https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?pli=1&tab=t.0#heading=h.i4gt8za99j9d> paragraph. Anyone should feel free to suggest their own.
>>>>>>> - Collected my thoughts about the writer requirements. See the writer requirements <https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?pli=1&tab=t.0#heading=h.4b1p8r8nmfg1> paragraph.
>>>>>>> - Centralized the index maintenance related parts. See the index maintenance <https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?pli=1&tab=t.0#heading=h.hw2nt44i0k8q> paragraph.
>>>>>>>
>>>>>>> This might be a bit premature, but I created a PR <https://github.com/apache/iceberg/pull/15101> with the proposed index catalog related changes, so that those who are more code-oriented can take a look at it too.
>>>>>>>
>>>>>>> On Thu, Feb 19, 2026 at 5:34, huaxin gao <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Everyone,
>>>>>>>>
>>>>>>>> Here are the recording and notes from the Iceberg Index Support Sync on 2/11.
>>>>>>>>
>>>>>>>> Recording: https://www.youtube.com/watch?v=3sFfQ0A50yk
>>>>>>>>
>>>>>>>> Notes:
>>>>>>>> https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?tab=t.8041k7j2n7y3
>>>>>>>>
>>>>>>>> The meeting will move to biweekly, Mondays 9–10am PST, starting March 2.
>>>>>>>>
>>>>>>>> Since the sync, I updated the Bloom skipping index proposal <https://docs.google.com/document/d/1x-0KT43aTrt8u6EV7EgSietIFQSkGsocqwnBTHPebRU/edit?tab=t.0#heading=h.5r5kl6k3fqwu> to address the discussion questions, specifically:
>>>>>>>>
>>>>>>>> - Performance justification: when this helps (high-cardinality = / IN, many data files, high object-store latency) and how it differs from Parquet row-group Bloom filters (which still require opening the data file).
>>>>>>>> - Cost / scalability: rough sizing (Bloom blob size per file, Puffin file size), the planning cost trade-off (driver index reads vs executor file opens), and mitigations via caching.
>>>>>>>> - Lifecycle / maintenance: incremental production as new data files arrive, behavior when the index is missing/behind, and sharding/compaction plus cleanup to avoid accumulating too many small Puffin files over time.
>>>>>>>> - Writer expectations: inline (optional) vs asynchronous (primary) index creation.
>>>>>>>>
>>>>>>>> I also implemented a Spark 4.1 POC <https://github.com/apache/iceberg/pull/15311> and a local benchmark to quantify both the pruning impact (plannedFiles → afterBloom) and the index read overhead (statsFiles, statsBytes, bloomPayloadBytes) for point predicates on high-cardinality columns. Please take a look and let me know if you have any questions or feedback.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Huaxin
>>>>>>>>
>>>>>>>> On Tue, Feb 10, 2026 at 1:43 PM huaxin gao <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Reminder for tomorrow's sync on Iceberg Index Support.
>>>>>>>>>
>>>>>>>>> Wednesday: Feb. 11 9:00 – 10:00am
>>>>>>>>> Time zone: America/Los_Angeles
>>>>>>>>> Google Meet joining info
>>>>>>>>> Video call link: meet.google.com/nsp-ctyr-khk
>>>>>>>>> Design doc:
>>>>>>>>>
>>>>>>>>> https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?tab=t.0#heading=h.hs6r9d26w1y2
>>>>>>>>>
>>>>>>>>> https://docs.google.com/document/d/1x-0KT43aTrt8u6EV7EgSietIFQSkGsocqwnBTHPebRU/edit?tab=t.0#heading=h.qouk73o4jxx7
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Huaxin
>>>>>>>>>
>>>>>>>>> On Tue, Feb 3, 2026 at 10:52 PM Péter Váry <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks Huaxin and Steven for organizing this. Looking forward to meeting you all next week!
>>>>>>>>>>
>>>>>>>>>> On Wed, Feb 4, 2026, 02:48 Steven Wu <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> We set up the dev calendar event with a new Google Meet link. Please ignore the link from Huaxin's original email.
>>>>>>>>>>>
>>>>>>>>>>> The dev calendar has the correct info (including the new meeting link):
>>>>>>>>>>>
>>>>>>>>>>> Iceberg Index Support Sync
>>>>>>>>>>> Wednesday, February 11 · 9:00 – 10:00am
>>>>>>>>>>> Time zone: America/Los_Angeles
>>>>>>>>>>> Google Meet joining info
>>>>>>>>>>> Video call link: https://meet.google.com/nsp-ctyr-khk
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Feb 3, 2026 at 5:08 PM huaxin gao <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Sorry, I meant PST (not EST) :)
>>>>>>>>>>>> Looking forward to the discussion!
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Feb 3, 2026 at 4:58 PM Shawn Chang <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Huaxin,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for starting the sync!
>>>>>>>>>>>>>
>>>>>>>>>>>>> The meeting seems to be 9-10AM PST on the dev events calendar <https://calendar.google.com/calendar/u/0?cid=MzkwNWQ0OTJmMWI0NTBiYTA3MTJmMmFlNmFmYTc2ZWI3NTdmMTNkODUyMjBjYzAzYWE0NTI3ODg1YWRjNTYyOUBncm91cC5jYWxlbmRhci5nb29nbGUuY29t>, not EST. Maybe it's a typo?
>>>>>>>>>>>>> Otherwise, looking forward to the discussion!
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best,
>>>>>>>>>>>>> Shawn
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Feb 3, 2026 at 9:18 AM huaxin gao <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>> I'd like to start a dedicated sync to discuss Iceberg Index support. Here is the existing discussion thread:
>>>>>>>>>>>>>> https://lists.apache.org/thread/fzqk3jjf0xpj5m4cfqb3v4c65p0t04ty
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> To ground the discussion, here are the two proposals:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - Peter's proposal <https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?tab=t.0#heading=h.hs6r9d26w1y2> (overall index support)
>>>>>>>>>>>>>> - My proposal <https://docs.google.com/document/d/1x-0KT43aTrt8u6EV7EgSietIFQSkGsocqwnBTHPebRU/edit?tab=t.0#heading=h.qouk73o4jxx7> (bloom filter skipping index)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Time slot: Every 3 weeks, Wednesdays at 9 AM to 10 AM EST, starting next Wednesday (2/11). After FileFormat sync finishes, we plan to use that slot and switch to every other Monday, 9 AM to 10 AM EST.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Meet link: https://meet.google.com/fjn-tyze-mko
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Huaxin
