> if a column’s default value changes (a schema/metadata-only update), we may still need to refresh the index to ensure it returns correct results.
The initial-default value never changes after the column is added to the schema. The write-default can change, but that only applies to new rows. I am not sure we have a problem here.

On Tue, Mar 3, 2026 at 5:27 AM Péter Váry <[email protected]> wrote:

> Thanks to everyone who participated in the community sync about the indexes!
>
> Here is the recording:
> https://www.youtube.com/watch?v=pZFJfAlMHsM&list=PLkifVhhWtccwbfBhHk_DGOogxXNtiKvbF
> Here is the chat log:
> https://drive.google.com/file/d/1_N1suxhhdHt4aQuoPuLX24KJz32w3qW0/view
>
> I added my highlights from the general index discussion to the doc:
> https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?pli=1&tab=t.8041k7j2n7y3#heading=h.n0hz359alh52
>
> A few takeaways from the general index discussion:
>
> - We reviewed the options for synchronous and asynchronous index updates. We agreed that asynchronous updates should be our primary focus, while we expect that synchronous updates could still be valuable in certain scenarios. In those cases, we may be able to rely on the catalog REST API to ensure that table updates and index updates occur atomically.
>
> - We also touched on writer requirements. We would like to avoid requiring extra work from writers, but in some cases this might be necessary. Also, although many tables typically have a single writer, table maintenance operations still need to be taken into account. We may want to introduce a flag that blocks writes unless the writer is capable of updating the index as well. Alternatively, we could define a mechanism that ensures the table cannot be updated without updating the index.
>
> - Prashant pointed out that we must also consider values stored solely in table metadata when computing indexes. For example, if a column’s default value changes (a schema/metadata-only update), we may still need to refresh the index to ensure it returns correct results.
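To make the initial-default vs. write-default distinction from the reply at the top concrete, here is a minimal sketch in Python. It is a toy model only: `ColumnDefaults`, `write_row`, and `read_row` are hypothetical names invented for this illustration and are not part of any Iceberg API. It assumes that rows written before the column existed read back as initial-default, and that writers materialize the write-default into newly written rows.

```python
from dataclasses import dataclass, replace
from typing import Any, Dict


@dataclass(frozen=True)
class ColumnDefaults:
    """Hypothetical stand-in for a schema field's default settings."""
    name: str
    initial_default: Any  # fixed when the column is added to the schema
    write_default: Any    # may be changed later by a metadata-only update


def write_row(values: Dict[str, Any], col: ColumnDefaults) -> Dict[str, Any]:
    """Materialize a row as a writer would: fill in write-default if the column is omitted."""
    row = dict(values)
    row.setdefault(col.name, col.write_default)
    return row


def read_row(stored: Dict[str, Any], col: ColumnDefaults) -> Dict[str, Any]:
    """Rows stored without the column (written before it existed) read as initial-default."""
    row = dict(stored)
    row.setdefault(col.name, col.initial_default)
    return row


col = ColumnDefaults("region", initial_default="unknown", write_default="unknown")
old_row = {"id": 1}                  # written before the column was added
mid_row = write_row({"id": 2}, col)  # written under the first write-default

col_v2 = replace(col, write_default="eu")  # metadata-only change
new_row = write_row({"id": 3}, col_v2)     # written after the change

values_seen = [read_row(r, col_v2)["region"] for r in (old_row, mid_row, new_row)]
print(values_seen)  # ['unknown', 'unknown', 'eu']
```

In this model the values of already-written rows are the same before and after the write-default change, so an index built over them would not need a refresh for that reason alone. Whether other metadata-only changes can affect index correctness is left open here.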
>
> In the next sync, I would like to follow up on the vector indexes and, if we have some time, on index maintenance.
>
> Thanks,
> Peter
>
> huaxin gao <[email protected]> wrote (on Mon, Mar 2, 2026, 4:24):
>
>> Thanks Peter for the reminder and agenda!
>>
>> Here are some more details on the Bloom index status:
>>
>> - When it helps: high-cardinality =/IN predicates where min/max stats are not selective and many files remain after normal Iceberg pruning (“needle in a haystack”).
>> - Why it helps vs Parquet row-group Bloom: row-group Bloom still requires opening each candidate data file (footer/Bloom pages). Puffin Bloom is consulted during planning, so it can prune files before scheduling scan tasks and opening most files.
>> - Savings vs cost:
>>   - Savings: plannedFiles → afterBloom (files avoided)
>>   - Cost: planner reads statsFiles/statsBytes/bloomPayloadBytes (Puffin footer + selective blob slices)
>>   - Example (POC benchmark): plannedFiles=658, afterBloom=1 (needle), with index overhead statsFiles=1, statsBytes≈17MB, bloomPayloadBytes≈16.8MB. The goal is to show that “avoided per-file opens/tasks” outweighs “index read”. This benchmark is intentionally scoped to the workload the feature targets; it’s not meant to claim Bloom skipping helps all queries, which is why the feature is opt-in. Users enable this when they see selective point lookups over many files and want to reduce file opens/task scheduling.
>> - Sizing: for fpp=0.01, a Bloom filter needs about 1.2 bytes (roughly 9.6 bits) per inserted value. Example: ~10,000 values/file → ~12 KB Bloom payload per data file (plus small Puffin overhead).
>> - Lifecycle/maintenance: incremental shards for new files; missing/behind is safe (no pruning); shard compaction + snapshot expiration/orphan cleanup to bound artifacts.
>> - Writer expectations: async maintenance is primary; inline is optional (inline writers may not know the final number of inserted values up front, so they can size at file close or use a scalable/growing Bloom filter); any error/missing/stale index ⇒ fallback (correctness unchanged). The feature is opt-in for the targeted workload.
>>
>> Looking forward to the sync!
>>
>> Best,
>>
>> Huaxin
>>
>> On Sat, Feb 28, 2026 at 3:53 AM Péter Váry <[email protected]> wrote:
>>
>>> Please note that the next *Secondary Index Sync* will take place on *March 2nd, 9:00-10:00 AM PT*.
>>>
>>> *Proposed agenda*:
>>>
>>> - Discussion of potential use-cases
>>>   - Primary Key index for Flink equality-delete resolution
>>>   - Secondary data layout
>>>   - Containing index
>>>   - Alternative query plans
>>>   - Vector index
>>> - Discussion of the two alternative approaches for metadata placement: keeping index metadata inside the table metadata vs. managing it externally through an Index Catalog
>>> - Bloom filter index status update
>>>   - Performance justification: when this helps (high-cardinality = / IN, many data files, high object-store latency) and how it differs from Parquet row-group Bloom filters (which still require opening the data file).
>>>   - Cost / scalability: rough sizing (Bloom blob size per file, Puffin file size), the planning cost trade-off (driver index reads vs executor file opens), and mitigations via caching.
>>>   - Lifecycle / maintenance: incremental production as new data files arrive, behavior when the index is missing/behind, and sharding/compaction plus cleanup to avoid accumulating too many small Puffin files over time.
>>>   - Writer expectations: inline (optional) vs asynchronous (primary) index creation.
>>>
>>> Looking forward to diving into this topic together.
>>>
>>> See you all there,
>>> Peter
>>>
>>> Péter Váry <[email protected]> wrote (on Wed, Feb 25, 2026, 10:04):
>>>
>>>> Dan kindly set up a dedicated public Slack channel (*#indexes*) for the Secondary Index discussion.
>>>> You can find it here:
>>>> https://apache-iceberg.slack.com/archives/C0AFDSU3EUU
>>>> Feel free to join if you’d like to participate in the discussion or simply follow along.
>>>>
>>>> Thanks,
>>>> Peter
>>>>
>>>> Péter Váry <[email protected]> wrote (on Tue, Feb 24, 2026, 12:52):
>>>>
>>>>> We had an extended discussion on Slack with Dan, Steven, and Yufei about where index metadata should live, in particular whether it should be stored directly in the table metadata or maintained in a dedicated index catalog. I tried to capture this discussion in the Layout
>>>>> <https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?pli=1&tab=t.0#heading=h.4oz3yd6ngr3>
>>>>> section of the document.
>>>>>
>>>>> Once the decision is made, this section can be shortened, but for now it is intentionally more detailed so that everyone can see the arguments that were discussed and so that those who could not participate synchronously can still follow and provide feedback offline.
>>>>>
>>>>> In short, we are currently *leaning toward storing index metadata in its own catalog*, while allowing REST catalogs to expose a composite endpoint that returns both table and index metadata in a single round trip. This is similar in spirit to the universal load endpoint discussed in the context of materialized view loading.
>>>>>
>>>>> Thanks,
>>>>> Peter
>>>>>
>>>>> Péter Váry <[email protected]> wrote (on Thu, Feb 19, 2026, 14:06):
>>>>>
>>>>>> Thanks Huaxin for posting the recording and the meeting notes.
>>>>>>
>>>>>> I used this time to also address the questions collected during the sync:
>>>>>>
>>>>>> - Collected some representative use cases.
>>>>>>   See the example use-cases
>>>>>>   <https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?pli=1&tab=t.0#heading=h.i4gt8za99j9d>
>>>>>>   paragraph. Anyone should feel free to suggest their own.
>>>>>> - Collected my thoughts about the writer requirements. See the writer requirements
>>>>>>   <https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?pli=1&tab=t.0#heading=h.4b1p8r8nmfg1>
>>>>>>   paragraph.
>>>>>> - Centralized the parts related to index maintenance. See the index maintenance
>>>>>>   <https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?pli=1&tab=t.0#heading=h.hw2nt44i0k8q>
>>>>>>   paragraph.
>>>>>>
>>>>>> It might be a bit premature, but I created a PR
>>>>>> <https://github.com/apache/iceberg/pull/15101> with the proposed index catalog related changes, so that those who are more code-oriented can take a look at it too.
>>>>>>
>>>>>> huaxin gao <[email protected]> wrote (on Thu, Feb 19, 2026, 5:34):
>>>>>>
>>>>>>> Hi Everyone,
>>>>>>>
>>>>>>> Here are the recording and notes from the Iceberg Index Support Sync on 2/11.
>>>>>>>
>>>>>>> Recording: https://www.youtube.com/watch?v=3sFfQ0A50yk
>>>>>>>
>>>>>>> Notes:
>>>>>>> https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?tab=t.8041k7j2n7y3
>>>>>>>
>>>>>>> The meeting will move to biweekly, Mondays 9–10am PST, starting March 2.
>>>>>>>
>>>>>>> Since the sync, I updated the Bloom skipping index proposal
>>>>>>> <https://docs.google.com/document/d/1x-0KT43aTrt8u6EV7EgSietIFQSkGsocqwnBTHPebRU/edit?tab=t.0#heading=h.5r5kl6k3fqwu>
>>>>>>> to address the discussion questions, specifically:
>>>>>>>
>>>>>>> - Performance justification: when this helps (high-cardinality = / IN, many data files, high object-store latency) and how it differs from Parquet row-group Bloom filters (which still require opening the data file).
>>>>>>> - Cost / scalability: rough sizing (Bloom blob size per file, Puffin file size), the planning cost trade-off (driver index reads vs executor file opens), and mitigations via caching.
>>>>>>> - Lifecycle / maintenance: incremental production as new data files arrive, behavior when the index is missing/behind, and sharding/compaction plus cleanup to avoid accumulating too many small Puffin files over time.
>>>>>>> - Writer expectations: inline (optional) vs asynchronous (primary) index creation.
>>>>>>>
>>>>>>> I also implemented a Spark 4.1 POC
>>>>>>> <https://github.com/apache/iceberg/pull/15311> and a local benchmark to quantify both the pruning impact (plannedFiles → afterBloom) and the index read overhead (statsFiles, statsBytes, bloomPayloadBytes) for point predicates on high-cardinality columns. Please take a look and let me know if you have any questions or feedback.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Huaxin
>>>>>>>
>>>>>>> On Tue, Feb 10, 2026 at 1:43 PM huaxin gao <[email protected]> wrote:
>>>>>>>
>>>>>>>> Reminder for tomorrow's sync on Iceberg Index Support.
>>>>>>>>
>>>>>>>> Wednesday: Feb. 11 9:00 – 10:00am
>>>>>>>> Time zone: America/Los_Angeles
>>>>>>>> Google Meet joining info
>>>>>>>> Video call link: meet.google.com/nsp-ctyr-khk
>>>>>>>> Design docs:
>>>>>>>>
>>>>>>>> https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?tab=t.0#heading=h.hs6r9d26w1y2
>>>>>>>>
>>>>>>>> https://docs.google.com/document/d/1x-0KT43aTrt8u6EV7EgSietIFQSkGsocqwnBTHPebRU/edit?tab=t.0#heading=h.qouk73o4jxx7
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Huaxin
>>>>>>>>
>>>>>>>> On Tue, Feb 3, 2026 at 10:52 PM Péter Váry <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Thanks Huaxin and Steven for organizing this. Looking forward to meeting you all next week!
>>>>>>>>>
>>>>>>>>> On Wed, Feb 4, 2026, 02:48 Steven Wu <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> We set up the dev calendar event with a new Google Meet link. Please ignore the link from Huaxin's original email.
>>>>>>>>>>
>>>>>>>>>> The dev calendar has the correct info (including the new meeting link):
>>>>>>>>>>
>>>>>>>>>> Iceberg Index Support Sync
>>>>>>>>>> Wednesday, February 11 · 9:00 – 10:00am
>>>>>>>>>> Time zone: America/Los_Angeles
>>>>>>>>>> Google Meet joining info
>>>>>>>>>> Video call link: https://meet.google.com/nsp-ctyr-khk
>>>>>>>>>>
>>>>>>>>>> On Tue, Feb 3, 2026 at 5:08 PM huaxin gao <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Sorry, I meant PST (not EST) :)
>>>>>>>>>>> Looking forward to the discussion!
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Feb 3, 2026 at 4:58 PM Shawn Chang <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Huaxin,
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for starting the sync!
>>>>>>>>>>>>
>>>>>>>>>>>> The meeting seems to be 9-10AM PST on the dev events calendar
>>>>>>>>>>>> <https://calendar.google.com/calendar/u/0?cid=MzkwNWQ0OTJmMWI0NTBiYTA3MTJmMmFlNmFmYTc2ZWI3NTdmMTNkODUyMjBjYzAzYWE0NTI3ODg1YWRjNTYyOUBncm91cC5jYWxlbmRhci5nb29nbGUuY29t>,
>>>>>>>>>>>> not EST. Maybe it's a typo?
>>>>>>>>>>>> Otherwise, looking forward to the discussion!
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Shawn
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Feb 3, 2026 at 9:18 AM huaxin gao <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>> I'd like to start a dedicated sync to discuss Iceberg Index support. Here is the existing discussion thread:
>>>>>>>>>>>>> https://lists.apache.org/thread/fzqk3jjf0xpj5m4cfqb3v4c65p0t04ty
>>>>>>>>>>>>>
>>>>>>>>>>>>> To ground the discussion, here are the two proposals:
>>>>>>>>>>>>>
>>>>>>>>>>>>> - Peter's proposal
>>>>>>>>>>>>>   <https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?tab=t.0#heading=h.hs6r9d26w1y2>
>>>>>>>>>>>>>   (overall index support)
>>>>>>>>>>>>> - My proposal
>>>>>>>>>>>>>   <https://docs.google.com/document/d/1x-0KT43aTrt8u6EV7EgSietIFQSkGsocqwnBTHPebRU/edit?tab=t.0#heading=h.qouk73o4jxx7>
>>>>>>>>>>>>>   (bloom filter skipping index)
>>>>>>>>>>>>>
>>>>>>>>>>>>> Time slot: Every 3 weeks, Wednesdays at 9 AM to 10 AM EST, starting next Wednesday (2/11). After FileFormat sync finishes, we plan to use that slot and switch to every other Monday, 9 AM to 10 AM EST.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Meet link: https://meet.google.com/fjn-tyze-mko
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Huaxin
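As a cross-check of the Bloom sizing figures quoted earlier in the thread (fpp=0.01, about 1.2 bytes per inserted value, ~12 KB for ~10,000 values per file), the following is a minimal Python sketch. It assumes a classic Bloom filter with an optimal number of hash functions, for which the bits per value are -ln(fpp) / (ln 2)^2. The function names are hypothetical, and the filter layout actually used in the POC may differ (for example, a blocked filter would need somewhat more space).

```python
import math


def bloom_bits_per_value(fpp: float) -> float:
    """Bits per inserted value for a classic Bloom filter with optimal hash count."""
    if not 0.0 < fpp < 1.0:
        raise ValueError("fpp must be strictly between 0 and 1")
    return -math.log(fpp) / (math.log(2) ** 2)


def bloom_payload_bytes(num_values: int, fpp: float) -> int:
    """Approximate Bloom payload size in bytes for num_values inserted values."""
    bits = math.ceil(num_values * bloom_bits_per_value(fpp))
    return math.ceil(bits / 8)


def optimal_hash_count(fpp: float) -> int:
    """Number of hash functions that minimizes size for the target fpp."""
    return max(1, round(-math.log2(fpp)))


if __name__ == "__main__":
    fpp = 0.01
    print(f"bytes per value: {bloom_bits_per_value(fpp) / 8:.2f}")
    print(f"payload for 10,000 values: {bloom_payload_bytes(10_000, fpp)} bytes")
    print(f"hash functions: {optimal_hash_count(fpp)}")
```

Under these assumptions the per-value cost comes out close to 1.2 bytes, and a file with about 10,000 distinct inserted values needs just under 12 KB of Bloom payload, which is consistent with the numbers in the status update.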
