Thanks, Steven, for starting this. I am interested in the indexing-related conversations.
Here are some preliminary thoughts:

1. *Primary Index*: Conventionally, a primary index simply describes the table's primary storage layout/organization. Given that Iceberg supports sort order, if the spec adds constraints to derive or influence the sort order based on the identifier columns, that satisfies the primary index criteria.

2. *Secondary Index*: Secondary index storage calls for an efficient organization that can hold the secondary keys along with the location of the row and any included columns. The index can be of many types, depending on the data. Iceberg tables are typically very, very large, so these indexes also tend to be very large. Based on our past 1-2 years of work in this space, we found that storing the secondary index as an Iceberg table strikes the right balance: readers can skip over data and load only the needed sections while still getting the performance benefits. This decision was also shaped by popular opinion from many of our partners and customers: because building an index involves a lot of computation, they want it to be in an open format so that it can be shared with other engines!

3. *Others: Full Text Search Indexes and Vector Indexes*: It is critical that we allow for the years of innovation in full-text search and vector indexes, especially given the current acceleration in AI adoption and the demand it is driving for keyword and similarity search. Given that Iceberg tables are extremely large, it is critical for us to provide a good story for indexes that can be incrementally updated / partially loaded into memory.

Looking forward to the discussions.

Best,
Sreeram

On Tue, Jul 15, 2025 at 9:33 AM Anurag Mantripragada <amantriprag...@apple.com.invalid> wrote:

> Thanks for starting this thread, Steven!
>
> I have been interested in secondary indexing in Iceberg. There was an old
> proposal for secondary indexing [1]; we may need to revisit/redesign these
> structures.
> I agree this is a very broad topic and having indexing
> structures general enough to support a wide range of use-cases will be a
> key challenge.
>
> I would like to get involved in any discussions related to indexing.
>
> [1] -
> https://docs.google.com/document/d/1E1ofBQoKRnX04bWT3utgyHQGaHZoelgXosk_UNsTUuQ/edit?tab=t.0
>
> Thanks,
> Anurag Mantripragada
>
> On Jul 15, 2025, at 2:37 AM, Maximilian Michels <m...@apache.org> wrote:
>
> Thanks Steven for the summary. It would be great to extend the Iceberg
> spec with index files, such that they can be used for the different use
> cases.
>
> For my understanding, let me further outline the different types of use
> cases for index files:
>
> ---
> Topic 1: Accelerating the resolution of equality deletes
> ---
>
> In its current form, equality deletes make it impossible to achieve proper
> merge-on-read performance in streaming reads, and they also add a
> significant performance overhead in batch pipelines.
>
> Approach (a):
> https://docs.google.com/document/d/1Jz4Fjt-6jRmwqbgHX_u0ohuyTB9ytDzfslS7lYraIjk/
> Converting equality deletes to positional deletes would be a great
> achievement. I'm wondering, though, whether all engines will be able to
> achieve this. There is quite some runtime complexity involved. If I
> understand correctly, the index can be bootstrapped via table maintenance
> tasks and then has to be maintained by the streaming writer.
>
> Approach (b):
> https://lists.apache.org/thread/gjjr30txq318qp6pff3x5fx1jmdnr6fv
> This would boost the resolution of equality deletes during reads via
> indices. The indices can be built via maintenance tasks, or directly by the
> writer as in (a). But how do we keep the index fresh if we don't write the
> index at the writers? Readers won't always be able to use an
> up-to-date index, making this less suitable for streaming reads.
>
> ---
> Topic 2: Full text search in table scans
> ---
>
> https://docs.google.com/document/d/1bMACRCJBB8ycSXCFbP_BdCbFCAegRoxr2O2NXZirOmY/edit
> Adding full-text search would broaden Iceberg’s applicability, enabling
> new search use cases and making table scans far more powerful.
>
> Cheers,
> Max
>
> On Wed, Jul 9, 2025 at 11:35 PM Steven Wu <stevenz...@gmail.com> wrote:
>
>> Similar to other V4 threads, I am starting a thread to gauge interest in
>> adding index support in Iceberg V4 and gather a focus group in this area.
>>
>> There have been a few discussions related to indexing recently.
>>
>> - Peter Vary and I are working on a proposal (WIP) to only write
>> position deletes in the Flink streaming writer. It would need a primary
>> key index to work reasonably efficiently. [1]
>> - Xiaoxuan Li has a proposal to leverage index files to improve
>> merge-on-read performance with equality deletes. [2]
>> - pengzhiwei has a proposal to support full-text index and vector
>> index. [3]
>>
>> *Idea: index files*
>>
>> To support those use cases, Iceberg can add support for index files (in
>> addition to data files and delete files). It should be general enough to
>> support different forms of indexing.
>>
>> - Primary key index
>> - Secondary index
>> - Full text index
>> - Vector index
>>
>> This email is a starting point. It is a large topic. A lot of discussions
>> and maturation of the ideas are needed before a formal proposal.
>>
>> Thanks,
>> Steven
>>
>> [1]
>> https://docs.google.com/document/d/1Jz4Fjt-6jRmwqbgHX_u0ohuyTB9ytDzfslS7lYraIjk/
>> (WIP)
>> [2] https://lists.apache.org/thread/j4zl44g6dllzzyg9ln45pvgoosfhxqrq
>> [3] https://github.com/apache/iceberg/issues/12636