Thanks, Steven, for the summary. It would be great to extend the Iceberg spec with index files so that they can serve the different use cases.

For my understanding, let me further outline the different types of use cases for index files:

--- Topic 1: Accelerating the resolution of equality deletes ---

In its current form, equality deletes make it impossible to achieve proper merge-on-read performance in streaming reads, and they also add a significant performance overhead in batch pipelines.

Approach (a): https://docs.google.com/document/d/1Jz4Fjt-6jRmwqbgHX_u0ohuyTB9ytDzfslS7lYraIjk/

Converting equality deletes to positional deletes would be a great achievement. I'm wondering, though, whether all engines will be able to do this, given the runtime complexity involved. If I understand correctly, the index can be bootstrapped via table maintenance tasks and then has to be maintained by the streaming writer.

Approach (b): https://lists.apache.org/thread/gjjr30txq318qp6pff3x5fx1jmdnr6fv

This would speed up the resolution of equality deletes during reads via indices. The indices can be built via maintenance tasks, or directly by the writer as in (a). But how do we keep the index fresh if the writers don't write it? Readers won't always be able to use an up-to-date index, which makes this approach less suitable for streaming reads.

--- Topic 2: Full text search in table scans ---

https://docs.google.com/document/d/1bMACRCJBB8ycSXCFbP_BdCbFCAegRoxr2O2NXZirOmY/edit

Adding full-text search would broaden Iceberg’s applicability, enabling new search use cases and making table scans far more powerful.

Cheers,
Max

On Wed, Jul 9, 2025 at 11:35 PM Steven Wu <stevenz...@gmail.com> wrote:
>
> Similar to other V4 threads, I am starting a thread to gauge interest in
> adding index support in Iceberg V4 and gather a focus group in this area.
>
> There have been a few discussions related to indexing recently.
>
>    - Me and Peter Vary are working on a proposal (WIP) to only write
>    position deletes in the Flink streaming writer. It would need a primary key
>    index to work reasonably efficiently. [1]
>    - Xiaoxuan Li has a proposal to leverage index files to improve
>    merge-on-read performance with equality deletes. [2]
>    - pengzhiwei has a proposal to support full-text index and vector
>    index. [3]
>
>
> *Idea: index files*
>
> To support those use cases, Iceberg can add support for index files (in
> addition to data files and delete files). It should be general enough to
> support different forms of indexing.
>
>    - Primary key index
>    - Secondary index
>    - Full text index
>    - Vector index
>
>
> This email is a starting point. It is a large topic. A lot of discussions
> and maturation of the ideas are needed before a formal proposal.
>
> Thanks,
> Steven
>
> [1]
> https://docs.google.com/document/d/1Jz4Fjt-6jRmwqbgHX_u0ohuyTB9ytDzfslS7lYraIjk/
> (WIP)
> [2] https://lists.apache.org/thread/j4zl44g6dllzzyg9ln45pvgoosfhxqrq
> [3] https://github.com/apache/iceberg/issues/12636
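
To make approach (a) in Topic 1 a bit more concrete, here is a minimal Python sketch of how a writer that has access to a primary key index could emit position deletes instead of equality deletes. It is only an illustration under stated assumptions: the names PrimaryKeyIndex and StreamingWriter, the file names, and the in-memory dict standing in for an index file are all hypothetical and are not taken from the Iceberg spec, the linked proposals, or any engine's API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Hypothetical type for illustration: (data file path, row position in that file).
Position = Tuple[str, int]


@dataclass
class PrimaryKeyIndex:
    """In-memory stand-in for a primary key index: key -> location of the live row."""
    entries: Dict[str, Position] = field(default_factory=dict)

    def lookup(self, key: str) -> Optional[Position]:
        return self.entries.get(key)


@dataclass
class StreamingWriter:
    """Toy writer that resolves deletes at write time via the index."""
    index: PrimaryKeyIndex
    data_file: str
    rows: List[Tuple[str, str]] = field(default_factory=list)  # (key, value)
    position_deletes: List[Position] = field(default_factory=list)

    def delete(self, key: str) -> None:
        # With the index, the writer knows exactly which row to mark deleted,
        # so no equality delete (and no read-time key matching) is needed.
        old = self.index.entries.pop(key, None)
        if old is not None:
            self.position_deletes.append(old)

    def upsert(self, key: str, value: str) -> None:
        self.delete(key)  # retire the previous version of the row, if any
        self.rows.append((key, value))
        # The writer keeps the index up to date as part of the write.
        self.index.entries[key] = (self.data_file, len(self.rows) - 1)


# The index is assumed to have been bootstrapped from existing data files.
index = PrimaryKeyIndex({"k1": ("data-001.parquet", 7)})
writer = StreamingWriter(index=index, data_file="data-002.parquet")

writer.upsert("k1", "v2")  # k1 exists in an older file: one position delete
writer.upsert("k2", "v1")  # new key: no delete
writer.upsert("k2", "v2")  # k2 was just written: delete within the new file
writer.delete("k3")        # unknown key: nothing to delete

print(writer.position_deletes)
```

The sketch hides exactly the costs raised above: the index has to be bootstrapped for existing data, and every writer has to keep it in sync on each upsert and delete.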