Thanks, Steven, for the summary. It would be great to extend the Iceberg spec
with index files so that they can serve the different use cases.

For my understanding, let me further outline the different types of use
cases for index files:

---
Topic 1: Accelerating the resolution of equality deletes
---

In its current form, equality deletes make it impossible to achieve proper
merge-on-read performance in streaming reads, and they also add a
significant performance overhead in batch pipelines.

Approach (a):
https://docs.google.com/document/d/1Jz4Fjt-6jRmwqbgHX_u0ohuyTB9ytDzfslS7lYraIjk/
Converting equality deletes to positional deletes would be a great
achievement. I'm wondering, though, whether all engines will be able to do
this, since it involves considerable runtime complexity. If I understand
correctly, the index can be bootstrapped via table maintenance tasks and
then has to be maintained by the streaming writer.
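
To check my understanding of (a), here is a minimal Python sketch of a
writer that consults a primary key index and emits positional deletes
instead of equality deletes. Everything in it is an assumption for
illustration: PrimaryKeyIndex, Position and write_upserts are hypothetical
names, the index is a plain in-memory dict rather than an index file, and
none of this is taken from the proposal or from the Iceberg API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Position:
    """Location of one row: the data file and the row ordinal inside it."""
    data_file: str
    row: int


@dataclass
class PrimaryKeyIndex:
    """Hypothetical in-memory stand-in for a primary key index file."""
    entries: dict = field(default_factory=dict)

    def upsert(self, key, new_pos):
        """Record the new location of key; return the old location, if any."""
        old = self.entries.get(key)
        self.entries[key] = new_pos
        return old


def write_upserts(index, data_file, rows, key_of):
    """Append rows to data_file and return the positional deletes needed.

    Each returned Position marks the previous version of a row whose key
    was written again, so no equality delete has to be emitted.
    """
    position_deletes = []
    for row_number, row in enumerate(rows):
        old = index.upsert(key_of(row), Position(data_file, row_number))
        if old is not None:
            position_deletes.append(old)
    return position_deletes
```

The sketch also shows where the runtime cost comes from: the writer needs
the index state for every key it touches, and it has to keep that state
current on each write.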

Approach (b):
https://lists.apache.org/thread/gjjr30txq318qp6pff3x5fx1jmdnr6fv
This would speed up the resolution of equality deletes during reads via
indices. The indices can be built via maintenance tasks, or directly by the
writer as in (a). But how do we keep the index fresh if the writers don't
write it? Readers won't always be able to use an up-to-date index, which
makes this approach less suitable for streaming reads.
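
To illustrate the freshness concern in (b), here is a rough Python sketch
of a reader that resolves equality deletes through an index where one
exists and falls back to scanning otherwise. The function name and the
data layout (dicts keyed by file name) are hypothetical, and the sketch
ignores sequence numbers entirely; it only shows that files written after
the last maintenance run still pay the full scan cost.

```python
def deleted_positions(files, deleted_keys, index, key_of):
    """Map each data file to the row positions removed by equality deletes.

    files:        dict of file name -> list of rows
    deleted_keys: set of key values carried by the equality deletes
    index:        dict of file name -> {key: [row positions]}, present only
                  for files covered by the last index build (an assumption)
    key_of:       function extracting the key from a row
    """
    result = {}
    for name, rows in files.items():
        file_index = index.get(name)
        if file_index is not None:
            # Indexed file: one lookup per deleted key.
            positions = [p for k in deleted_keys for p in file_index.get(k, [])]
        else:
            # File not indexed yet: scan every row.
            positions = [i for i, row in enumerate(rows)
                         if key_of(row) in deleted_keys]
        result[name] = sorted(positions)
    return result
```

In a streaming read, the newest files are exactly the ones most likely to
be missing from the index, so the fallback branch would be hit often.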

---
Topic 2: Full text search in table scans
---

https://docs.google.com/document/d/1bMACRCJBB8ycSXCFbP_BdCbFCAegRoxr2O2NXZirOmY/edit
Adding full-text search would broaden Iceberg’s applicability, enabling new
search use cases and making table scans far more powerful.

Cheers,
Max

On Wed, Jul 9, 2025 at 11:35 PM Steven Wu <stevenz...@gmail.com> wrote:

>
> Similar to other V4 threads, I am starting a thread to gauge interest in
> adding index support in Iceberg V4 and gather a focus group in this area.
>
> There have been a few discussions related to indexing recently.
>
>    - Me and Peter Vary are working on a proposal (WIP) to only write
>    position deletes in the Flink streaming writer. It would need a primary key
>    index to work reasonably efficiently. [1]
>    - Xiaoxuan Li has a proposal to leverage index files to improve
>    merge-on-read performance with equality deletes. [2]
>    - pengzhiwei has a proposal to support full-text index and vector
>    index. [3]
>
>
> *Idea: index files*
>
> To support those use cases, Iceberg can add support for index files (in
> addition to data files and delete files). It should be general enough to
> support different forms of indexing.
>
>    - Primary key index
>    - Secondary index
>    - Full text index
>    - Vector index
>
>
> This email is a starting point. It is a large topic. A lot of discussions
> and maturation of the ideas are needed before a formal proposal.
>
> Thanks,
> Steven
>
> [1]
> https://docs.google.com/document/d/1Jz4Fjt-6jRmwqbgHX_u0ohuyTB9ytDzfslS7lYraIjk/
> (WIP)
> [2] https://lists.apache.org/thread/j4zl44g6dllzzyg9ln45pvgoosfhxqrq
> [3] https://github.com/apache/iceberg/issues/12636
>
>
>
