Thanks for starting this thread, Steven!

I have been interested in secondary indexing in Iceberg. There was an old 
proposal for secondary indexing [1]; we may need to revisit or redesign these 
structures. I agree this is a very broad topic, and making the indexing 
structures general enough to support a wide range of use cases will be a key 
challenge. 

I would like to get involved in any discussions related to indexing. 

[1] https://docs.google.com/document/d/1E1ofBQoKRnX04bWT3utgyHQGaHZoelgXosk_UNsTUuQ/edit?tab=t.0


Thanks, 
Anurag Mantripragada


> On Jul 15, 2025, at 2:37 AM, Maximilian Michels <m...@apache.org> wrote:
> 
> Thanks Steven for the summary. It would be great to extend the Iceberg spec 
> with index files, such that they can be used for the different use cases.
> 
> For my understanding, let me further outline the different types of use cases 
> for index files:
> 
> --- 
> Topic 1: Accelerating the resolution of equality deletes
> ---
> 
> In its current form, equality deletes make it impossible to achieve proper 
> merge-on-read performance in streaming reads, and they also add a significant 
> performance overhead in batch pipelines. 
> 
> Approach (a): 
> https://docs.google.com/document/d/1Jz4Fjt-6jRmwqbgHX_u0ohuyTB9ytDzfslS7lYraIjk/
> Converting equality deletes to positional deletes would be a great 
> achievement. I'm wondering, though, whether all engines will be able to do 
> this, since it involves considerable runtime complexity. If I understand 
> correctly, the index can be bootstrapped via table maintenance tasks and 
> then has to be maintained by the streaming writer.
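
To make the idea in approach (a) concrete, here is a minimal sketch of how a writer could use a primary key index to emit position deletes instead of equality deletes. It is only an illustration under stated assumptions, not the design in the linked proposal: `PrimaryKeyIndex`, `StreamingWriter`, and the in-memory dict are hypothetical names, and a real index would live in index files and be bootstrapped by a maintenance task.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# (data file path, row ordinal within that file)
Position = Tuple[str, int]


@dataclass
class PrimaryKeyIndex:
    """Hypothetical in-memory stand-in for a primary key index."""
    entries: Dict[str, Position] = field(default_factory=dict)

    def lookup(self, key: str) -> Optional[Position]:
        return self.entries.get(key)


@dataclass
class StreamingWriter:
    """Hypothetical writer that resolves keys to positions at write time."""
    index: PrimaryKeyIndex
    data_file: str
    next_pos: int = 0
    position_deletes: List[Position] = field(default_factory=list)

    def upsert(self, key: str) -> None:
        # If the key already exists, delete the old row by position
        # rather than writing an equality delete on the key.
        old = self.index.lookup(key)
        if old is not None:
            self.position_deletes.append(old)
        # Record where the new row lands so later writes can find it.
        self.index.entries[key] = (self.data_file, self.next_pos)
        self.next_pos += 1

    def delete(self, key: str) -> None:
        # Unknown keys produce no delete at all.
        old = self.index.entries.pop(key, None)
        if old is not None:
            self.position_deletes.append(old)
```

The point of the sketch is that every write needs an index lookup and an index update, which is the runtime cost and maintenance burden discussed above.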
> 
> Approach (b): https://lists.apache.org/thread/gjjr30txq318qp6pff3x5fx1jmdnr6fv
> This would boost the resolution of equality deletes during reads via indices. 
> The indices can be built via maintenance tasks, or directly by the writer as 
> in (a). But how to keep the index fresh if we don't write the index at the 
> writers? Readers won't always be able to use an up-to-date index, making this 
> less suitable for streaming reads.
> 
> --- 
> Topic 2: Full text search in table scans
> ---
> 
> https://docs.google.com/document/d/1bMACRCJBB8ycSXCFbP_BdCbFCAegRoxr2O2NXZirOmY/edit
> Adding full-text search would broaden Iceberg’s applicability, enabling new 
> search use cases and making table scans far more powerful.
> 
> Cheers,
> Max
> 
> On Wed, Jul 9, 2025 at 11:35 PM Steven Wu <stevenz...@gmail.com> wrote:
>> 
>> Similar to other V4 threads, I am starting a thread to gauge interest in 
>> adding index support in Iceberg V4 and gather a focus group in this area.
>> 
>> There have been a few discussions related to indexing recently:
>> - Peter Vary and I are working on a proposal (WIP) to only write position 
>>   deletes in the Flink streaming writer. It would need a primary key index 
>>   to work reasonably efficiently. [1]
>> - Xiaoxuan Li has a proposal to leverage index files to improve 
>>   merge-on-read performance with equality deletes. [2]
>> - pengzhiwei has a proposal to support a full-text index and a vector 
>>   index. [3]
>> 
>> Idea: index files
>> 
>> To support those use cases, Iceberg can add support for index files (in 
>> addition to data files and delete files). It should be general enough to 
>> support different forms of indexing.
>> - Primary key index
>> - Secondary index
>> - Full text index
>> - Vector index
>> 
>> This email is a starting point. It is a large topic, and the ideas need a 
>> lot of discussion and maturing before a formal proposal.
>> 
>> Thanks,
>> Steven
>> 
>> [1] https://docs.google.com/document/d/1Jz4Fjt-6jRmwqbgHX_u0ohuyTB9ytDzfslS7lYraIjk/ (WIP)
>> [2] https://lists.apache.org/thread/j4zl44g6dllzzyg9ln45pvgoosfhxqrq
>> [3] https://github.com/apache/iceberg/issues/12636
>> 
>> 
