Re: [DISCUSS] V4 - indexing support

Aihua Xu Sat, 01 Nov 2025 20:11:13 -0700

Thanks Steven for raising this topic and giving a summary on the proposals.
I would like to get involved in this area.


On Fri, Oct 31, 2025 at 4:49 PM huaxin gao <[email protected]> wrote:

> Thanks, Steven, for taking the initiative. I have previously collaborated
> with Miao from Adobe on secondary index and would like to continue that
> work.
>
> Huaxin
>
> On Fri, Oct 31, 2025 at 1:07 PM Xinli shang <[email protected]>
> wrote:
>
>> Thanks Steven for proposing this! This is the right direction to go.
>> Definitely we see challenges in some cases without indexing support,
>> especially around equality deletes and point lookups. I would like to
>> contribute as well. One thing we need to be careful about is the overhead
>> of the index itself, such as memory usage and index updates.
>>
>> Namratha, for Parquet column index, we had one for Presto
>> https://www.youtube.com/watch?v=fr_HdhMEa3s.
>>
>>
>>
>>
>> On Fri, Oct 31, 2025 at 11:48 AM namratha mk <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> I see this point in the doc:
>>>
>>> *The primary key index can also be useful for point lookup.*
>>> But to achieve the above, we would need to store native file format
>>> metadata like the Parquet page index
>>> <https://parquet.apache.org/docs/file-format/pageindex/> in the primary
>>> index, which helps with fetching for the lookup use case. Have there been
>>> any discussions in the community about this? I would like to get more
>>> opinions on this.
>>>
>>> Thanks,
>>> Namratha
>>>
>>> On Sat, Jul 19, 2025 at 2:39 AM Manish Malhotra <
>>> [email protected]> wrote:
>>>
>>>> Thanks Steven,
>>>> +1 on this initiative. I am also interested in contributing in this area.
>>>> As you mentioned, it has quite a breadth. My thought is that we can start
>>>> a document to discuss the different layers separately, like types of
>>>> indexes, sync vs. async, spec changes, and the priority of the indexes to
>>>> be supported (instead of targeting all in one go).
>>>>
>>>> Thanks,
>>>> Manish
>>>>
>>>> On Fri, Jul 18, 2025 at 10:41 PM Steven Wu <[email protected]>
>>>> wrote:
>>>>
>>>>> Vignesh, that is yet to be discussed. We haven't got to that kind of
>>>>> detail yet.
>>>>>
>>>>> In some cases, the index files are expected to be added along with the
>>>>> data files in the same commit. Maybe some cases (like secondary indexes)
>>>>> would prefer an async process.
>>>>>
>>>>> On Fri, Jul 18, 2025 at 4:11 PM Vignesh <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Are the index files for all kinds expected to be written and added
>>>>>> along with data files or would it be an optional async step?
>>>>>>
>>>>>> On Fri, Jul 18, 2025, 5:09 AM Péter Váry <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> > *Primary Index*: Conventionally Primary Index - just means what
>>>>>>> the Table's Primary storage layout/organization was. Given that Iceberg
>>>>>>> supports Sort-order - if the Spec adds constraints to derive/influence
>>>>>>> Sort order based on the Identifier columns - it satisfies the Primary
>>>>>>> Index criteria.
>>>>>>>
>>>>>>> Here is my mental model:
>>>>>>> - Primary Key - the unique identifier for the rows
>>>>>>> - Primary Key index - database index constructed on the Primary Key
>>>>>>> column
>>>>>>> - Iceberg sort order - performance optimization used to speed up
>>>>>>> frequent, or costly queries.
>>>>>>>
>>>>>>> The Iceberg sort order is often defined over different columns than
>>>>>>> the Primary Key, so I would try to avoid mixing the two concepts.
>>>>>>>
>>>>>>> > we found that an Iceberg Table based Store Secondary Index -
>>>>>>> provides the right balance between the ability to skip over and load
>>>>>>> needed sections and yet provide the right performance benefits.
>>>>>>>
>>>>>>> Could you please elaborate on what "Iceberg Table based Store
>>>>>>> Secondary Index" means?
>>>>>>> Is this another Iceberg table with different columns and different
>>>>>>> sort order?
>>>>>>>
>>>>>>> > they want it to be in an open format, so that it can be shared
>>>>>>> with other engines!
>>>>>>>
>>>>>>> Wholeheartedly agreed!
>>>>>>>
>>>>>>> Thanks Steven for starting, and others for participating in the
>>>>>>> discussion!
>>>>>>> Peter
>>>>>>>
>>>>>>> On Tue, Jul 15, 2025 at 10:12 PM Sreeram Garlapati
>>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>>> Thanks Steven for starting this.
>>>>>>>>
>>>>>>>> I am interested in the indexing-related conversations.
>>>>>>>>
>>>>>>>> Here are some preliminary thoughts:
>>>>>>>>
>>>>>>>>    1. *Primary Index*: Conventionally Primary Index - just means
>>>>>>>>    what the Table's Primary storage layout/organization was. Given that
>>>>>>>>    Iceberg supports Sort-order - if the Spec adds constraints to
>>>>>>>>    derive/influence Sort order based on the Identifier columns - it
>>>>>>>>    satisfies the Primary Index criteria.
>>>>>>>>    2. *Secondary Index*: Secondary Index storage calls for an
>>>>>>>>    efficient organization which can hold Secondary Keys along with the
>>>>>>>>    Location of the Row and any included columns. The index can be of
>>>>>>>>    many types, based on the Data. Iceberg tables are typically very,
>>>>>>>>    very large. Hence, these Indexes also tend to be very large. Based
>>>>>>>>    on our past 1-2 years of work in this space, we found that an
>>>>>>>>    Iceberg Table based Store Secondary Index - provides the right
>>>>>>>>    balance between the ability to skip over and load needed sections
>>>>>>>>    and yet provide the right performance benefits. This decision was
>>>>>>>>    also shaped by popular opinion from many of our partners &
>>>>>>>>    customers - as building the Index involves a lot of computation,
>>>>>>>>    they want it to be in an open format, so that it can be shared with
>>>>>>>>    other engines!
>>>>>>>>    3. *Others: Full Text Search Indexes and Vector Indexes*: It is
>>>>>>>>    critical that we allow years of innovation in the space of Full
>>>>>>>>    Text Search and Vector indexes, especially with the current
>>>>>>>>    acceleration in AI adoption & the need it is driving on the Keyword
>>>>>>>>    and Similarity Search space. Given that Iceberg tables are extremely
>>>>>>>>    large, it is critical for us to provide a good story for Indexes
>>>>>>>>    that can be incrementally updated / partially loaded into memory.
>>>>>>>>
>>>>>>>>
>>>>>>>> Looking forward to the discussions.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Sreeram
>>>>>>>>
>>>>>>>> On Tue, Jul 15, 2025 at 9:33 AM Anurag Mantripragada
>>>>>>>> <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Thanks for starting this thread, Steven!
>>>>>>>>>
>>>>>>>>> I have been interested in secondary indexing in Iceberg. There was
>>>>>>>>> an old proposal on secondary indexing [1]; we may need to
>>>>>>>>> revisit/redesign these structures. I agree this is a very broad topic,
>>>>>>>>> and having indexing structures general enough to support a wide range
>>>>>>>>> of use cases will be a key challenge.
>>>>>>>>>
>>>>>>>>> I would like to get involved in any discussions related to indexing.
>>>>>>>>>
>>>>>>>>> [1] -
>>>>>>>>> https://docs.google.com/document/d/1E1ofBQoKRnX04bWT3utgyHQGaHZoelgXosk_UNsTUuQ/edit?tab=t.0
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Anurag Mantripragada
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Jul 15, 2025, at 2:37 AM, Maximilian Michels <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Thanks Steven for the summary. It would be great to extend the
>>>>>>>>> Iceberg spec with index files, such that they can be used for the
>>>>>>>>> different use cases.
>>>>>>>>>
>>>>>>>>> For my understanding, let me further outline the different types
>>>>>>>>> of use cases for index files:
>>>>>>>>>
>>>>>>>>> ---
>>>>>>>>> Topic 1: Accelerating the resolution of equality deletes
>>>>>>>>> ---
>>>>>>>>>
>>>>>>>>> In its current form, equality deletes make it impossible to
>>>>>>>>> achieve proper merge-on-read performance in streaming reads, and they
>>>>>>>>> also add a significant performance overhead in batch pipelines.
>>>>>>>>>
>>>>>>>>> Approach (a):
>>>>>>>>> https://docs.google.com/document/d/1Jz4Fjt-6jRmwqbgHX_u0ohuyTB9ytDzfslS7lYraIjk/
>>>>>>>>> Converting equality deletes to positional deletes would be a great
>>>>>>>>> achievement. I'm wondering, though, if all engines will be able to
>>>>>>>>> achieve this. There is quite some runtime complexity involved. If I
>>>>>>>>> understand correctly, the index can be bootstrapped via table
>>>>>>>>> maintenance tasks, then has to be maintained by the streaming writer.
>>>>>>>>>
>>>>>>>>> Approach (b):
>>>>>>>>> https://lists.apache.org/thread/gjjr30txq318qp6pff3x5fx1jmdnr6fv
>>>>>>>>> This would boost the resolution of equality deletes during reads
>>>>>>>>> via indices. The indices can be built via maintenance tasks, or
>>>>>>>>> directly by the writer as in (a). But how to keep the index fresh if
>>>>>>>>> we don't write the index at the writers? Readers won't always be able
>>>>>>>>> to use an up-to-date index, making this less suitable for streaming
>>>>>>>>> reads.
>>>>>>>>>
>>>>>>>>> ---
>>>>>>>>> Topic 2: Full text search in table scans
>>>>>>>>> ---
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> https://docs.google.com/document/d/1bMACRCJBB8ycSXCFbP_BdCbFCAegRoxr2O2NXZirOmY/edit
>>>>>>>>> Adding full-text search would broaden Iceberg’s applicability,
>>>>>>>>> enabling new search use cases and making table scans far more
>>>>>>>>> powerful.
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Max
>>>>>>>>>
>>>>>>>>> On Wed, Jul 9, 2025 at 11:35 PM Steven Wu <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Similar to other V4 threads, I am starting a thread to gauge
>>>>>>>>>> interest in adding index support in Iceberg V4 and gather a focus
>>>>>>>>>> group in this area.
>>>>>>>>>>
>>>>>>>>>> There have been a few discussions related to indexing recently.
>>>>>>>>>>
>>>>>>>>>>    - Peter Vary and I are working on a proposal (WIP) to
>>>>>>>>>>    only write position deletes in the Flink streaming writer. It
>>>>>>>>>>    would need a primary key index to work reasonably efficiently. [1]
>>>>>>>>>>    - Xiaoxuan Li has a proposal to leverage index files to
>>>>>>>>>>    improve merge-on-read performance with equality deletes. [2]
>>>>>>>>>>    - pengzhiwei has a proposal to support full-text index and
>>>>>>>>>>    vector index. [3]
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> *Idea: index files*
>>>>>>>>>>
>>>>>>>>>> To support those use cases, Iceberg can add support for index
>>>>>>>>>> files (in addition to data files and delete files). It should be
>>>>>>>>>> general enough to support different forms of indexing.
>>>>>>>>>>
>>>>>>>>>>    - Primary key index
>>>>>>>>>>    - Secondary index
>>>>>>>>>>    - Full text index
>>>>>>>>>>    - Vector index
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> This email is a starting point. It is a large topic. A lot of
>>>>>>>>>> discussions and maturation of the ideas are needed before a formal
>>>>>>>>>> proposal.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Steven
>>>>>>>>>>
>>>>>>>>>> [1]
>>>>>>>>>> https://docs.google.com/document/d/1Jz4Fjt-6jRmwqbgHX_u0ohuyTB9ytDzfslS7lYraIjk/
>>>>>>>>>> (WIP)
>>>>>>>>>> [2]
>>>>>>>>>> https://lists.apache.org/thread/j4zl44g6dllzzyg9ln45pvgoosfhxqrq
>>>>>>>>>> [3] https://github.com/apache/iceberg/issues/12636
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>
>> --
>> Xinli Shang
>>
>
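
For readers following Topic 1 in the quoted thread, here is a minimal sketch
of how a primary key index could let a streaming writer emit position deletes
instead of equality deletes. This is only an illustration under stated
assumptions: the names RowLocation, PrimaryKeyIndex and write_upsert are
hypothetical, the index is a plain in-memory dict, and nothing here reflects
the actual proposal, the Iceberg spec, or any Iceberg API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RowLocation:
    """Hypothetical pointer to a row: a data file path plus a row position."""
    data_file: str
    position: int


class PrimaryKeyIndex:
    """Hypothetical in-memory index from primary key to current row location."""

    def __init__(self):
        self._locations = {}

    def upsert(self, key, location):
        """Record the new location and return the previous one, if any."""
        previous = self._locations.get(key)
        self._locations[key] = location
        return previous

    def delete(self, key):
        """Forget the key and return its last known location, if any."""
        return self._locations.pop(key, None)


def write_upsert(index, key, data_file, position):
    """Return the position deletes needed when a new row version is written.

    With the index, the writer knows where the old row version lives, so it
    can point at that exact (file, position) rather than emitting an equality
    delete that readers must resolve by matching keys at read time.
    """
    old = index.upsert(key, RowLocation(data_file, position))
    return [] if old is None else [old]
```

Usage, under the same assumptions: the first write of a key yields no deletes,
a later write of the same key yields one position delete for the old row.
The overhead concerns raised in the thread (memory usage, index updates) are
visible even in this toy: the dict holds one entry per live key and must be
updated on every write.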
