Re: [DISCUSS] V4 - indexing support

Xinli shang Fri, 31 Oct 2025 13:07:25 -0700

Thanks Steven for proposing this! This is the right direction to go. We
definitely see challenges in some cases without indexing support, especially
around equality deletes and point lookups. I would like to contribute as well.
One thing we need to be careful about is the overhead of the index itself,
such as memory usage and index updates.
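
To make the trade-off concrete, here is a minimal sketch of how a primary key
index could serve both cases. It is a toy illustration under my own
assumptions, not an Iceberg API: the class PrimaryKeyIndex, its methods, and
the file name are all hypothetical. It keeps one in-memory entry per row,
which is exactly the memory and update overhead mentioned above.

```python
from typing import Dict, Hashable, Iterable, List, Optional, Tuple

# Hypothetical row location: (data file path, row position within that file).
RowLocation = Tuple[str, int]


class PrimaryKeyIndex:
    """Toy in-memory primary key index (assumed design, not an Iceberg API)."""

    def __init__(self) -> None:
        # One entry per live row: this is the memory overhead of the index.
        self._locations: Dict[Hashable, RowLocation] = {}

    def add_file(self, path: str, keys: List[Hashable]) -> None:
        # Index update cost: every row written also updates the index.
        for pos, key in enumerate(keys):
            self._locations[key] = (path, pos)

    def lookup(self, key: Hashable) -> Optional[RowLocation]:
        # Point lookup: go straight to the file and position holding the key.
        return self._locations.get(key)

    def to_position_deletes(self, deleted_keys: Iterable[Hashable]) -> List[RowLocation]:
        # Resolve "delete where pk = key" into (file, position) pairs.
        # Keys that are not in the index produce no delete.
        resolved: List[RowLocation] = []
        for key in deleted_keys:
            location = self._locations.pop(key, None)
            if location is not None:
                resolved.append(location)
        return resolved


index = PrimaryKeyIndex()
index.add_file("data-1.parquet", [10, 20, 30])
point = index.lookup(20)
deletes = index.to_position_deletes([20, 99])
```

A real index would have to be persisted, partially loadable, and kept in sync
with commits, which is where most of the overhead would come from.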


Namratha, for the Parquet column index, we had one for Presto:
https://www.youtube.com/watch?v=fr_HdhMEa3s




On Fri, Oct 31, 2025 at 11:48 AM namratha mk <[email protected]> wrote:

> Hi,
>
> I see this point in the doc:
>
> *The primary key index can also be useful for point lookup.*
> But to achieve the above, we would need to store native file format
> metadata like the Parquet page index
> <https://parquet.apache.org/docs/file-format/pageindex/> in the primary
> index, which helps with fetching for the lookup use case. Have there been any
> talks in the community about this? I would like to get more opinions on this.
>
> Thanks,
> Namratha
>
> On Sat, Jul 19, 2025 at 2:39 AM Manish Malhotra <
> [email protected]> wrote:
>
>> Thanks Steven,
>> +1 on this initiative. I am also interested in contributing in this area.
>> As you mentioned, it has quite a breadth. My thought is that we can start a
>> document to discuss the different layers separately, like types of indexes,
>> sync vs async, spec changes, and the priority of the indexes to be supported
>> (instead of targeting all in one go).
>>
>> Thanks,
>> Manish
>>
>> On Fri, Jul 18, 2025 at 10:41 PM Steven Wu <[email protected]> wrote:
>>
>>> Vignesh, that is yet to be discussed. We haven't got to that kind of
>>> detail yet.
>>>
>>> In some cases, the index files are expected to be added along with the
>>> data files in the same commit. Maybe some cases (like secondary index)
>>> would prefer async process.
>>>
>>> On Fri, Jul 18, 2025 at 4:11 PM Vignesh <[email protected]> wrote:
>>>
>>>> Are the index files for all kinds expected to be written and added
>>>> along with data files or would it be an optional async step?
>>>>
>>>> On Fri, Jul 18, 2025, 5:09 AM Péter Váry <[email protected]>
>>>> wrote:
>>>>
>>>>> > *Primary Index*: Conventionally Primary Index - just means what the
>>>>> Table's Primary storage layout/organization was. Given that Iceberg
>>>>> supports Sort-order - if the Spec adds constraints to derive/influence
>>>>> Sort order based on the Identifier columns - it satisfies the Primary
>>>>> Index criteria.
>>>>>
>>>>> Here is my mental model:
>>>>> - Primary Key - the unique identifier for the rows
>>>>> - Primary Key index - database index constructed on the Primary Key
>>>>> column
>>>>> - Iceberg sort order - performance optimization used to speed up
>>>>> frequent, or costly queries.
>>>>>
>>>>> The Iceberg sort order is often defined over different columns than
>>>>> the Primary Key, so I would try to avoid mixing the two concepts.
>>>>>
>>>>> > we found that an Iceberg Table based Store Secondary Index -
>>>>> provides the right balance between the ability to skip over and load
>>>>> needed sections and yet provide the right performance benefits.
>>>>>
>>>>> Could you please elaborate on what "Iceberg Table based Store
>>>>> Secondary Index" means?
>>>>> Is this another Iceberg table with different columns and different
>>>>> sort order?
>>>>>
>>>>> > they want it to be in an open format, so that it can be shared with
>>>>> other engines!
>>>>>
>>>>> Wholeheartedly agreed!
>>>>>
>>>>> Thanks Steven for starting, and others for participating in the
>>>>> discussion!
>>>>> Peter
>>>>>
>>>>> On Tue, Jul 15, 2025 at 22:12, Sreeram Garlapati <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Thanks Steven for starting this.
>>>>>>
>>>>>> I am interested in the indexing-related conversations.
>>>>>>
>>>>>> Here are some preliminary thoughts:
>>>>>>
>>>>>>    1. *Primary Index*: Conventionally Primary Index - just means
>>>>>>    what the Table's Primary storage layout/organization was. Given that
>>>>>>    Iceberg supports Sort-order - if the Spec adds constraints to
>>>>>>    derive/influence Sort order based on the Identifier columns - it
>>>>>>    satisfies the Primary Index criteria.
>>>>>>    2. *Secondary Index*: Secondary Index storage calls for an
>>>>>>    efficient organization which can hold Secondary Keys along with the
>>>>>>    Location of the Row and any included columns. The index can be of many
>>>>>>    types, based on the Data. Iceberg tables are typically very, very
>>>>>>    large. Hence, these Indexes also tend to be very large. Based on our
>>>>>>    past 1-2 years of work in this space, we found that an Iceberg Table
>>>>>>    based Store Secondary Index - provides the right balance between the
>>>>>>    ability to skip over and load needed sections and yet provide the
>>>>>>    right performance benefits. This decision was also shaped by popular
>>>>>>    opinion from many of our partners & customers - as the Index
>>>>>>    computation involves a lot of computation, they want it to be in an
>>>>>>    open format, so that it can be shared with other engines!
>>>>>>    3. *Others: Full Text Search Indexes and Vector Indexes*: It is
>>>>>>    critical that we allow years of innovation in the space of Full Text
>>>>>>    Search and Vector indexes, especially with the current acceleration in
>>>>>>    AI adoption & the need it is driving on the Keyword and Similarity
>>>>>>    Search space. Given that Iceberg tables are extremely large, it is
>>>>>>    critical for us to provide a good story for Indexes that can be
>>>>>>    incrementally updated / partially loaded into memory.
>>>>>>
>>>>>>
>>>>>> Looking forward to the discussions.
>>>>>>
>>>>>> Best,
>>>>>> Sreeram
>>>>>>
>>>>>> On Tue, Jul 15, 2025 at 9:33 AM Anurag Mantripragada
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>>> Thanks for starting this thread, Steven!
>>>>>>>
>>>>>>> I have been interested in secondary indexing in Iceberg. There was
>>>>>>> an old proposal on secondary indexing [1]; we may need to
>>>>>>> revisit/redesign these structures. I agree this is a very broad topic,
>>>>>>> and having indexing structures general enough to support a wide range
>>>>>>> of use cases will be a key challenge.
>>>>>>>
>>>>>>> I would like to get involved in any discussions related to indexing.
>>>>>>>
>>>>>>> [1] -
>>>>>>> https://docs.google.com/document/d/1E1ofBQoKRnX04bWT3utgyHQGaHZoelgXosk_UNsTUuQ/edit?tab=t.0
>>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Anurag Mantripragada
>>>>>>>
>>>>>>>
>>>>>>> On Jul 15, 2025, at 2:37 AM, Maximilian Michels <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Thanks Steven for the summary. It would be great to extend the
>>>>>>> Iceberg spec with index files, such that they can be used for the
>>>>>>> different use cases.
>>>>>>>
>>>>>>> For my understanding, let me further outline the different types of
>>>>>>> use cases for index files:
>>>>>>>
>>>>>>> ---
>>>>>>> Topic 1: Accelerating the resolution of equality deletes
>>>>>>> ---
>>>>>>>
>>>>>>> In its current form, equality deletes make it impossible to achieve
>>>>>>> proper merge-on-read performance in streaming reads, and they also add a
>>>>>>> significant performance overhead in batch pipelines.
>>>>>>>
>>>>>>> Approach (a):
>>>>>>> https://docs.google.com/document/d/1Jz4Fjt-6jRmwqbgHX_u0ohuyTB9ytDzfslS7lYraIjk/
>>>>>>> Converting equality deletes to positional deletes would be a great
>>>>>>> achievement. I'm wondering, though, if all engines will be able to
>>>>>>> achieve this. There is quite some runtime complexity involved. If I
>>>>>>> understand correctly, the index can be bootstrapped via table
>>>>>>> maintenance tasks, then has to be maintained by the streaming writer.
>>>>>>>
>>>>>>> Approach (b):
>>>>>>> https://lists.apache.org/thread/gjjr30txq318qp6pff3x5fx1jmdnr6fv
>>>>>>> This would boost the resolution of equality deletes during reads via
>>>>>>> indices. The indices can be built via maintenance tasks, or directly by
>>>>>>> the writer as in (a). But how to keep the index fresh if we don't write
>>>>>>> the index at the writers? Readers won't always be able to use an
>>>>>>> up-to-date index, making this less suitable for streaming reads.
>>>>>>>
>>>>>>> ---
>>>>>>> Topic 2: Full text search in table scans
>>>>>>> ---
>>>>>>>
>>>>>>>
>>>>>>> https://docs.google.com/document/d/1bMACRCJBB8ycSXCFbP_BdCbFCAegRoxr2O2NXZirOmY/edit
>>>>>>> Adding full-text search would broaden Iceberg’s applicability,
>>>>>>> enabling new search use cases and making table scans far more powerful.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Max
>>>>>>>
>>>>>>> On Wed, Jul 9, 2025 at 11:35 PM Steven Wu <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> Similar to other V4 threads, I am starting a thread to gauge
>>>>>>>> interest in adding index support in Iceberg V4 and gather a focus
>>>>>>>> group in this area.
>>>>>>>>
>>>>>>>> There have been a few discussions related to indexing recently.
>>>>>>>>
>>>>>>>>    - Peter Vary and I are working on a proposal (WIP) to
>>>>>>>>    only write position deletes in the Flink streaming writer. It would
>>>>>>>>    need a primary key index to work reasonably efficiently. [1]
>>>>>>>>    - Xiaoxuan Li has a proposal to leverage index files to improve
>>>>>>>>    merge-on-read performance with equality deletes. [2]
>>>>>>>>    - pengzhiwei has a proposal to support full-text index and
>>>>>>>>    vector index. [3]
>>>>>>>>
>>>>>>>>
>>>>>>>> *Idea: index files*
>>>>>>>>
>>>>>>>> To support those use cases, Iceberg can add support for index files
>>>>>>>> (in addition to data files and delete files). It should be general
>>>>>>>> enough to support different forms of indexing.
>>>>>>>>
>>>>>>>>    - Primary key index
>>>>>>>>    - Secondary index
>>>>>>>>    - Full text index
>>>>>>>>    - Vector index
>>>>>>>>
>>>>>>>>
>>>>>>>> This email is a starting point. It is a large topic. A lot of
>>>>>>>> discussions and maturation of the ideas are needed before a formal
>>>>>>>> proposal.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Steven
>>>>>>>>
>>>>>>>> [1]
>>>>>>>> https://docs.google.com/document/d/1Jz4Fjt-6jRmwqbgHX_u0ohuyTB9ytDzfslS7lYraIjk/
>>>>>>>> (WIP)
>>>>>>>> [2]
>>>>>>>> https://lists.apache.org/thread/j4zl44g6dllzzyg9ln45pvgoosfhxqrq
>>>>>>>> [3] https://github.com/apache/iceberg/issues/12636
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>

-- 
Xinli Shang
