Re: [DISCUSS] V4 - indexing support

huaxin gao Fri, 31 Oct 2025 16:49:50 -0700

Thanks, Steven, for taking the initiative. I have previously collaborated
with Miao from Adobe on secondary indexes and would like to continue that
work.


Huaxin

On Fri, Oct 31, 2025 at 1:07 PM Xinli shang <[email protected]> wrote:

> Thanks Steven for proposing this! This is the right direction to go.
> We definitely see challenges in some cases without indexing support,
> especially around equality deletes and point lookups. I would like to
> contribute as well. One thing we need to be careful about is the overhead
> of the index itself, such as memory usage and index updates.
>
> Namratha, for Parquet column index, we had one for Presto
> https://www.youtube.com/watch?v=fr_HdhMEa3s.
>
>
>
>
> On Fri, Oct 31, 2025 at 11:48 AM namratha mk <[email protected]> wrote:
>
>> Hi,
>>
>> I see this point in the doc:
>>
>> *The primary key index can also be useful for point lookup.*
>> But to achieve that, we would need to store native file format metadata,
>> like the Parquet page index
>> <https://parquet.apache.org/docs/file-format/pageindex/>, in the primary
>> index, which would help with fetching in the lookup use case. Have there
>> been any talks in the community about this? I would like to get more
>> opinions on this.
>>
>> Thanks,
>> Namratha
>>
>> On Sat, Jul 19, 2025 at 2:39 AM Manish Malhotra <
>> [email protected]> wrote:
>>
>>> Thanks Steven,
>>> +1 on this initiative. I am also interested in contributing in this area.
>>> As you mentioned, it has quite a breadth. My thought is that we can start
>>> a document to discuss the different layers separately, such as types of
>>> indexes, sync vs async, spec changes, and the priority of the indexes to
>>> be supported (instead of targeting all in one go).
>>>
>>> Thanks,
>>> Manish
>>>
>>> On Fri, Jul 18, 2025 at 10:41 PM Steven Wu <[email protected]> wrote:
>>>
>>>> Vignesh, that is yet to be discussed. We haven't got to that kind of
>>>> detail yet.
>>>>
>>>> In some cases, the index files are expected to be added along with the
>>>> data files in the same commit. Maybe some cases (like a secondary index)
>>>> would prefer an async process.
>>>>
>>>> On Fri, Jul 18, 2025 at 4:11 PM Vignesh <[email protected]> wrote:
>>>>
>>>>> Are index files of all kinds expected to be written and added
>>>>> along with data files, or would it be an optional async step?
>>>>>
>>>>> On Fri, Jul 18, 2025, 5:09 AM Péter Váry <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> > *Primary Index*: Conventionally Primary Index - just means what
>>>>>> the Table's Primary storage layout/organization was. Given that Iceberg
>>>>>> supports Sort-order - if the Spec adds constraints to derive/influence
>>>>>> Sort order based on the Identifier columns - it satisfies the Primary
>>>>>> Index criteria.
>>>>>>
>>>>>> Here is my mental model:
>>>>>> - Primary Key - the unique identifier for the rows
>>>>>> - Primary Key index - database index constructed on the Primary Key
>>>>>> column
>>>>>> - Iceberg sort order - performance optimization used to speed up
>>>>>> frequent, or costly queries.
>>>>>>
>>>>>> The Iceberg sort order is often defined on different columns than
>>>>>> the Primary Key, so I would try to avoid mixing the two concepts.
>>>>>>
>>>>>> > we found that an Iceberg Table based Store Secondary Index -
>>>>>> provides the right balance between the ability to skip over and load
>>>>>> needed sections and yet provide the right performance benefits.
>>>>>>
>>>>>> Could you please elaborate on what "Iceberg Table based Store
>>>>>> Secondary Index" means?
>>>>>> Is this another Iceberg table with different columns and different
>>>>>> sort order?
>>>>>>
>>>>>> > they want it to be in an open format, so that it can be shared with
>>>>>> other engines!
>>>>>>
>>>>>> Wholeheartedly agreed!
>>>>>>
>>>>>> Thanks Steven for starting, and others for participating in the
>>>>>> discussion!
>>>>>> Peter
>>>>>>
>>>>>> On Tue, Jul 15, 2025 at 10:12 PM Sreeram Garlapati <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks Steven for starting this.
>>>>>>>
>>>>>>> I am interested in the indexing-related conversations.
>>>>>>>
>>>>>>> Here are some preliminary thoughts:
>>>>>>>
>>>>>>>    1. *Primary Index*: Conventionally Primary Index - just means
>>>>>>>    what the Table's Primary storage layout/organization was. Given that
>>>>>>>    Iceberg supports Sort-order - if the Spec adds constraints to
>>>>>>>    derive/influence Sort order based on the Identifier columns - it
>>>>>>>    satisfies the Primary Index criteria.
>>>>>>>    2. *Secondary Index*: Secondary Index storage calls for an
>>>>>>>    efficient organization which can hold Secondary Keys along with the
>>>>>>>    Location of the Row and any included columns. The index can be of
>>>>>>>    many types, based on the Data. Iceberg tables are typically very,
>>>>>>>    very large. Hence, these Indexes also tend to be very large. Based
>>>>>>>    on our past 1-2 years of work in this space, we found that an
>>>>>>>    Iceberg Table based Store Secondary Index - provides the right
>>>>>>>    balance between the ability to skip over and load needed sections
>>>>>>>    and yet provide the right performance benefits. This decision was
>>>>>>>    also shaped by popular opinion from many of our partners &
>>>>>>>    customers - as building the Index involves a lot of computation,
>>>>>>>    they want it to be in an open format, so that it can be shared with
>>>>>>>    other engines!
>>>>>>>    3. *Others: Full Text Search Indexes and Vector Indexes*: It is
>>>>>>>    critical that we allow years of innovation in the space of Full Text
>>>>>>>    Search and Vector indexes, especially with the current acceleration
>>>>>>>    in AI adoption & the need it is driving in the Keyword and
>>>>>>>    Similarity Search space. Given that Iceberg tables are extremely
>>>>>>>    large, it is critical for us to provide a good story for Indexes
>>>>>>>    that can be incrementally updated / partially loaded into memory.
>>>>>>>
>>>>>>>
>>>>>>> Looking forward to the discussions.
>>>>>>>
>>>>>>> Best,
>>>>>>> Sreeram
>>>>>>>
>>>>>>> On Tue, Jul 15, 2025 at 9:33 AM Anurag Mantripragada
>>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>>> Thanks for starting this thread, Steven!
>>>>>>>>
>>>>>>>> I have been interested in secondary indexing in Iceberg. There was
>>>>>>>> an old proposal for secondary indexing [1]; we may need to
>>>>>>>> revisit/redesign these structures. I agree this is a very broad topic,
>>>>>>>> and having indexing structures general enough to support a wide range
>>>>>>>> of use cases will be a key challenge.
>>>>>>>>
>>>>>>>> I would like to get involved in any discussions related to indexing.
>>>>>>>>
>>>>>>>> [1] -
>>>>>>>> https://docs.google.com/document/d/1E1ofBQoKRnX04bWT3utgyHQGaHZoelgXosk_UNsTUuQ/edit?tab=t.0
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Anurag Mantripragada
>>>>>>>>
>>>>>>>>
>>>>>>>> On Jul 15, 2025, at 2:37 AM, Maximilian Michels <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Thanks Steven for the summary. It would be great to extend the
>>>>>>>> Iceberg spec with index files, such that they can be used for the
>>>>>>>> different use cases.
>>>>>>>>
>>>>>>>> For my understanding, let me further outline the different types of
>>>>>>>> use cases for index files:
>>>>>>>>
>>>>>>>> ---
>>>>>>>> Topic 1: Accelerating the resolution of equality deletes
>>>>>>>> ---
>>>>>>>>
>>>>>>>> In their current form, equality deletes make it impossible to achieve
>>>>>>>> proper merge-on-read performance in streaming reads, and they also add
>>>>>>>> a significant performance overhead in batch pipelines.
>>>>>>>>
>>>>>>>> Approach (a):
>>>>>>>> https://docs.google.com/document/d/1Jz4Fjt-6jRmwqbgHX_u0ohuyTB9ytDzfslS7lYraIjk/
>>>>>>>> Converting equality deletes to positional deletes would be a great
>>>>>>>> achievement. I'm wondering, though, if all engines will be able to
>>>>>>>> achieve this, since quite some runtime complexity is involved. If I
>>>>>>>> understand correctly, the index can be bootstrapped via table
>>>>>>>> maintenance tasks, then has to be maintained by the streaming writer.
>>>>>>>>
>>>>>>>> Approach (b):
>>>>>>>> https://lists.apache.org/thread/gjjr30txq318qp6pff3x5fx1jmdnr6fv
>>>>>>>> This would boost the resolution of equality deletes during reads
>>>>>>>> via indices. The indices can be built via maintenance tasks, or
>>>>>>>> directly by the writer as in (a). But how do we keep the index fresh
>>>>>>>> if we don't write the index at the writers? Readers won't always be
>>>>>>>> able to use an up-to-date index, making this less suitable for
>>>>>>>> streaming reads.
>>>>>>>>
>>>>>>>> ---
>>>>>>>> Topic 2: Full text search in table scans
>>>>>>>> ---
>>>>>>>>
>>>>>>>>
>>>>>>>> https://docs.google.com/document/d/1bMACRCJBB8ycSXCFbP_BdCbFCAegRoxr2O2NXZirOmY/edit
>>>>>>>> Adding full-text search would broaden Iceberg’s applicability,
>>>>>>>> enabling new search use cases and making table scans far more powerful.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Max
>>>>>>>>
>>>>>>>> On Wed, Jul 9, 2025 at 11:35 PM Steven Wu <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Similar to other V4 threads, I am starting a thread to gauge
>>>>>>>>> interest in adding index support in Iceberg V4 and gather a focus
>>>>>>>>> group in this area.
>>>>>>>>>
>>>>>>>>> There have been a few discussions related to indexing recently.
>>>>>>>>>
>>>>>>>>>    - Peter Vary and I are working on a proposal (WIP) to
>>>>>>>>>    only write position deletes in the Flink streaming writer. It
>>>>>>>>>    would need a primary key index to work reasonably efficiently. [1]
>>>>>>>>>    - Xiaoxuan Li has a proposal to leverage index files to
>>>>>>>>>    improve merge-on-read performance with equality deletes. [2]
>>>>>>>>>    - pengzhiwei has a proposal to support full-text index and
>>>>>>>>>    vector index. [3]
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> *Idea: index files*
>>>>>>>>>
>>>>>>>>> To support those use cases, Iceberg can add support for index
>>>>>>>>> files (in addition to data files and delete files). It should be
>>>>>>>>> general enough to support different forms of indexing.
>>>>>>>>>
>>>>>>>>>    - Primary key index
>>>>>>>>>    - Secondary index
>>>>>>>>>    - Full text index
>>>>>>>>>    - Vector index
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> This email is a starting point. It is a large topic. A lot of
>>>>>>>>> discussion and maturation of the ideas are needed before a formal
>>>>>>>>> proposal.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Steven
>>>>>>>>>
>>>>>>>>> [1]
>>>>>>>>> https://docs.google.com/document/d/1Jz4Fjt-6jRmwqbgHX_u0ohuyTB9ytDzfslS7lYraIjk/
>>>>>>>>> (WIP)
>>>>>>>>> [2]
>>>>>>>>> https://lists.apache.org/thread/j4zl44g6dllzzyg9ln45pvgoosfhxqrq
>>>>>>>>> [3] https://github.com/apache/iceberg/issues/12636
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>
> --
> Xinli Shang
>
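
For readers following the thread, the primary key index idea discussed
above (letting a writer emit position deletes instead of equality deletes)
can be sketched roughly as below. This is only an illustrative in-memory
model, not the design from any of the linked proposals: the names
RowPosition, PrimaryKeyIndex, upsert and delete are hypothetical, and a
real index would live in index files tracked by table metadata rather than
in a Python dict.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class RowPosition:
    """Location of a row: data file path plus ordinal position in that file."""
    file_path: str
    pos: int


class PrimaryKeyIndex:
    """Hypothetical in-memory index: primary key -> current row location."""

    def __init__(self) -> None:
        self._locations: Dict[Tuple, RowPosition] = {}

    def upsert(self, key: Tuple, new_location: RowPosition) -> Optional[RowPosition]:
        """Record the new location; return the previous one, if any.

        The returned location is what a writer would position-delete,
        instead of emitting an equality delete on the key.
        """
        old = self._locations.get(key)
        self._locations[key] = new_location
        return old

    def delete(self, key: Tuple) -> Optional[RowPosition]:
        """Drop the key; return the location to position-delete, if it existed."""
        return self._locations.pop(key, None)


index = PrimaryKeyIndex()
position_deletes: List[RowPosition] = []

# First write of id=1: nothing to delete yet.
assert index.upsert((1,), RowPosition("data-001.parquet", 0)) is None

# Update of id=1: the old row location becomes a position delete.
old = index.upsert((1,), RowPosition("data-002.parquet", 0))
if old is not None:
    position_deletes.append(old)
```

The point of the sketch is that the key lookup happens once at write time,
so readers only need to apply position deletes. The cost is the concern
raised in the thread: the index has to be bootstrapped and then kept up to
date by every writer.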
