Re: [DISCUSS] V4 - indexing support

Steven Wu Fri, 18 Jul 2025 22:41:22 -0700

Vignesh, that is yet to be discussed. We haven't got to that kind of detail
yet.


In some cases, the index files are expected to be added along with the data
files in the same commit. In other cases (such as secondary indexes), an
asynchronous process may be preferable.

On Fri, Jul 18, 2025 at 4:11 PM Vignesh <vignesh.v...@gmail.com> wrote:

> Are index files of all kinds expected to be written and added along with
> the data files, or would that be an optional async step?
>
> On Fri, Jul 18, 2025, 5:09 AM Péter Váry <peter.vary.apa...@gmail.com>
> wrote:
>
>> > *Primary Index*: Conventionally, a Primary Index simply reflects the
>> table's primary storage layout/organization. Given that Iceberg supports
>> sort order, if the spec adds constraints to derive/influence the sort order
>> from the identifier columns, it satisfies the Primary Index criteria.
>>
>> Here is my mental model:
>> - Primary Key - the unique identifier for the rows
>> - Primary Key index - database index constructed on the Primary Key column
>> - Iceberg sort order - performance optimization used to speed up
>> frequent, or costly queries.
>>
>> The Iceberg sort order is often defined on different columns than the
>> Primary Key, so I would try to avoid mixing the two concepts.
>>
>> > we found that an Iceberg Table based Store Secondary Index - provides
>> the right balance between the ability to skip over and load needed sections
>> and yet provide the right performance benefits.
>>
>> Could you please elaborate on what "Iceberg Table based Store Secondary
>> Index" means?
>> Is this another Iceberg table with different columns and different sort
>> order?
>>
>> > they want it to be in an open format, so that it can be shared with
>> other engines!
>>
>> Wholeheartedly agreed!
>>
>> Thanks Steven for starting, and others for participating in the
>> discussion!
>> Peter
>>
>> On Tue, Jul 15, 2025 at 10:12 PM Sreeram Garlapati
>> <gsreeramku...@gmail.com> wrote:
>>
>>> Thanks Steven for starting this.
>>>
>>> I am interested in the indexing-related conversations.
>>>
>>> Here are some preliminary thoughts:
>>>
>>>    1. *Primary Index*: Conventionally, a Primary Index simply reflects
>>>    the table's primary storage layout/organization. Given that Iceberg
>>>    supports sort order, if the spec adds constraints to derive/influence
>>>    the sort order from the identifier columns, it satisfies the Primary
>>>    Index criteria.
>>>    2. *Secondary Index*: Secondary Index storage calls for an efficient
>>>    organization that can hold the secondary keys along with the location
>>>    of the row and any included columns. The index can be of many types,
>>>    depending on the data. Iceberg tables are typically very large, so
>>>    these indexes also tend to be very large. Based on our past 1-2 years
>>>    of work in this space, we found that an Iceberg Table based Store
>>>    Secondary Index - provides the right balance between the ability to
>>>    skip over and load needed sections and yet provide the right
>>>    performance benefits. This decision was also shaped by popular opinion
>>>    from many of our partners & customers: because building the index is
>>>    computationally expensive, they want it to be in an open format, so
>>>    that it can be shared with other engines!
>>>    3. *Others: Full Text Search Indexes and Vector Indexes*: It is
>>>    critical that we allow for years of innovation in the space of Full
>>>    Text Search and Vector indexes, especially with the current
>>>    acceleration in AI adoption and the demand it is driving in the
>>>    keyword and similarity search space. Given that Iceberg tables are
>>>    extremely large, it is critical for us to provide a good story for
>>>    indexes that can be incrementally updated / partially loaded into
>>>    memory.
>>>
>>>
>>> Looking forward to the discussions.
>>>
>>> Best,
>>> Sreeram
>>>
>>> On Tue, Jul 15, 2025 at 9:33 AM Anurag Mantripragada
>>> <amantriprag...@apple.com.invalid> wrote:
>>>
>>>> Thanks for starting this thread, Steven!
>>>>
>>>> I have been interested in secondary indexing in Iceberg. There was an
>>>> old proposal for secondary indexing [1]; we may need to revisit/redesign
>>>> these structures. I agree this is a very broad topic and having indexing
>>>> structures general enough to support a wide range of use-cases will be a
>>>> key challenge.
>>>>
>>>> I would like to get involved in any discussions related to indexing.
>>>>
>>>> [1] -
>>>> https://docs.google.com/document/d/1E1ofBQoKRnX04bWT3utgyHQGaHZoelgXosk_UNsTUuQ/edit?tab=t.0
>>>>
>>>>
>>>> Thanks,
>>>> Anurag Mantripragada
>>>>
>>>>
>>>> On Jul 15, 2025, at 2:37 AM, Maximilian Michels <m...@apache.org> wrote:
>>>>
>>>> Thanks Steven for the summary. It would be great to extend the Iceberg
>>>> spec with index files, such that they can be used for the different use
>>>> cases.
>>>>
>>>> For my understanding, let me further outline the different types of use
>>>> cases for index files:
>>>>
>>>> ---
>>>> Topic 1: Accelerating the resolution of equality deletes
>>>> ---
>>>>
>>>> In their current form, equality deletes make it impossible to achieve
>>>> proper merge-on-read performance in streaming reads, and they also add a
>>>> significant performance overhead in batch pipelines.
>>>>
>>>> Approach (a):
>>>> https://docs.google.com/document/d/1Jz4Fjt-6jRmwqbgHX_u0ohuyTB9ytDzfslS7lYraIjk/
>>>> Converting equality deletes to positional deletes would be a great
>>>> achievement. I'm wondering, though, whether all engines will be able to
>>>> do this, since quite some runtime complexity is involved. If I
>>>> understand correctly, the index can be bootstrapped via table maintenance
>>>> tasks, and then has to be maintained by the streaming writer.
>>>>
>>>> Approach (b):
>>>> https://lists.apache.org/thread/gjjr30txq318qp6pff3x5fx1jmdnr6fv
>>>> This would boost the resolution of equality deletes during reads via
>>>> indices. The indices can be built via maintenance tasks, or directly by the
>>>> writer as in (a). But how do we keep the index fresh if the writers don't
>>>> write it? Readers won't always be able to use an
>>>> up-to-date index, making this less suitable for streaming reads.
>>>>
>>>> ---
>>>> Topic 2: Full text search in table scans
>>>> ---
>>>>
>>>>
>>>> https://docs.google.com/document/d/1bMACRCJBB8ycSXCFbP_BdCbFCAegRoxr2O2NXZirOmY/edit
>>>> Adding full-text search would broaden Iceberg’s applicability, enabling
>>>> new search use cases and making table scans far more powerful.
>>>>
>>>> Cheers,
>>>> Max
>>>>
>>>> On Wed, Jul 9, 2025 at 11:35 PM Steven Wu <stevenz...@gmail.com> wrote:
>>>>
>>>>>
>>>>> Similar to other V4 threads, I am starting a thread to gauge interest
>>>>> in adding index support in Iceberg V4 and gather a focus group in this
>>>>> area.
>>>>>
>>>>> There have been a few discussions related to indexing recently.
>>>>>
>>>>>    - Peter Vary and I are working on a proposal (WIP) to only write
>>>>>    position deletes in the Flink streaming writer. It would need a
>>>>>    primary key index to work reasonably efficiently. [1]
>>>>>    - Xiaoxuan Li has a proposal to leverage index files to improve
>>>>>    merge-on-read performance with equality deletes. [2]
>>>>>    - pengzhiwei has a proposal to support full-text index and vector
>>>>>    index. [3]
>>>>>
>>>>>
>>>>> *Idea: index files*
>>>>>
>>>>> To support those use cases, Iceberg can add support for index files
>>>>> (in addition to data files and delete files). It should be general enough
>>>>> to support different forms of indexing.
>>>>>
>>>>>    - Primary key index
>>>>>    - Secondary index
>>>>>    - Full text index
>>>>>    - Vector index
>>>>>
>>>>>
>>>>> This email is a starting point. It is a large topic. A lot of
>>>>> discussions and maturation of the ideas are needed before a formal
>>>>> proposal.
>>>>>
>>>>> Thanks,
>>>>> Steven
>>>>>
>>>>> [1]
>>>>> https://docs.google.com/document/d/1Jz4Fjt-6jRmwqbgHX_u0ohuyTB9ytDzfslS7lYraIjk/
>>>>> (WIP)
>>>>> [2] https://lists.apache.org/thread/j4zl44g6dllzzyg9ln45pvgoosfhxqrq
>>>>> [3] https://github.com/apache/iceberg/issues/12636
>>>>>
>>>>>
>>>>>
>>>>
