Are index files of all kinds expected to be written and added along
with the data files, or would writing them be an optional async step?

On Fri, Jul 18, 2025, 5:09 AM Péter Váry <peter.vary.apa...@gmail.com>
wrote:

> > *Primary Index*: Conventionally Primary Index - just means what the
> Table's Primary storage layout/organization was. Given that Iceberg
> supports Sort-order - if the Spec adds constraints to derive/influence Sort
> order based on the Identifier columns - it satisfies the Primary Index
> criteria.
>
> Here is my mental model:
> - Primary Key - the unique identifier for the rows
> - Primary Key index - database index constructed on the Primary Key column
> - Iceberg sort order - performance optimization used to speed up frequent,
> or costly queries.
>
> The Iceberg sort order is often defined on different columns than the
> Primary Key, so I would try to avoid mixing the two concepts.
>
> > we found that an Iceberg Table based Store Secondary Index - provides
> the right balance between the ability to skip over and load needed sections
> and yet provide the right performance benefits.
>
> Could you please elaborate on what "Iceberg Table based Store Secondary
> Index" means?
> Is this another Iceberg table with different columns and different sort
> order?
>
> > they want it to be in an open format, so that it can be shared with
> other engines!
>
> Wholeheartedly agreed!
>
> Thanks Steven for starting, and others for participating in the discussion!
> Peter
>
> On Tue, Jul 15, 2025 at 10:12 PM Sreeram Garlapati
> <gsreeramku...@gmail.com> wrote:
>
>> Thanks Steven for starting this.
>>
>> I am interested in the indexing-related conversations.
>>
>> Here are some preliminary thoughts:
>>
>>    1. *Primary Index*: Conventionally Primary Index - just means what
>>    the Table's Primary storage layout/organization was. Given that Iceberg
>>    supports Sort-order - if the Spec adds constraints to derive/influence
>>    Sort order based on the Identifier columns - it satisfies the Primary
>>    Index criteria.
>>    2. *Secondary Index*: Secondary Index storage calls for an efficient
>>    organization which can hold Secondary Keys along with the Location of the
>>    Row and any included columns. The index can be of many types, based on the
>>    Data. Iceberg tables are typically very, very large. Hence, these Indexes
>>    also tend to be very large. Based on our past 1-2 years of work in this
>>    space, we found that an Iceberg Table based Store Secondary Index -
>>    provides the right balance between the ability to skip over and load
>>    needed sections and yet provide the right performance benefits. This
>>    decision was also shaped by popular opinion from many of our partners &
>>    customers - as building the Index involves a lot of computation, they
>>    want it to be in an open format, so that it can be shared with other
>>    engines!
>>    3. *Others: Full Text Search Indexes and Vector Indexes*: It is
>>    critical that we allow for years of innovation in the space of Full Text
>>    Search and Vector indexes, especially with the current acceleration in AI
>>    adoption & the need it is driving in the Keyword and Similarity Search
>>    space. Given that Iceberg tables are extremely large, it is critical for
>>    us to provide a good story for Indexes that can be incrementally updated /
>>    partially loaded into memory.
>>
>>
>> Looking forward to the discussions.
>>
>> Best,
>> Sreeram
>>
>> On Tue, Jul 15, 2025 at 9:33 AM Anurag Mantripragada
>> <amantriprag...@apple.com.invalid> wrote:
>>
>>> Thanks for starting this thread, Steven!
>>>
>>> I have been interested in secondary indexing in Iceberg. There was an
>>> old proposal for secondary indexing [1]; we may need to revisit/redesign
>>> these structures. I agree this is a very broad topic, and having indexing
>>> structures general enough to support a wide range of use cases will be a
>>> key challenge.
>>>
>>> I would like to get involved in any discussions related to indexing.
>>>
>>> [1] -
>>> https://docs.google.com/document/d/1E1ofBQoKRnX04bWT3utgyHQGaHZoelgXosk_UNsTUuQ/edit?tab=t.0
>>>
>>>
>>> Thanks,
>>> Anurag Mantripragada
>>>
>>>
>>> On Jul 15, 2025, at 2:37 AM, Maximilian Michels <m...@apache.org> wrote:
>>>
>>> Thanks Steven for the summary. It would be great to extend the Iceberg
>>> spec with index files, such that they can be used for the different use
>>> cases.
>>>
>>> For my understanding, let me further outline the different types of use
>>> cases for index files:
>>>
>>> ---
>>> Topic 1: Accelerating the resolution of equality deletes
>>> ---
>>>
>>> In its current form, equality deletes make it impossible to achieve
>>> proper merge-on-read performance in streaming reads, and they also add a
>>> significant performance overhead in batch pipelines.
>>>
>>> Approach (a):
>>> https://docs.google.com/document/d/1Jz4Fjt-6jRmwqbgHX_u0ohuyTB9ytDzfslS7lYraIjk/
>>> Converting equality deletes to positional deletes would be a great
>>> achievement. I'm wondering, though, whether all engines will be able to do
>>> this, since it involves quite some runtime complexity. If I understand
>>> correctly, the index can be bootstrapped via table maintenance tasks, and
>>> then has to be maintained by the streaming writer.
>>>
>>> Approach (b):
>>> https://lists.apache.org/thread/gjjr30txq318qp6pff3x5fx1jmdnr6fv
>>> This would boost the resolution of equality deletes during reads via
>>> indices. The indices can be built via maintenance tasks, or directly by the
>>> writer as in (a). But how do we keep the index fresh if the writers don't
>>> write it? Readers won't always be able to use an up-to-date index, making
>>> this less suitable for streaming reads.
>>>
>>> ---
>>> Topic 2: Full text search in table scans
>>> ---
>>>
>>>
>>> https://docs.google.com/document/d/1bMACRCJBB8ycSXCFbP_BdCbFCAegRoxr2O2NXZirOmY/edit
>>> Adding full-text search would broaden Iceberg’s applicability, enabling
>>> new search use cases and making table scans far more powerful.
>>>
>>> Cheers,
>>> Max
>>>
>>> On Wed, Jul 9, 2025 at 11:35 PM Steven Wu <stevenz...@gmail.com> wrote:
>>>
>>>>
>>>> Similar to other V4 threads, I am starting a thread to gauge interest
>>>> in adding index support in Iceberg V4 and gather a focus group in this
>>>> area.
>>>>
>>>> There have been a few discussions related to indexing recently.
>>>>
>>>>    - Peter Vary and I are working on a proposal (WIP) to only write
>>>>    position deletes in the Flink streaming writer. It would need a primary
>>>>    key index to work reasonably efficiently. [1]
>>>>    - Xiaoxuan Li has a proposal to leverage index files to improve
>>>>    merge-on-read performance with equality deletes. [2]
>>>>    - pengzhiwei has a proposal to support full-text index and vector
>>>>    index. [3]
>>>>
>>>>
>>>> *Idea: index files*
>>>>
>>>> To support those use cases, Iceberg can add support for index files (in
>>>> addition to data files and delete files). It should be general enough to
>>>> support different forms of indexing.
>>>>
>>>>    - Primary key index
>>>>    - Secondary index
>>>>    - Full text index
>>>>    - Vector index
>>>>
>>>>
>>>> This email is a starting point. It is a large topic. A lot of
>>>> discussions and maturation of the ideas are needed before a formal
>>>> proposal.
>>>>
>>>> Thanks,
>>>> Steven
>>>>
>>>> [1]
>>>> https://docs.google.com/document/d/1Jz4Fjt-6jRmwqbgHX_u0ohuyTB9ytDzfslS7lYraIjk/
>>>> (WIP)
>>>> [2] https://lists.apache.org/thread/j4zl44g6dllzzyg9ln45pvgoosfhxqrq
>>>> [3] https://github.com/apache/iceberg/issues/12636
>>>>
>>>>
>>>>
>>>
