Re: [DISCUSS] V4 - indexing support

Péter Váry Tue, 18 Nov 2025 03:32:46 -0800

Hi Team,

Do we have any progress on this topic? I’d really like to see this move
forward.


Following Sreeram’s suggestion, we should start collecting the key use
cases we want to support with indexes. Here’s what I’ve heard so far:

   - *Primary key index*
      - Find a single or few rows by a given primary key
      - Build the Flink “primary key → file_name, position” state by bulk
      reading the primary key index
   - *Secondary index*
      - Range or min/max filtering on columns that are not part of the
      primary key (primary sort order)
   - *Full-text index*
      - Term search in text columns
   - *Vector index*
      - Nearest or approximate nearest neighbor search
   - *Geospatial index*
      - Finding points within a polygon or nearest location
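
To make the primary key use cases above concrete, here is a minimal Python
sketch of the “primary key → file_name, position” mapping. Everything in it
(the PrimaryKeyIndex class, the RowLocation tuple, the file names) is
hypothetical. It only illustrates the two access patterns, point lookup and
bulk read, and is not a proposed Iceberg API or index file format.

```python
from typing import Dict, Iterable, Iterator, NamedTuple, Optional, Tuple


class RowLocation(NamedTuple):
    file_name: str  # data file that holds the row
    position: int   # row ordinal within that file


class PrimaryKeyIndex:
    """Hypothetical in-memory stand-in for a primary key index."""

    def __init__(self) -> None:
        self._entries: Dict[str, RowLocation] = {}

    def add_file(self, file_name: str, keys: Iterable[str]) -> None:
        # Record the position of every key in a newly written data file.
        # A later write of the same key replaces the older location.
        for position, key in enumerate(keys):
            self._entries[key] = RowLocation(file_name, position)

    def lookup(self, key: str) -> Optional[RowLocation]:
        # Point lookup: find a single row by its primary key.
        return self._entries.get(key)

    def scan(self) -> Iterator[Tuple[str, RowLocation]]:
        # Bulk read: stream every entry, e.g. to bootstrap writer state.
        return iter(self._entries.items())


index = PrimaryKeyIndex()
index.add_file("data-001.parquet", ["k1", "k2", "k3"])
index.add_file("data-002.parquet", ["k2", "k4"])  # k2 was rewritten

# Point lookup of a single key.
print(index.lookup("k2"))

# Bulk read into a "primary key -> (file_name, position)" state.
state = dict(index.scan())
print(len(state))
```

A real index would live in files rather than in memory, but the same two
entry points (lookup by key, full scan) are what the use cases above need.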

We should identify a few critical use cases and keep the others in mind
when designing how we store, retrieve, and use these indexes. Personally,
I’d love to see *vector indexes in Iceberg*, enabling fast AI searches on
Iceberg tables.

For reference, I asked Copilot to collect the currently available index
types in MSSQL, Oracle, Postgres, MySQL, and LanceDB. Here’s the list:
https://docs.google.com/spreadsheets/d/14cBdwsOw89ivolHtAw342YNoGmb1-Kri1E80hwWymL0

Thanks,

Peter


Aihua Xu <[email protected]> wrote (on Sun, Nov 2, 2025, 4:11):

> Thanks Steven for raising this topic and giving a summary on the
> proposals. I would like to get involved in this area.
>
> On Fri, Oct 31, 2025 at 4:49 PM huaxin gao <[email protected]> wrote:
>
>> Thanks, Steven, for taking the initiative. I have previously collaborated
>> with Miao from Adobe on secondary index and would like to continue that
>> work.
>>
>> Huaxin
>>
>> On Fri, Oct 31, 2025 at 1:07 PM Xinli shang <[email protected]>
>> wrote:
>>
>>> Thanks Steven for proposing this! This is the right direction to go.
>>> We definitely see challenges in some cases without indexing support,
>>> especially around equality deletes and point lookups. I would like to
>>> contribute as well. One thing we need to be careful about is the
>>> overhead of the index itself, such as memory usage and index updates.
>>>
>>> Namratha, for Parquet column index, we had one for Presto
>>> https://www.youtube.com/watch?v=fr_HdhMEa3s.
>>>
>>>
>>>
>>>
>>> On Fri, Oct 31, 2025 at 11:48 AM namratha mk <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> I see this point in the doc:
>>>>
>>>> *The primary key index can also be useful for point lookup.*
>>>> But to achieve this, we would need to store native file format
>>>> metadata, like the Parquet page index
>>>> <https://parquet.apache.org/docs/file-format/pageindex/>, in the
>>>> primary index to help with fetching in the lookup use case. Have there
>>>> been any discussions in the community about this? I would like to get
>>>> more opinions on this.
>>>>
>>>> Thanks,
>>>> Namratha
>>>>
>>>> On Sat, Jul 19, 2025 at 2:39 AM Manish Malhotra <
>>>> [email protected]> wrote:
>>>>
>>>>> Thanks Steven,
>>>>> +1 on this initiative. I am also interested in contributing in this
>>>>> area.
>>>>> As you mentioned, it has quite a breadth. My thought is that we can
>>>>> start a document to discuss the different layers separately, such as
>>>>> types of indexes, sync vs async, spec changes, and the priority of the
>>>>> indexes to be supported (instead of targeting all in one go).
>>>>>
>>>>> Thanks,
>>>>> Manish
>>>>>
>>>>> On Fri, Jul 18, 2025 at 10:41 PM Steven Wu <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Vignesh, that is yet to be discussed. We haven't got to that kind of
>>>>>> detail yet.
>>>>>>
>>>>>> In some cases, the index files are expected to be added along with
>>>>>> the data files in the same commit. Some cases (like secondary
>>>>>> indexes) may prefer an async process.
>>>>>>
>>>>>> On Fri, Jul 18, 2025 at 4:11 PM Vignesh <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Are index files of all kinds expected to be written and added
>>>>>>> along with data files, or would it be an optional async step?
>>>>>>>
>>>>>>> On Fri, Jul 18, 2025, 5:09 AM Péter Váry <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> > *Primary Index*: Conventionally Primary Index - just means what
>>>>>>>> the Table's Primary storage layout/organization was. Given that Iceberg
>>>>>>>> supports Sort-order - if the Spec adds constraints to derive/influence
>>>>>>>> Sort order based on the Identifier columns - it satisfies the Primary
>>>>>>>> Index criteria.
>>>>>>>>
>>>>>>>> Here is my mental model:
>>>>>>>> - Primary Key - the unique identifier for the rows
>>>>>>>> - Primary Key index - database index constructed on the Primary Key
>>>>>>>> column
>>>>>>>> - Iceberg sort order - performance optimization used to speed up
>>>>>>>> frequent, or costly queries.
>>>>>>>>
>>>>>>>> The Iceberg sort order is often defined on different columns
>>>>>>>> than the Primary Key, so I would try to avoid mixing the two concepts.
>>>>>>>>
>>>>>>>> > we found that an Iceberg Table based Store Secondary Index -
>>>>>>>> provides the right balance between the ability to skip over and load
>>>>>>>> needed sections and yet provide the right performance benefits.
>>>>>>>>
>>>>>>>> Could you please elaborate on what "Iceberg Table based Store
>>>>>>>> Secondary Index" means?
>>>>>>>> Is this another Iceberg table with different columns and different
>>>>>>>> sort order?
>>>>>>>>
>>>>>>>> > they want it to be in an open format, so that it can be shared
>>>>>>>> with other engines!
>>>>>>>>
>>>>>>>> Wholeheartedly agreed!
>>>>>>>>
>>>>>>>> Thanks Steven for starting, and others for participating in the
>>>>>>>> discussion!
>>>>>>>> Peter
>>>>>>>>
>>>>>>>> Sreeram Garlapati <[email protected]> wrote (on Tue,
>>>>>>>> Jul 15, 2025, 22:12):
>>>>>>>>
>>>>>>>>> Thanks Steven for starting this.
>>>>>>>>>
>>>>>>>>> I am interested in the indexing-related conversations.
>>>>>>>>>
>>>>>>>>> Here are some preliminary thoughts:
>>>>>>>>>
>>>>>>>>>    1. *Primary Index*: Conventionally Primary Index - just means
>>>>>>>>>    what the Table's Primary storage layout/organization was. Given
>>>>>>>>>    that Iceberg supports Sort-order - if the Spec adds constraints to
>>>>>>>>>    derive/influence Sort order based on the Identifier columns - it
>>>>>>>>>    satisfies the Primary Index criteria.
>>>>>>>>>    2. *Secondary Index*: Secondary Index storage calls for an
>>>>>>>>>    efficient organization which can hold Secondary Keys along with the
>>>>>>>>>    Location of the Row and any included columns. The index can be of
>>>>>>>>>    many types, based on the Data. Iceberg tables are typically very,
>>>>>>>>>    very large. Hence, these Indexes also tend to be very large. Based
>>>>>>>>>    on our past 1-2 years of work in this space, we found that an
>>>>>>>>>    Iceberg Table based Store Secondary Index - provides the right
>>>>>>>>>    balance between the ability to skip over and load needed sections
>>>>>>>>>    and yet provide the right performance benefits. This decision was
>>>>>>>>>    also shaped by popular opinion from many of our partners &
>>>>>>>>>    customers - as the Index computation involves a lot of computation,
>>>>>>>>>    they want it to be in an open format, so that it can be shared with
>>>>>>>>>    other engines!
>>>>>>>>>    3. *Others: Full Text Search Indexes and Vector Indexes*: It
>>>>>>>>>    is critical that we allow years of innovation in the space of Full
>>>>>>>>>    Text Search and Vector indexes, especially with the current
>>>>>>>>>    acceleration in AI adoption & the need it is driving on the Keyword
>>>>>>>>>    and Similarity Search space. Given that Iceberg tables are extremely
>>>>>>>>>    large, it is critical for us to provide a good story for Indexes
>>>>>>>>>    that can be incrementally updated / partially loaded into memory.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Looking forward to the discussions.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Sreeram
>>>>>>>>>
>>>>>>>>> On Tue, Jul 15, 2025 at 9:33 AM Anurag Mantripragada
>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks for starting this thread, Steven!
>>>>>>>>>>
>>>>>>>>>> I have been interested in secondary indexing in Iceberg. There
>>>>>>>>>> was an old proposal for secondary indexing [1]; we may need to
>>>>>>>>>> revisit/redesign these structures. I agree this is a very broad
>>>>>>>>>> topic, and having indexing structures general enough to support a
>>>>>>>>>> wide range of use cases will be a key challenge.
>>>>>>>>>>
>>>>>>>>>> I would like to get involved in any discussions related to indexing.
>>>>>>>>>>
>>>>>>>>>> [1] -
>>>>>>>>>> https://docs.google.com/document/d/1E1ofBQoKRnX04bWT3utgyHQGaHZoelgXosk_UNsTUuQ/edit?tab=t.0
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Anurag Mantripragada
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Jul 15, 2025, at 2:37 AM, Maximilian Michels <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Thanks Steven for the summary. It would be great to extend the
>>>>>>>>>> Iceberg spec with index files, such that they can be used for the
>>>>>>>>>> different use cases.
>>>>>>>>>>
>>>>>>>>>> For my understanding, let me further outline the different types
>>>>>>>>>> of use cases for index files:
>>>>>>>>>>
>>>>>>>>>> ---
>>>>>>>>>> Topic 1: Accelerating the resolution of equality deletes
>>>>>>>>>> ---
>>>>>>>>>>
>>>>>>>>>> In its current form, equality deletes make it impossible to
>>>>>>>>>> achieve proper merge-on-read performance in streaming reads, and
>>>>>>>>>> they also add a significant performance overhead in batch pipelines.
>>>>>>>>>>
>>>>>>>>>> Approach (a):
>>>>>>>>>> https://docs.google.com/document/d/1Jz4Fjt-6jRmwqbgHX_u0ohuyTB9ytDzfslS7lYraIjk/
>>>>>>>>>> Converting equality deletes to positional deletes would be a
>>>>>>>>>> great achievement. I'm wondering, though, whether all engines will be
>>>>>>>>>> able to achieve this, since there is quite some runtime complexity
>>>>>>>>>> involved. If I understand correctly, the index can be bootstrapped
>>>>>>>>>> via table maintenance tasks, and then has to be maintained by the
>>>>>>>>>> streaming writer.
>>>>>>>>>>
>>>>>>>>>> Approach (b):
>>>>>>>>>> https://lists.apache.org/thread/gjjr30txq318qp6pff3x5fx1jmdnr6fv
>>>>>>>>>> This would boost the resolution of equality deletes during reads
>>>>>>>>>> via indices. The indices can be built via maintenance tasks, or
>>>>>>>>>> directly by the writer as in (a). But how to keep the index fresh if
>>>>>>>>>> we don't write the index at the writers? Readers won't always be able
>>>>>>>>>> to use an up-to-date index, making this less suitable for streaming
>>>>>>>>>> reads.
>>>>>>>>>>
>>>>>>>>>> ---
>>>>>>>>>> Topic 2: Full text search in table scans
>>>>>>>>>> ---
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> https://docs.google.com/document/d/1bMACRCJBB8ycSXCFbP_BdCbFCAegRoxr2O2NXZirOmY/edit
>>>>>>>>>> Adding full-text search would broaden Iceberg’s applicability,
>>>>>>>>>> enabling new search use cases and making table scans far more
>>>>>>>>>> powerful.
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>> Max
>>>>>>>>>>
>>>>>>>>>> On Wed, Jul 9, 2025 at 11:35 PM Steven Wu <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Similar to other V4 threads, I am starting a thread to gauge
>>>>>>>>>>> interest in adding index support in Iceberg V4 and gather a focus
>>>>>>>>>>> group in this area.
>>>>>>>>>>>
>>>>>>>>>>> There have been a few discussions related to indexing recently.
>>>>>>>>>>>
>>>>>>>>>>>    - Peter Vary and I are working on a proposal (WIP) to
>>>>>>>>>>>    only write position deletes in the Flink streaming writer. It
>>>>>>>>>>>    would need a primary key index to work reasonably efficiently. [1]
>>>>>>>>>>>    - Xiaoxuan Li has a proposal to leverage index files to
>>>>>>>>>>>    improve merge-on-read performance with equality deletes. [2]
>>>>>>>>>>>    - pengzhiwei has a proposal to support full-text index and
>>>>>>>>>>>    vector index. [3]
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> *Idea: index files*
>>>>>>>>>>>
>>>>>>>>>>> To support those use cases, Iceberg can add support for index
>>>>>>>>>>> files (in addition to data files and delete files). It should be
>>>>>>>>>>> general enough to support different forms of indexing.
>>>>>>>>>>>
>>>>>>>>>>>    - Primary key index
>>>>>>>>>>>    - Secondary index
>>>>>>>>>>>    - Full text index
>>>>>>>>>>>    - Vector index
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> This email is a starting point. It is a large topic. A lot of
>>>>>>>>>>> discussions and maturation of the ideas are needed before a formal
>>>>>>>>>>> proposal.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Steven
>>>>>>>>>>>
>>>>>>>>>>> [1]
>>>>>>>>>>> https://docs.google.com/document/d/1Jz4Fjt-6jRmwqbgHX_u0ohuyTB9ytDzfslS7lYraIjk/
>>>>>>>>>>> (WIP)
>>>>>>>>>>> [2]
>>>>>>>>>>> https://lists.apache.org/thread/j4zl44g6dllzzyg9ln45pvgoosfhxqrq
>>>>>>>>>>> [3] https://github.com/apache/iceberg/issues/12636
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>
>>> --
>>> Xinli Shang
>>>
>>
