Thanks, Steven, for taking the initiative. I have previously collaborated with Miao from Adobe on secondary index and would like to continue that work.
Huaxin

On Fri, Oct 31, 2025 at 1:07 PM Xinli shang <[email protected]> wrote:

> Thanks Steven for proposing this! This is the right direction to go. We definitely see challenges in some cases without indexing support, especially around equality deletes and point lookups. I would like to contribute as well. One thing we need to be careful about is the overhead of the index itself, such as memory usage and index updates.
>
> Namratha, for Parquet column index, we had one for Presto: https://www.youtube.com/watch?v=fr_HdhMEa3s
>
> On Fri, Oct 31, 2025 at 11:48 AM namratha mk <[email protected]> wrote:
>
>> Hi,
>>
>> I see this point in the doc:
>>
>> *The primary key index can also be useful for point lookup.*
>>
>> But to achieve the above we would need to store native file format metadata, like the Parquet page index <https://parquet.apache.org/docs/file-format/pageindex/>, in the primary index, which helps in fetching for the lookup use case. Have there been any talks in the community about this? I would like to get more opinions on this.
>>
>> Thanks,
>> Namratha
>>
>> On Sat, Jul 19, 2025 at 2:39 AM Manish Malhotra <[email protected]> wrote:
>>
>>> Thanks Steven,
>>> +1 on this initiative, I am also interested in contributing in this area. As you mentioned, it has quite a breadth. My thought is we can start a document to discuss the different layers separately, like types of indexes, sync vs async, spec changes, and priority of the indexes to be supported (instead of targeting all in one go).
>>>
>>> Thanks,
>>> Manish
>>>
>>> On Fri, Jul 18, 2025 at 10:41 PM Steven Wu <[email protected]> wrote:
>>>
>>>> Vignesh, that is yet to be discussed. We haven't got to that kind of detail yet.
>>>>
>>>> In some cases, the index files are expected to be added along with the data files in the same commit. Maybe some cases (like secondary index) would prefer an async process.
>>>>
>>>> On Fri, Jul 18, 2025 at 4:11 PM Vignesh <[email protected]> wrote:
>>>>
>>>>> Are the index files of all kinds expected to be written and added along with data files, or would it be an optional async step?
>>>>>
>>>>> On Fri, Jul 18, 2025, 5:09 AM Péter Váry <[email protected]> wrote:
>>>>>
>>>>>> > *Primary Index*: Conventionally Primary Index - just means what the Table's Primary storage layout/organization was. Given that Iceberg supports Sort-order - if the Spec adds constraints to derive/influence Sort order based on the Identifier columns - it satisfies the Primary Index criteria.
>>>>>>
>>>>>> Here is my mental model:
>>>>>> - Primary Key - the unique identifier for the rows
>>>>>> - Primary Key index - database index constructed on the Primary Key column
>>>>>> - Iceberg sort order - performance optimization used to speed up frequent or costly queries
>>>>>>
>>>>>> The Iceberg sort order is often defined on different columns than the Primary Key, so I would try to avoid mixing the two concepts.
>>>>>>
>>>>>> > we found that an Iceberg Table based Store Secondary Index - provides the right balance between the ability to skip over and load needed sections and yet provide the right performance benefits.
>>>>>>
>>>>>> Could you please elaborate on what "Iceberg Table based Store Secondary Index" means? Is this another Iceberg table with different columns and a different sort order?
>>>>>>
>>>>>> > they want it to be in an open format, so that it can be shared with other engines!
>>>>>>
>>>>>> Wholeheartedly agreed!
>>>>>>
>>>>>> Thanks Steven for starting, and others for participating in the discussion!
>>>>>> Peter
>>>>>>
>>>>>> Sreeram Garlapati <[email protected]> wrote (on Tue, Jul 15, 2025, 22:12):
>>>>>>
>>>>>>> Thanks Steven for starting this.
>>>>>>>
>>>>>>> I am interested in the indexing-related conversations.
>>>>>>>
>>>>>>> Here are some preliminary thoughts:
>>>>>>>
>>>>>>> 1. *Primary Index*: Conventionally Primary Index - just means what the Table's Primary storage layout/organization was. Given that Iceberg supports Sort-order - if the Spec adds constraints to derive/influence Sort order based on the Identifier columns - it satisfies the Primary Index criteria.
>>>>>>> 2. *Secondary Index*: Secondary Index storage calls for an efficient organization which can hold Secondary Keys along with the Location of the Row and any included columns. The index can be of many types, based on the Data. Iceberg tables are typically very, very large. Hence, these Indexes also tend to be very large. Based on our past 1-2 years of work in this space, we found that an Iceberg Table based Store Secondary Index - provides the right balance between the ability to skip over and load needed sections and yet provide the right performance benefits. This decision was also shaped by popular opinion from many of our partners & customers - as the Index computation involves a lot of computation, they want it to be in an open format, so that it can be shared with other engines!
>>>>>>> 3. *Others: Full Text Search Indexes and Vector Indexes*: It is critical that we allow years of innovation in the space of Full Text Search and Vector indexes, especially with the current acceleration in AI adoption & the need it is driving on the Keyword and Similarity Search space. Given that Iceberg tables are extremely large, it is critical for us to provide a good story for Indexes that can be incrementally updated / partially loaded into memory.
>>>>>>>
>>>>>>> Looking forward to the discussions.
>>>>>>>
>>>>>>> Best,
>>>>>>> Sreeram
>>>>>>>
>>>>>>> On Tue, Jul 15, 2025 at 9:33 AM Anurag Mantripragada <[email protected]> wrote:
>>>>>>>
>>>>>>>> Thanks for starting this thread, Steven!
>>>>>>>>
>>>>>>>> I have been interested in secondary indexing in Iceberg. There was an old proposal for secondary indexing [1]; we may need to revisit/redesign these structures. I agree this is a very broad topic, and having indexing structures general enough to support a wide range of use cases will be a key challenge.
>>>>>>>>
>>>>>>>> I would like to get involved in any discussions related to indexing.
>>>>>>>>
>>>>>>>> [1] - https://docs.google.com/document/d/1E1ofBQoKRnX04bWT3utgyHQGaHZoelgXosk_UNsTUuQ/edit?tab=t.0
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Anurag Mantripragada
>>>>>>>>
>>>>>>>> On Jul 15, 2025, at 2:37 AM, Maximilian Michels <[email protected]> wrote:
>>>>>>>>
>>>>>>>> Thanks Steven for the summary. It would be great to extend the Iceberg spec with index files, such that they can be used for the different use cases.
>>>>>>>>
>>>>>>>> For my understanding, let me further outline the different types of use cases for index files:
>>>>>>>>
>>>>>>>> ---
>>>>>>>> Topic 1: Accelerating the resolution of equality deletes
>>>>>>>> ---
>>>>>>>>
>>>>>>>> In its current form, equality deletes make it impossible to achieve proper merge-on-read performance in streaming reads, and they also add a significant performance overhead in batch pipelines.
>>>>>>>>
>>>>>>>> Approach (a):
>>>>>>>> https://docs.google.com/document/d/1Jz4Fjt-6jRmwqbgHX_u0ohuyTB9ytDzfslS7lYraIjk/
>>>>>>>> Converting equality deletes to positional deletes would be a great achievement. I'm wondering, though, if all engines will be able to achieve this. There is quite some runtime complexity involved to achieve this. If I understand correctly, the index can be bootstrapped via table maintenance tasks, then has to be maintained by the streaming writer.
>>>>>>>>
>>>>>>>> Approach (b):
>>>>>>>> https://lists.apache.org/thread/gjjr30txq318qp6pff3x5fx1jmdnr6fv
>>>>>>>> This would boost the resolution of equality deletes during reads via indices. The indices can be built via maintenance tasks, or directly by the writer as in (a). But how to keep the index fresh if we don't write the index at the writers? Readers won't always be able to use an up-to-date index, making this less suitable for streaming reads.
>>>>>>>>
>>>>>>>> ---
>>>>>>>> Topic 2: Full text search in table scans
>>>>>>>> ---
>>>>>>>>
>>>>>>>> https://docs.google.com/document/d/1bMACRCJBB8ycSXCFbP_BdCbFCAegRoxr2O2NXZirOmY/edit
>>>>>>>> Adding full-text search would broaden Iceberg’s applicability, enabling new search use cases and making table scans far more powerful.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Max
>>>>>>>>
>>>>>>>> On Wed, Jul 9, 2025 at 11:35 PM Steven Wu <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Similar to other V4 threads, I am starting a thread to gauge interest in adding index support in Iceberg V4 and gather a focus group in this area.
>>>>>>>>>
>>>>>>>>> There have been a few discussions related to indexing recently.
>>>>>>>>>
>>>>>>>>> - Peter Vary and I are working on a proposal (WIP) to only write position deletes in the Flink streaming writer. It would need a primary key index to work reasonably efficiently. [1]
>>>>>>>>> - Xiaoxuan Li has a proposal to leverage index files to improve merge-on-read performance with equality deletes. [2]
>>>>>>>>> - pengzhiwei has a proposal to support full-text index and vector index. [3]
>>>>>>>>>
>>>>>>>>> *Idea: index files*
>>>>>>>>>
>>>>>>>>> To support those use cases, Iceberg can add support for index files (in addition to data files and delete files). It should be general enough to support different forms of indexing.
>>>>>>>>>
>>>>>>>>> - Primary key index
>>>>>>>>> - Secondary index
>>>>>>>>> - Full text index
>>>>>>>>> - Vector index
>>>>>>>>>
>>>>>>>>> This email is a starting point. It is a large topic. A lot of discussions and maturation of the ideas are needed before a formal proposal.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Steven
>>>>>>>>>
>>>>>>>>> [1] https://docs.google.com/document/d/1Jz4Fjt-6jRmwqbgHX_u0ohuyTB9ytDzfslS7lYraIjk/ (WIP)
>>>>>>>>> [2] https://lists.apache.org/thread/j4zl44g6dllzzyg9ln45pvgoosfhxqrq
>>>>>>>>> [3] https://github.com/apache/iceberg/issues/12636
>
> --
> Xinli Shang
