Thanks, Steven, for raising this topic and summarizing the proposals. I would like to get involved in this area.
On Fri, Oct 31, 2025 at 4:49 PM huaxin gao <[email protected]> wrote:
> Thanks, Steven, for taking the initiative. I have previously collaborated with Miao from Adobe on secondary index and would like to continue that work.
>
> Huaxin
>
> On Fri, Oct 31, 2025 at 1:07 PM Xinli shang <[email protected]> wrote:
>
>> Thanks Steven for proposing this! This is the right direction to go. We definitely see challenges in some cases without indexing support, especially around equality deletes and point lookups. I would like to contribute as well. One thing we need to be careful about is the overhead of the index itself, such as memory usage and index updates.
>>
>> Namratha, for Parquet column index, we had one for Presto: https://www.youtube.com/watch?v=fr_HdhMEa3s.
>>
>> On Fri, Oct 31, 2025 at 11:48 AM namratha mk <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> I see this point in the doc:
>>>
>>> *The primary key index can also be useful for point lookup.*
>>> But to achieve the above, we would need to store native file format metadata, like the Parquet page index <https://parquet.apache.org/docs/file-format/pageindex/>, in the primary index, which helps in fetching for the lookup use case. Have there been any discussions in the community about this? I would like to get more opinions on this.
>>>
>>> Thanks,
>>> Namratha
>>>
>>> On Sat, Jul 19, 2025 at 2:39 AM Manish Malhotra <[email protected]> wrote:
>>>
>>>> Thanks Steven,
>>>> +1 on this initiative. I am also interested in contributing in this area.
>>>> As you mentioned, it has quite a breadth. My thought is that we can start a document to discuss the different layers separately, like types of indexes, sync vs async, spec changes, and the priority of the indexes to be supported (instead of targeting all in one go).
>>>>
>>>> Thanks,
>>>> Manish
>>>>
>>>> On Fri, Jul 18, 2025 at 10:41 PM Steven Wu <[email protected]> wrote:
>>>>
>>>>> Vignesh, that is yet to be discussed. We haven't got to that kind of detail yet.
>>>>>
>>>>> In some cases, the index files are expected to be added along with the data files in the same commit. Maybe some cases (like secondary index) would prefer an async process.
>>>>>
>>>>> On Fri, Jul 18, 2025 at 4:11 PM Vignesh <[email protected]> wrote:
>>>>>
>>>>>> Are the index files of all kinds expected to be written and added along with data files, or would it be an optional async step?
>>>>>>
>>>>>> On Fri, Jul 18, 2025, 5:09 AM Péter Váry <[email protected]> wrote:
>>>>>>
>>>>>>> > *Primary Index*: Conventionally Primary Index - just means what the Table's Primary storage layout/organization was. Given that Iceberg supports Sort-order - if the Spec adds constraints to derive/influence Sort order based on the Identifier columns - it satisfies the Primary Index criteria.
>>>>>>>
>>>>>>> Here is my mental model:
>>>>>>> - Primary Key - the unique identifier for the rows
>>>>>>> - Primary Key index - database index constructed on the Primary Key column
>>>>>>> - Iceberg sort order - performance optimization used to speed up frequent or costly queries.
>>>>>>>
>>>>>>> The Iceberg sort order is often defined on different columns than the Primary Key, so I would try to avoid mixing the two concepts.
>>>>>>>
>>>>>>> > we found that an Iceberg Table based Store Secondary Index - provides the right balance between the ability to skip over and load needed sections and yet provide the right performance benefits.
>>>>>>>
>>>>>>> Could you please elaborate on what "Iceberg Table based Store Secondary Index" means?
>>>>>>> Is this another Iceberg table with different columns and a different sort order?
>>>>>>>
>>>>>>> > they want it to be in an open format, so that it can be shared with other engines!
>>>>>>>
>>>>>>> Wholeheartedly agreed!
>>>>>>>
>>>>>>> Thanks Steven for starting, and others for participating in the discussion!
>>>>>>> Peter
>>>>>>>
>>>>>>> On Tue, Jul 15, 2025 at 10:12 PM Sreeram Garlapati <[email protected]> wrote:
>>>>>>>
>>>>>>>> Thanks Steven for starting this.
>>>>>>>>
>>>>>>>> I am interested in the indexing-related conversations.
>>>>>>>>
>>>>>>>> Here are some preliminary thoughts:
>>>>>>>>
>>>>>>>> 1. *Primary Index*: Conventionally Primary Index - just means what the Table's Primary storage layout/organization was. Given that Iceberg supports Sort-order - if the Spec adds constraints to derive/influence Sort order based on the Identifier columns - it satisfies the Primary Index criteria.
>>>>>>>> 2. *Secondary Index*: Secondary Index storage calls for an efficient organization which can hold Secondary Keys along with the Location of the Row and any included columns. The index can be of many types, based on the Data. Iceberg tables are typically very, very large. Hence, these Indexes also tend to be very large. Based on our past 1-2 years of work in this space, we found that an Iceberg Table based Store Secondary Index - provides the right balance between the ability to skip over and load needed sections and yet provide the right performance benefits. This decision was also shaped by popular opinion from many of our partners & customers - as the Index computation involves a lot of computation, they want it to be in an open format, so that it can be shared with other engines!
>>>>>>>> 3. *Others: Full Text Search Indexes and Vector Indexes*: It is critical that we allow years of innovation in the space of Full Text Search and Vector indexes, especially with the current acceleration in AI adoption & the need it is driving in the Keyword and Similarity Search space. Given that Iceberg tables are extremely large, it is critical for us to provide a good story for Indexes that can be incrementally updated / partially loaded into memory.
>>>>>>>>
>>>>>>>> Looking forward to the discussions.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Sreeram
>>>>>>>>
>>>>>>>> On Tue, Jul 15, 2025 at 9:33 AM Anurag Mantripragada <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Thanks for starting this thread, Steven!
>>>>>>>>>
>>>>>>>>> I have been interested in secondary indexing in Iceberg. There was an old proposal for secondary indexing [1]; we may need to revisit/redesign these structures. I agree this is a very broad topic, and having indexing structures general enough to support a wide range of use cases will be a key challenge.
>>>>>>>>>
>>>>>>>>> I would like to get involved in any discussions related to indexing.
>>>>>>>>>
>>>>>>>>> [1] - https://docs.google.com/document/d/1E1ofBQoKRnX04bWT3utgyHQGaHZoelgXosk_UNsTUuQ/edit?tab=t.0
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Anurag Mantripragada
>>>>>>>>>
>>>>>>>>> On Jul 15, 2025, at 2:37 AM, Maximilian Michels <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> Thanks Steven for the summary. It would be great to extend the Iceberg spec with index files, such that they can be used for the different use cases.
>>>>>>>>>
>>>>>>>>> For my understanding, let me further outline the different types of use cases for index files:
>>>>>>>>>
>>>>>>>>> ---
>>>>>>>>> Topic 1: Accelerating the resolution of equality deletes
>>>>>>>>> ---
>>>>>>>>>
>>>>>>>>> In their current form, equality deletes make it impossible to achieve proper merge-on-read performance in streaming reads, and they also add a significant performance overhead in batch pipelines.
>>>>>>>>>
>>>>>>>>> Approach (a): https://docs.google.com/document/d/1Jz4Fjt-6jRmwqbgHX_u0ohuyTB9ytDzfslS7lYraIjk/
>>>>>>>>> Converting equality deletes to positional deletes would be a great achievement. I'm wondering, though, whether all engines will be able to achieve this, since there is considerable runtime complexity involved. If I understand correctly, the index can be bootstrapped via table maintenance tasks, then has to be maintained by the streaming writer.
>>>>>>>>>
>>>>>>>>> Approach (b): https://lists.apache.org/thread/gjjr30txq318qp6pff3x5fx1jmdnr6fv
>>>>>>>>> This would boost the resolution of equality deletes during reads via indices. The indices can be built via maintenance tasks, or directly by the writer as in (a). But how do we keep the index fresh if we don't write the index at the writers? Readers won't always be able to use an up-to-date index, making this less suitable for streaming reads.
>>>>>>>>>
>>>>>>>>> ---
>>>>>>>>> Topic 2: Full text search in table scans
>>>>>>>>> ---
>>>>>>>>>
>>>>>>>>> https://docs.google.com/document/d/1bMACRCJBB8ycSXCFbP_BdCbFCAegRoxr2O2NXZirOmY/edit
>>>>>>>>> Adding full-text search would broaden Iceberg’s applicability, enabling new search use cases and making table scans far more powerful.
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Max
>>>>>>>>>
>>>>>>>>> On Wed, Jul 9, 2025 at 11:35 PM Steven Wu <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Similar to other V4 threads, I am starting a thread to gauge interest in adding index support in Iceberg V4 and gather a focus group in this area.
>>>>>>>>>>
>>>>>>>>>> There have been a few discussions related to indexing recently.
>>>>>>>>>>
>>>>>>>>>> - Peter Vary and I are working on a proposal (WIP) to only write position deletes in the Flink streaming writer. It would need a primary key index to work reasonably efficiently. [1]
>>>>>>>>>> - Xiaoxuan Li has a proposal to leverage index files to improve merge-on-read performance with equality deletes. [2]
>>>>>>>>>> - pengzhiwei has a proposal to support full-text index and vector index. [3]
>>>>>>>>>>
>>>>>>>>>> *Idea: index files*
>>>>>>>>>>
>>>>>>>>>> To support those use cases, Iceberg can add support for index files (in addition to data files and delete files). It should be general enough to support different forms of indexing.
>>>>>>>>>>
>>>>>>>>>> - Primary key index
>>>>>>>>>> - Secondary index
>>>>>>>>>> - Full text index
>>>>>>>>>> - Vector index
>>>>>>>>>>
>>>>>>>>>> This email is a starting point. It is a large topic. A lot of discussion and maturation of the ideas is needed before a formal proposal.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Steven
>>>>>>>>>>
>>>>>>>>>> [1] https://docs.google.com/document/d/1Jz4Fjt-6jRmwqbgHX_u0ohuyTB9ytDzfslS7lYraIjk/ (WIP)
>>>>>>>>>> [2] https://lists.apache.org/thread/j4zl44g6dllzzyg9ln45pvgoosfhxqrq
>>>>>>>>>> [3] https://github.com/apache/iceberg/issues/12636
>>
>> --
>> Xinli Shang
>>
>
