Are the index files for all kinds expected to be written and added along with data files or would it be an optional async step?
On Fri, Jul 18, 2025, 5:09 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:

> > *Primary Index*: Conventionally Primary Index - just means what the
> > Table's Primary storage layout/organization was. Given that Iceberg
> > supports Sort-order - if the Spec adds constraints to derive/influence
> > Sort order based on the Identifier columns - it satisfies the Primary
> > Index criteria.
>
> Here is my mental model:
> - Primary Key - the unique identifier for the rows
> - Primary Key index - database index constructed on the Primary Key column
> - Iceberg sort order - performance optimization used to speed up frequent
>   or costly queries.
>
> The Iceberg sort order is often defined over different columns than the
> Primary Key, so I would try to avoid mixing the two concepts.
>
> > we found that an Iceberg Table based Store Secondary Index - provides
> > the right balance between the ability to skip over and load needed
> > sections and yet provide the right performance benefits.
>
> Could you please elaborate on what "Iceberg Table based Store Secondary
> Index" means?
> Is this another Iceberg table with different columns and a different sort
> order?
>
> > they want it to be in an open format, so that it can be shared with
> > other engines!
>
> Wholeheartedly agreed!
>
> Thanks Steven for starting, and others for participating in the discussion!
> Peter
>
> Sreeram Garlapati <gsreeramku...@gmail.com> wrote (on Tue, Jul 15, 2025,
> 22:12):
>
>> Thanks Steven for starting this.
>>
>> I am interested in the indexing-related conversations.
>>
>> Here are some preliminary thoughts:
>>
>>    1. *Primary Index*: Conventionally Primary Index - just means what
>>    the Table's Primary storage layout/organization was. Given that
>>    Iceberg supports Sort-order - if the Spec adds constraints to
>>    derive/influence Sort order based on the Identifier columns - it
>>    satisfies the Primary Index criteria.
>>    2. *Secondary Index*: Secondary Index storage calls for an efficient
>>    organization which can hold Secondary Keys along with the Location of
>>    the Row and any included columns. The index can be of many types,
>>    based on the Data. Iceberg tables are typically very, very large.
>>    Hence, these Indexes also tend to be very large. Based on our past 1-2
>>    years of work in this space, we found that an Iceberg Table based
>>    Store Secondary Index - provides the right balance between the ability
>>    to skip over and load needed sections and yet provide the right
>>    performance benefits. This decision was also shaped by popular opinion
>>    from many of our partners & customers - as the Index computation
>>    involves a lot of computation, they want it to be in an open format,
>>    so that it can be shared with other engines!
>>    3. *Others: Full Text Search Indexes and Vector Indexes*: It is
>>    critical that we allow years of innovation in the space of Full Text
>>    Search and Vector indexes, especially with the current acceleration in
>>    AI adoption & the need it is driving in the Keyword and Similarity
>>    Search space. Given that Iceberg tables are extremely large, it is
>>    critical for us to provide a good story for Indexes that can be
>>    incrementally updated / partially loaded into memory.
>>
>>
>> Looking forward to the discussions.
>>
>> Best,
>> Sreeram
>>
>> On Tue, Jul 15, 2025 at 9:33 AM Anurag Mantripragada
>> <amantriprag...@apple.com.invalid> wrote:
>>
>>> Thanks for starting this thread, Steven!
>>>
>>> I have been interested in secondary indexing in Iceberg. There was an
>>> old proposal for secondary indexing [1]; we may need to revisit/redesign
>>> these structures. I agree this is a very broad topic and having indexing
>>> structures general enough to support a wide range of use-cases will be a
>>> key challenge.
>>>
>>> I would like to get involved in any discussions related to indexing.
>>>
>>> [1] -
>>> https://docs.google.com/document/d/1E1ofBQoKRnX04bWT3utgyHQGaHZoelgXosk_UNsTUuQ/edit?tab=t.0
>>>
>>>
>>> Thanks,
>>> Anurag Mantripragada
>>>
>>>
>>> On Jul 15, 2025, at 2:37 AM, Maximilian Michels <m...@apache.org> wrote:
>>>
>>> Thanks Steven for the summary. It would be great to extend the Iceberg
>>> spec with index files, such that they can be used for the different use
>>> cases.
>>>
>>> For my understanding, let me further outline the different types of use
>>> cases for index files:
>>>
>>> ---
>>> Topic 1: Accelerating the resolution of equality deletes
>>> ---
>>>
>>> In its current form, equality deletes make it impossible to achieve
>>> proper merge-on-read performance in streaming reads, and they also add a
>>> significant performance overhead in batch pipelines.
>>>
>>> Approach (a):
>>> https://docs.google.com/document/d/1Jz4Fjt-6jRmwqbgHX_u0ohuyTB9ytDzfslS7lYraIjk/
>>> Converting equality deletes to positional deletes would be a great
>>> achievement. I'm wondering, though, if all engines will be able to
>>> achieve this. There is quite some runtime complexity involved. If I
>>> understand correctly, the index can be bootstrapped via table
>>> maintenance tasks, then has to be maintained by the streaming writer.
>>>
>>> Approach (b):
>>> https://lists.apache.org/thread/gjjr30txq318qp6pff3x5fx1jmdnr6fv
>>> This would boost the resolution of equality deletes during reads via
>>> indices. The indices can be built via maintenance tasks, or directly by
>>> the writer as in (a). But how do we keep the index fresh if we don't
>>> write the index at the writers? Readers won't always be able to use an
>>> up-to-date index, making this less suitable for streaming reads.
>>>
>>> ---
>>> Topic 2: Full text search in table scans
>>> ---
>>>
>>> https://docs.google.com/document/d/1bMACRCJBB8ycSXCFbP_BdCbFCAegRoxr2O2NXZirOmY/edit
>>> Adding full-text search would broaden Iceberg’s applicability, enabling
>>> new search use cases and making table scans far more powerful.
>>>
>>> Cheers,
>>> Max
>>>
>>> On Wed, Jul 9, 2025 at 11:35 PM Steven Wu <stevenz...@gmail.com> wrote:
>>>
>>>>
>>>> Similar to other V4 threads, I am starting a thread to gauge interest
>>>> in adding index support in Iceberg V4 and gather a focus group in this
>>>> area.
>>>>
>>>> There have been a few discussions related to indexing recently.
>>>>
>>>>    - Peter Vary and I are working on a proposal (WIP) to only write
>>>>    position deletes in the Flink streaming writer. It would need a
>>>>    primary key index to work reasonably efficiently. [1]
>>>>    - Xiaoxuan Li has a proposal to leverage index files to improve
>>>>    merge-on-read performance with equality deletes. [2]
>>>>    - pengzhiwei has a proposal to support full-text index and vector
>>>>    index. [3]
>>>>
>>>>
>>>> *Idea: index files*
>>>>
>>>> To support those use cases, Iceberg can add support for index files (in
>>>> addition to data files and delete files). It should be general enough
>>>> to support different forms of indexing.
>>>>
>>>>    - Primary key index
>>>>    - Secondary index
>>>>    - Full text index
>>>>    - Vector index
>>>>
>>>>
>>>> This email is a starting point. It is a large topic. A lot of
>>>> discussion and maturation of the ideas is needed before a formal
>>>> proposal.
>>>>
>>>> Thanks,
>>>> Steven
>>>>
>>>> [1]
>>>> https://docs.google.com/document/d/1Jz4Fjt-6jRmwqbgHX_u0ohuyTB9ytDzfslS7lYraIjk/
>>>> (WIP)
>>>> [2] https://lists.apache.org/thread/j4zl44g6dllzzyg9ln45pvgoosfhxqrq
>>>> [3] https://github.com/apache/iceberg/issues/12636
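
For readers following the thread, here is a toy sketch of why a primary key index lets a streaming writer emit position deletes instead of equality deletes, as discussed in Topic 1 and in Steven's first bullet. Everything here is hypothetical and for illustration only: `PrimaryKeyIndex`, `write_upserts`, and the file names are made up, the index lives in memory, and none of this reflects Iceberg's actual API or the proposals' designs.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Hypothetical type for illustration; not an Iceberg API.
RowLocation = Tuple[str, int]  # (data file path, row position in that file)


@dataclass
class PrimaryKeyIndex:
    """Toy in-memory index: primary key -> location of the current live row."""

    locations: Dict[str, RowLocation] = field(default_factory=dict)

    def upsert(self, key: str, new_location: RowLocation) -> Optional[RowLocation]:
        """Point the key at its new row; return the old location, if any."""
        previous = self.locations.get(key)
        self.locations[key] = new_location
        return previous


def write_upserts(
    index: PrimaryKeyIndex, data_file: str, keys: List[str]
) -> List[RowLocation]:
    """Simulate writing one data file of upserts.

    Returns the position deletes the writer would emit: the exact
    (file, position) of every row superseded by this batch. Without the
    index, the writer would only know the key and would have to emit an
    equality delete for readers to resolve later.
    """
    position_deletes: List[RowLocation] = []
    for position, key in enumerate(keys):
        previous = index.upsert(key, (data_file, position))
        if previous is not None:
            position_deletes.append(previous)
    return position_deletes


index = PrimaryKeyIndex()
first = write_upserts(index, "data-1.parquet", ["a", "b"])
second = write_upserts(index, "data-2.parquet", ["b", "c"])
print(first)   # no earlier rows to delete
print(second)  # the old row for key "b" in data-1.parquet
```

The lookup cost moves from the reader (resolving equality deletes at scan time) to the writer (one index probe per upserted key), which is why the thread asks how such an index would be bootstrapped and kept fresh.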