Sorry for the delayed response. I agree that this proposal could get us one step closer to removing equality deletes. It can also verify whether the index solution can scale for streaming CDC/Upsert writes.

> For this PR, we opted to use deletion vectors over regular delete files due to their efficiency in terms of space and lookup.

I didn't quite get this from the PR description. Does this mean it is limited to V3 tables?

Thanks for working on this, Max!

On Thu, Apr 16, 2026 at 8:03 AM Maximilian Michels <[email protected]> wrote:

> Thanks everyone for the discussion and support.
>
> I've opened a PR which implements what we discussed here:
> https://github.com/apache/iceberg/pull/15996
>
> Just to recap:
>
> The EqualityDeleteConverter (EDC) will be running in-line with the
> writer, which produces the equality deletes to a staging branch. We
> monitor the staging branch for new commits, build a sharded primary
> key index in Flink state (backed by RocksDB for large tables), resolve
> equality deletes against the index, and commit back the resulting data
> files + DVs to the main branch.
>
> The PR is split into several commits, which we can break out into
> separate PRs for easier review. There are some limitations and
> follow-ups listed in the PR. The biggest gaps are preserving row
> lineage and lifecycle management of the staging branch. In a
> follow-up, we will also add integration with the Flink IcebergSink. I
> tried to keep the scope limited to the EDC maintenance task for the
> first PR.
>
> Cheers,
> Max
>
> On Thu, Apr 2, 2026 at 9:43 AM Márton Balassi <[email protected]> wrote:
> >
> > Thanks for raising this, Max, and for the feedback, Peter and Manu.
> >
> > I am supportive of this proposal, especially with the clearly defined vision of eventually completely removing the need for equality deletes.
> >
> > Lifting the reliance on equality deletes in the Flink write path would be a significant improvement, both in terms of read efficiency (by moving towards deletion vectors) and in terms of new capabilities, as it would make writing upserts from Flink a viable path to explore going forward.
> >
> > Cheers,
> > Marton
> >
> > On 2026/03/20 11:15:26 Maximilian Michels wrote:
> > > Thanks Peter and Manu for the feedback.
> > >
> > > @Peter:
> > >
> > > Good point on the end goal. The end goal should be to completely
> > > remove equality deletes.
> > >
> > > While the staging branch, which contains the equality deletes, is an
> > > internal implementation detail of the Flink writer, it will still be
> > > accessible via the Iceberg reader API. For the transition period, I
> > > think this has several advantages:
> > > 1. We don't need to fundamentally change the write logic of existing writers.
> > > 2. We still allow for the data to be inspected before converting it
> > > and merging it to the main branch. This is also helpful for
> > > troubleshooting.
> > >
> > > The staging branch solution is a first step towards removing equality deletes.
> > >
> > > In V4, we could already deprecate equality deletes. Once the spec
> > > includes indices, we can move the index into Iceberg, which should
> > > make it easier to develop an in-place resolution of equality deletes
> > > supporting multiple writers and conflict resolution. Admittedly, we
> > > haven't fully figured out the best in-place approach. I think it is a
> > > good idea to take it one step at a time.
> > >
> > > On row lineage: If we want to preserve the row id of updated rows, we
> > > will have to store the row id in the primary key index. Theoretically,
> > > we should be able to then add it to the corresponding new row. The
> > > question is how to do that efficiently, such that we don't have to
> > > rewrite any data files. We would need some way to map the row id of
> > > the newly inserted row to the row id of the deleted row. Do we already
> > > have such functionality in Iceberg?
> > >
> > > On concurrent writes: For the time being, I think we should not allow
> > > concurrent maintenance tasks, including equality delete conversion.
> > > Concurrent writes are still supported, as long as they go to the
> > > staging branch.
> > >
> > > @Manu:
> > >
> > > +1 to Peter's response. The primary key index is bounded and
> > > independent of the number of accumulated equality deletes, so memory
> > > doesn't blow up, as long as we have sufficient resources to load the
> > > index. We definitely cannot rely on the full index to fit into memory.
> > > Fortunately, Flink is already prepared for this; it supports spilling
> > > to disk via its RocksDB state backend.
> > >
> > > Cheers,
> > > Max
> > >
> > > On Fri, Mar 20, 2026 at 11:07 AM Péter Váry <[email protected]> wrote:
> > > >
> > > > Equality delete resolution could be made significantly more efficient by using an index (e.g., backed by RocksDB) to store the current mapping from primary keys to (file, position). While the memory footprint would not be small, it would be bounded and independent of the number of accumulated equality deletes. In addition, even a blocking compaction should block for a shorter period than the typical interval at which table compactions are scheduled.
> > > >
> > > > Manu Zhang <[email protected]> wrote (on Thu, Mar 19, 2026, 15:57):
> > > >>
> > > >> Thanks Max for the proposal. One question here.
> > > >> When the convert task cannot finish in time (e.g. blocked by compactions), and equality deletes accumulate on the staging branch, will we have the same issue as loading too many equality deletes and blowing up memory?
> > > >>
> > > >> Regards,
> > > >> Manu
> > > >>
> > > >> On Thu, Mar 19, 2026 at 2:45 PM Péter Váry <[email protected]> wrote:
> > > >>>
> > > >>> Thanks, Max, for continuing to push this forward.
> > > >>>
> > > >>> The proposal feels like a step in the right direction, but I would like to see a clearer view of the end goal. As it stands, equality deletes remain in the spec because the changes are committed to an intermediate branch. Since the long‑term objective is to remove equality deletes from the specification altogether, we should be clear about the final solution that achieves this.
> > > >>>
> > > >>> Flink writes will also continue to have the limitation that row lineage is not maintained correctly. This is unchanged from the current situation, but I think it’s important to explicitly call this out, or ideally, explore whether there’s a way to address it.
> > > >>>
> > > >>> In addition, concurrent writes and compactions would require updating the primary key index, which could be expensive.
> > > >>>
> > > >>> That said, I don’t see a clearly better alternative at the moment, and overall this seems like a reasonable way forward.
> > > >>>
> > > >>> Thanks again for continuing to drive the proposal.
> > > >>> Peter
> > > >>>
> > > >>>
> > > >>> On Wed, Mar 18, 2026, 16:47 Maximilian Michels <[email protected]> wrote:
> > > >>>>
> > > >>>> Hi,
> > > >>>>
> > > >>>> I'd like to discuss resolving equality deletes in the Flink write
> > > >>>> path, which will get us one step closer to removing equality deletes
> > > >>>> from the spec.
> > > >>>>
> > > >>>> ## tl;dr
> > > >>>>
> > > >>>> We're planning to add an equality delete to deletion vector (DV)
> > > >>>> conversion to Flink. Equality deletes may remain as an internal
> > > >>>> intermediary format.
> > > >>>>
> > > >>>> ## Background
> > > >>>>
> > > >>>> For deletes, Flink currently produces equality delete files.
> > > >>>>
> > > >>>> Equality deletes are used to support deletes in the write path, which
> > > >>>> is a requirement for many use cases like CDC [3]. They are cheap for
> > > >>>> the writer; it only notes down the to-be-deleted values of the
> > > >>>> identifier fields inside so-called delete files, and leaves it up to
> > > >>>> the readers to match the values to the corresponding rows. The heavy
> > > >>>> lifting has to be done by the readers, which potentially need to scan
> > > >>>> the entire table to resolve equality deletes.
> > > >>>>
> > > >>>> Therefore, equality deletes have been criticized. There are
> > > >>>> discussions around deprecating / removing them [1].
> > > >>>>
> > > >>>> ## Resolving Equality Deletes
> > > >>>>
> > > >>>> Steven, Peter, and a few other contributors came up with a proposal to
> > > >>>> convert equality deletes into DVs [2]. The original solution was quite
> > > >>>> complex, mainly due to the conflict handling between streaming writes,
> > > >>>> table maintenance, and equality delete resolution. The proposal is
> > > >>>> also blocked on index support in the Iceberg spec [5].
> > > >>>>
> > > >>>> We may need to simplify further to make some progress. The old table
> > > >>>> specs are going to be around for some time, even after we have a new
> > > >>>> spec with index support. Users have been asking for a solution to this
> > > >>>> issue for quite some time [3].
> > > >>>>
> > > >>>> The following is a modification of the original design document which
> > > >>>> adapts the ideas described under "use lock to avoid conflicts" [2].
> > > >>>>
> > > >>>> ## Proposed Solution
> > > >>>>
> > > >>>> The idea is to add the equality delete to deletion vector (DV)
> > > >>>> conversion as a Flink table maintenance task. After recent changes, we
> > > >>>> can now run the writer and the maintenance in the same Flink job and
> > > >>>> use a Flink-maintained lock to avoid conflicts between the maintenance
> > > >>>> tasks.
> > > >>>>
> > > >>>> 1. Instead of writing directly to the target branch, the writer
> > > >>>> commits data files + equality deletes to a staging branch.
> > > >>>> 2. The new "EqualityDeleteResolver" maintenance task reads from the
> > > >>>> staging branch and converts the equality deletes to DVs using a
> > > >>>> Flink-maintained primary key index, then commits data files + DVs to
> > > >>>> the target branch.
> > > >>>> 3. The existing Flink maintenance framework's lock mechanism ensures
> > > >>>> mutual exclusion between the convert task and table compaction to
> > > >>>> avoid conflicts.
> > > >>>>
> > > >>>> After conversion, the target branch contains only data files and DVs,
> > > >>>> no equality deletes.
> > > >>>>
> > > >>>> ## Limitations
> > > >>>>
> > > >>>> - Readers will only see new data once the conversion is complete.
> > > >>>> This is partially mitigated by the fact that snapshots with equality
> > > >>>> deletes cannot be read properly with Flink today [4].
> > > >>>> - The Flink-maintained index needs to be built initially, which
> > > >>>> requires reading the entire table. We will use Flink's state backend,
> > > >>>> which, apart from heap-based storage, also supports spilling to disk
> > > >>>> via the RocksDB state backend.
> > > >>>>
> > > >>>> ## Wrapping up
> > > >>>>
> > > >>>> This solution may not be perfect because of the above limitations, but
> > > >>>> it provides a viable path to free users of the burden of equality
> > > >>>> deletes, which cannot be read efficiently by most engines today.
> > > >>>> Eventually, the Flink-maintained index can be replaced by an Iceberg
> > > >>>> index, which will allow for the index to be shared across engines.
> > > >>>>
> > > >>>> What does the community think?
> > > >>>>
> > > >>>> Thanks,
> > > >>>> Max
> > > >>>>
> > > >>>> [1] Deprecate equality deletes:
> > > >>>> https://lists.apache.org/thread/z0gvco6hn2bpgngvk4h6xqrnw8b32sw6
> > > >>>> [2] Design doc:
> > > >>>> https://docs.google.com/document/d/1Jz4Fjt-6jRmwqbgHX_u0ohuyTB9ytDzfslS7lYraIjk/edit
> > > >>>> [3] Upserts use case:
> > > >>>> https://lists.apache.org/thread/rt7dmg7l78xpzc9w3lwn090yzqq4fyyw
> > > >>>> [4] Handling upserts downstream:
> > > >>>> https://lists.apache.org/thread/bx1ntfr45c0g9rh643yw7w8znv6wtrno
> > > >>>> [5] V4 Index: https://lists.apache.org/thread/xdkzllzt4p3tvcd3ft4t7jsvyvztr41j
> > > >
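
For anyone skimming the thread, the resolution step discussed here (equality deletes from a staging commit looked up in a sharded primary key index and turned into per-file deletion vectors) might look roughly like the Python sketch below. It is only an in-memory illustration and not the code in the PR: `ShardedKeyIndex`, `resolve_commit`, the shard count, and the file names are all hypothetical, plain dicts stand in for Flink state / RocksDB, and a set of row positions stands in for a real DV. It also assumes that equality deletes only apply to rows written by earlier commits, and that an upsert always arrives as a delete followed by an insert.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, List, Optional, Set, Tuple

Position = Tuple[str, int]  # (data file path, row position within that file)


class ShardedKeyIndex:
    """Hypothetical stand-in for the primary key index: key -> (file, position)."""

    def __init__(self, num_shards: int = 4) -> None:
        # One dict per shard; a real job would keep each shard in keyed state.
        self._shards: List[Dict[Hashable, Position]] = [{} for _ in range(num_shards)]

    def _shard(self, key: Hashable) -> Dict[Hashable, Position]:
        return self._shards[hash(key) % len(self._shards)]

    def put(self, key: Hashable, file: str, pos: int) -> None:
        self._shard(key)[key] = (file, pos)

    def remove(self, key: Hashable) -> Optional[Position]:
        # Returns the last known location of the key, or None if it is unknown.
        return self._shard(key).pop(key, None)


def resolve_commit(
    index: ShardedKeyIndex,
    equality_deletes: Iterable[Hashable],
    new_rows: Iterable[Tuple[Hashable, str, int]],
) -> Dict[str, Set[int]]:
    """Apply one staging commit and return deletion vectors as {file: {positions}}."""
    dvs: Dict[str, Set[int]] = defaultdict(set)
    # Resolve deletes first: they target rows written before this commit.
    for key in equality_deletes:
        old = index.remove(key)
        if old is not None:  # deletes for unknown keys resolve to nothing
            dvs[old[0]].add(old[1])
    # Then register the rows added by this commit.
    for key, file, pos in new_rows:
        index.put(key, file, pos)
    return dict(dvs)


index = ShardedKeyIndex(num_shards=2)
resolve_commit(index, [], [(1, "a.parquet", 0), (2, "a.parquet", 1)])
# Upsert of key 1 plus a delete for a key that was never written:
dvs = resolve_commit(index, [1, 3], [(1, "b.parquet", 0)])
# dvs == {"a.parquet": {0}}
```

The point of the sketch is the bound Peter and Max describe: the index holds one entry per live primary key, so its size does not grow with the number of accumulated equality deletes.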
