Re: [DISCUSS] Flink: Equality delete to DV conversion

Maximilian Michels Thu, 16 Apr 2026 08:03:39 -0700

Thanks everyone for the discussion and support.

I've opened a PR which implements what we discussed here:
https://github.com/apache/iceberg/pull/15996


Just to recap:

The EqualityDeleteConverter (EDC) runs inline with the writer, which
commits its equality deletes to a staging branch. We
monitor the staging branch for new commits, build a sharded primary
key index in Flink state (backed by RocksDB for large tables), resolve
equality deletes against the index, and commit back the resulting data
files + DVs to the main branch.
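
To make the resolution step concrete, here is a rough sketch in plain
Python. It is not the code from the PR: the names ShardedKeyIndex and
convert_commit are made up for illustration, ordinary dicts stand in
for Flink keyed state / RocksDB, and a "DV" is modeled as a set of row
positions per data file. It also assumes that equality deletes in a
staging commit only apply to rows from earlier commits.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, List, Optional, Set, Tuple

# (data file path, row position inside that file)
Position = Tuple[str, int]


class ShardedKeyIndex:
    """Toy stand-in for the primary key index kept in Flink state."""

    def __init__(self, num_shards: int = 4) -> None:
        # One dict per shard; in Flink this would be keyed state per subtask.
        self.shards: List[Dict[Hashable, Position]] = [
            {} for _ in range(num_shards)
        ]

    def _shard(self, key: Hashable) -> Dict[Hashable, Position]:
        return self.shards[hash(key) % len(self.shards)]

    def resolve_delete(self, key: Hashable) -> Optional[Position]:
        # Remove the key and return where its current row lives, if any.
        return self._shard(key).pop(key, None)

    def upsert(self, key: Hashable, position: Position) -> None:
        self._shard(key)[key] = position


def convert_commit(
    index: ShardedKeyIndex,
    equality_deletes: Iterable[Hashable],
    data_files: Dict[str, List[Hashable]],
) -> Dict[str, Set[int]]:
    """Turn one staging commit into per-file deleted positions (DVs)."""
    dvs: Dict[str, Set[int]] = defaultdict(set)

    # 1. Resolve equality deletes against rows from earlier commits.
    for key in equality_deletes:
        hit = index.resolve_delete(key)
        if hit is not None:
            path, pos = hit
            dvs[path].add(pos)

    # 2. Register the rows of the newly written data files in the index.
    for path, keys in data_files.items():
        for pos, key in enumerate(keys):
            index.upsert(key, (path, pos))

    return dict(dvs)
```

For example, after indexing a file "a.parquet" with keys [1, 2, 3], a
staging commit that deletes key 2 and writes "b.parquet" with key 2
yields a DV marking position 1 in "a.parquet", and the index then
points key 2 at position 0 of "b.parquet". The index size depends on
the number of live keys, not on how many equality deletes have piled
up on the staging branch.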

The PR is split into several commits, which we can break out into
separate PRs for easier review. There are some limitations and
follow-ups listed in the PR. The biggest gaps are preserving row
lineage and lifecycle management of the staging branch. In a
follow-up, we will also add integration with the Flink IcebergSink. I
tried to keep the scope limited to the EDC maintenance task for the
first PR.

Cheers,
Max

On Thu, Apr 2, 2026 at 9:43 AM Márton Balassi <[email protected]> wrote:
>
> Thanks for raising this, Max, and for the feedback, Peter and Manu.
>
> I am supportive of this proposal, especially with the clearly defined vision 
> of eventually completely removing the need for equality deletes.
>
> Lifting the reliance on equality deletes in the Flink write path would be a 
> significant improvement, both in terms of read efficiency (by moving towards 
> delete vectors) and in terms of new capabilities, as it would make writing 
> upserts from Flink a viable path to explore going forward.
>
> Cheers,
> Marton
>
> On 2026/03/20 11:15:26 Maximilian Michels wrote:
> > Thanks Peter and Manu for the feedback.
> >
> > @Peter:
> >
> > Good point on the end goal. The end goal should be to completely
> > remove equality deletes.
> >
> > While the staging branch, which contains the equality deletes, is an
> > internal implementation detail of the Flink writer, it will still be
> > accessible via the Iceberg reader API. For the transition period, I
> > think this has several advantages:
> > 1. We don't need to fundamentally change the write logic of existing 
> > writers.
> > 2. We still allow for the data to be inspected before converting it
> > and merging it to the main branch. This is also helpful for
> > troubleshooting.
> >
> > The staging branch solution is a first step towards removing equality 
> > deletes.
> >
> > In V4, we could already deprecate equality deletes. Once the spec
> > includes indices, we can move the index into Iceberg, which should
> > make it easier to develop an in-place resolution of equality deletes
> > supporting multiple writers and conflict resolution. Admittedly, we
> > haven't fully figured out the best in-place approach. I think it is a
> > good idea to take it one step at a time.
> >
> > On row lineage: If we want to preserve the row id of updated rows, we
> > will have to store the row id in the primary key index. Theoretically,
> > we should be able to then add it to the corresponding new row. The
> > question is how to do that efficiently, such that we don't have to
> > rewrite any data files. We would need some way to map the row id of
> > the newly inserted row to the row id of the deleted row. Do we already
> > have such functionality in Iceberg?
> >
> > On concurrent writes: For the time being, I think we should not allow
> > concurrent maintenance tasks, including equality delete conversion.
> > Concurrent writes are still supported, as long as they go to the
> > staging branch.
> >
> > @Manu:
> >
> > +1 to Peter's response. The primary key index is bounded and
> > independent of the number of accumulated equality deletes, so memory
> > doesn't blow up, as long as we have sufficient resources to load the
> > index. We definitely cannot rely on the full index to fit into memory.
> > Fortunately, Flink is already prepared for this; it supports spilling
> > to disk via its RocksDB state backend.
> >
> > Cheers,
> > Max
> >
> > On Fri, Mar 20, 2026 at 11:07 AM Péter Váry <[email protected]> 
> > wrote:
> > >
> > > Equality delete resolution could be made significantly more efficient by 
> > > using an index (e.g., backed by RocksDB) to store the current mapping 
> > > from primary keys to (file, position). While the memory footprint would 
> > > not be small, it would be bounded and independent of the number of 
> > > accumulated equality deletes. In addition, even a blocking compaction 
> > > should block for a shorter period than the typical interval at which 
> > > table compactions are scheduled.
> > >
> > > On Thu, Mar 19, 2026 at 3:57 PM Manu Zhang <[email protected]> wrote:
> > >>
> > >> Thanks Max for the proposal. One question here.
> > >> When the convert task cannot finish in time (e.g. blocked by 
> > >> compactions) and equality deletes accumulate on the staging branch, 
> > >> will we run into the same problem of loading too many equality deletes 
> > >> and blowing up memory?
> > >>
> > >> Regards,
> > >> Manu
> > >>
> > >> On Thu, Mar 19, 2026 at 2:45 PM Péter Váry <[email protected]> 
> > >> wrote:
> > >>>
> > >>> Thanks, Max, for continuing to push this forward.
> > >>>
> > >>> The proposal feels like a step in the right direction, but I would like 
> > >>> to see a clearer view of the end goal. As it stands, equality deletes 
> > >>> remain in the spec because the changes are committed to an intermediate 
> > >>> branch. Since the long‑term objective is to remove equality deletes 
> > >>> from the specification altogether, we should be clear about the final 
> > >>> solution that achieves this.
> > >>>
> > >>> Flink writes will also continue to have the limitation that row lineage 
> > >>> is not maintained correctly. This is unchanged from the current 
> > >>> situation, but I think it’s important to explicitly call this out, or 
> > >>> ideally, explore whether there’s a way to address it.
> > >>>
> > >>> In addition, concurrent writes and compactions would require updating 
> > >>> the primary key index, which could be expensive.
> > >>>
> > >>> That said, I don’t see a clearly better alternative at the moment, and 
> > >>> overall this seems like a reasonable way forward.
> > >>>
> > >>> Thanks again for continuing to drive the proposal.
> > >>> Peter
> > >>>
> > >>>
> > >>> On Wed, Mar 18, 2026, 16:47 Maximilian Michels <[email protected]> wrote:
> > >>>>
> > >>>> Hi,
> > >>>>
> > >>>> I'd like to discuss resolving equality deletes in the Flink write
> > >>>> path, which will get us one step closer to removing equality deletes
> > >>>> from the spec.
> > >>>>
> > >>>> ## tl;dr
> > >>>>
> > >>>> We're planning to add an equality delete to deletion vector (DV)
> > >>>> conversion to Flink. Equality deletes may remain as an internal
> > >>>> intermediary format.
> > >>>>
> > >>>> ## Background
> > >>>>
> > >>>> For deletes, Flink currently produces equality delete files.
> > >>>>
> > >>>> Equality deletes are used to support deletes in the write path, which
> > >>>> is a requirement for many use cases like CDC [3]. They are cheap for
> > >>>> the writer; it only notes down the to-be-deleted values of the
> > >>>> identifier fields inside so-called delete files, and leaves it up to
> > >>>> the readers to match the values to the corresponding rows. The heavy
> > >>>> lifting has to be done by the readers, which potentially need to scan
> > >>>> the entire table to resolve equality deletes.
> > >>>>
> > >>>> Therefore, equality deletes have been criticized. There are
> > >>>> discussions around deprecating / removing them [1].
> > >>>>
> > >>>> ## Resolving Equality Deletes
> > >>>>
> > >>>> Steven, Peter, and a few other contributors came up with a proposal to
> > >>>> convert equality deletes into DVs [2]. The original solution was quite
> > >>>> complex, mainly due to the conflict handling between streaming writes,
> > >>>> table maintenance, and equality delete resolution. The proposal is
> > >>>> also blocked on index support in the Iceberg spec [5].
> > >>>>
> > >>>> We may need to simplify further to make some progress. The old table
> > >>>> specs are going to be around for some time, even after we have a new
> > >>>> spec with index support. Users have been asking for a solution to this
> > >>>> issue for quite some time [3].
> > >>>>
> > >>>> The following is a modification of the original design document which
> > >>>> adapts the ideas described under "use lock to avoid conflicts" [2].
> > >>>>
> > >>>> ## Proposed Solution
> > >>>>
> > >>>> The idea is to add the equality delete to deletion vector (DV)
> > >>>> conversion as a Flink table maintenance task. After recent changes, we
> > >>>> can now run the writer and the maintenance in the same Flink job and
> > >>>> use a Flink-maintained lock to avoid conflicts between the maintenance
> > >>>> tasks.
> > >>>>
> > >>>> 1. Instead of writing directly to the target branch, the writer
> > >>>> commits data files + equality deletes to a staging branch.
> > >>>> 2. The new "EqualityDeleteResolver" maintenance task reads from the
> > >>>> staging branch and converts the equality deletes to DVs using a
> > >>>> Flink-maintained primary key index, then commits data files + DVs to
> > >>>> the target branch.
> > >>>> 3. The existing Flink maintenance framework's lock mechanism ensures
> > >>>> mutual exclusion between the convert task and table compaction to
> > >>>> avoid conflicts.
> > >>>>
> > >>>> After conversion, the target branch contains only data files and DVs,
> > >>>> no equality deletes.
> > >>>>
> > >>>> ## Limitations
> > >>>>
> > >>>> - Readers will not see new data until the conversion is complete.
> > >>>> This is partially mitigated by the fact that snapshots with equality
> > >>>> deletes cannot be read properly with Flink today [4].
> > >>>> - The Flink-maintained index needs to be built initially, which
> > >>>> requires reading the entire table. We will use Flink's state backend,
> > >>>> which, apart from heap-based storage, also supports spilling to disk
> > >>>> via the RocksDB state backend.
> > >>>>
> > >>>> ## Wrapping up
> > >>>>
> > >>>> This solution may not be perfect because of the above limitations, but
> > >>>> it provides a viable path to free users of the burden of equality
> > >>>> deletes, which cannot be read efficiently by most engines today.
> > >>>> Eventually, the Flink-maintained index can be replaced by an Iceberg
> > >>>> index, which will allow for the index to be shared across engines.
> > >>>>
> > >>>> What does the community think?
> > >>>>
> > >>>> Thanks,
> > >>>> Max
> > >>>>
> > >>>> [1] Deprecate equality deletes:
> > >>>> https://lists.apache.org/thread/z0gvco6hn2bpgngvk4h6xqrnw8b32sw6
> > >>>> [2] Design doc:
> > >>>> https://docs.google.com/document/d/1Jz4Fjt-6jRmwqbgHX_u0ohuyTB9ytDzfslS7lYraIjk/edit
> > >>>> [3] Upserts use case:
> > >>>> https://lists.apache.org/thread/rt7dmg7l78xpzc9w3lwn090yzqq4fyyw
> > >>>> [4] Handling upserts downstream:
> > >>>> https://lists.apache.org/thread/bx1ntfr45c0g9rh643yw7w8znv6wtrno
> > >>>> [5] V4 Index: 
> > >>>> https://lists.apache.org/thread/xdkzllzt4p3tvcd3ft4t7jsvyvztr41j
> >
