Re: Best practices for streaming upserts into Iceberg tables

Péter Váry Thu, 22 Jan 2026 00:03:58 -0800

Hi Lu,

Steven and I have other priorities at the moment, so please feel free to
pick up any loose threads here.


We gained quite a lot by relaxing some of the requirements from the
original proposal. If we accept that equality deletes remain, but compacted
very soon, we could get at an even more limited change that still helps
your use case and points in the right overall direction.

My initial proposal would be:

   - First, ensure that Guo’s PR enabling Flink TableMaintenance without a
   LockManager gets merged: https://github.com/apache/iceberg/pull/15042
   - Introduce a TableMaintenance task that handles newly written equality
   deletes:
      - Maintain a PK → (filename, position) map in memory, distributed
      across the TaskManagers.
      - In the absence of indexes for now, this map can be reconstructed on
      every job start.
      - Update the map whenever new table changes are committed.
      - Convert PK-based equality deletes into position deletes.
   - Place this task at the end of the streaming ingest pipeline.

This approach builds entirely on existing components and can later be
enhanced with proper index support. If we decide to refactor the ingest
stream, we could even avoid committing equality deletes to the table
altogether and remove the need for them in the Flink jobs.

I don’t have the bandwidth to push this forward myself, but I’d be very
happy to help with proposal and code reviews.

Thanks,
Peter



On Wed, Jan 21, 2026, 18:48 Lu Niu <[email protected]> wrote:

> Thank you for all your replies!
>
> > Some people use the current Flink Iceberg sink for CDC ingestion. But it
> would produce equality deletes that would require aggressive compactions
> and add operational burden too
>
> Since the main concern is reader-side performance degradation due to the
> accumulation of equality deletes over time. Is there a way to estimate
> impact on the reader side based on equality deletes in a snapshot summary?
>
> ```
> while new_snapshot_summary_is_ready:
>     should_compact = analyze_snapshot_summary(snapshot_summary)
>     if should_compact:
>         rewrite_data_files()
> ```
>
> > The original design has high complexity. We were thinking about
> alternatives with narrower scope. But there isn't any progress and timeline
> .
>
> If this is the community aligned long term,  Is there any way I could
> contribute to speed this up? Thanks!
>
> Best
> Lu
>
>
> On Wed, Jan 21, 2026 at 2:18 AM Maximilian Michels <[email protected]> wrote:
>
>> Hi Lu,
>>
>> Just to iterate the status quo: Flink supports upserts, but only via
>> equality delete + append. So technically, "streaming writes" aren't an
>> issue. It's the read path which causes the issue, because unlike
>> positional deletes, which can be resolved on the fly during streaming
>> reads, equality deletes potentially require a full table scan to be
>> materialized. Constant snapshot compaction is required to keep the
>> read path efficient.
>>
>> >1. A Flink job that continuously appends CDC events into an append-only
>> raw table
>> >2. A periodically scheduled Spark job that performs upsert the `current`
>> table using `raw` table
>>
>> This makes sense. Conceptually, you are pre-compacting upserts before
>> writing into the final "current" table. This avoids equality deletes
>> entirely and keeps the read path on the "current" table efficient at
>> all times. The drawback is that your lower bound latency will be the
>> interval at which the Spark job runs, but this is an acceptable price
>> to pay, until we have a way to write positional deletes right away,
>> avoiding equality deletes entirely.
>>
>> Cheers,
>> Max
>>
>> On Wed, Jan 21, 2026 at 8:48 AM melin li <[email protected]> wrote:
>> >
>> > Flink CDC support reading binlog data from databases such as MySQL and
>> PostgreSQL, and writing it to Iceberg, Hudi, and Paimon.
>> > https://github.com/apache/flink-cdc/pulls?q=iceberg
>> >
>> > Steven Wu <[email protected]> 于2026年1月21日周三 15:27写道：
>> >>
>> >> Lu,
>> >>
>> >> you are correct about the design doc for Flink writing position
>> deletes only. The original design has high complexity. We were thinking
>> about alternatives with narrower scope. But there isn't any progress and
>> timeline .
>> >>
>> >> IMHO, your setup is a good practice today. Ryan wrote a series of
>> blogs for the pattern:
>> https://tabular.medium.com/hello-world-of-cdc-e6f06ddbfcc0.
>> >>
>> >> Some people use the current Flink Iceberg sink for CDC ingestion. But
>> it would produce equality deletes that would require aggressive compactions
>> and add operational burden too. Also not all engines can read equality
>> deletes.
>> >>
>> >> Thanks,
>> >> Steven
>> >>
>> >> On Tue, Jan 20, 2026 at 8:44 PM Gang Wu <[email protected]> wrote:
>> >>>
>> >>> Hi Lu,
>> >>>
>> >>> Nice to hear from you here in the Iceberg community :)
>> >>>
>> >>> We have built an internal service to stream upserts into position
>> deletes which happens to have a lot in common with [1] and [2]. I believe
>> this is a viable approach to achieve second freshness.
>> >>>
>> >>> [1]
>> https://docs.google.com/document/d/1Jz4Fjt-6jRmwqbgHX_u0ohuyTB9ytDzfslS7lYraIjk
>> >>> [2] https://www.mooncake.dev/whitepaper
>> >>>
>> >>> Best,
>> >>> Gang
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> On Wed, Jan 21, 2026 at 11:05 AM Lu Niu <[email protected]> wrote:
>> >>>>
>> >>>> Hi Iceberg community,
>> >>>>
>> >>>> What are the current best practices for streaming upserts into an
>> Iceberg table?
>> >>>>
>> >>>> Today, we have the following setup in production to support CDC:
>> >>>>
>> >>>> 1. A Flink job that continuously appends CDC events into an
>> append-only raw table
>> >>>> 2, A periodically scheduled Spark job that performs upsert the
>> `current` table using `raw` table
>> >>>>
>> >>>> We are exploring whether it’s feasible to stream upserts directly
>> into an Iceberg table from Flink. This could simplify our architecture and
>> potentially further reduce our data SLA. We’ve experimented with this
>> approach before, but ran into reader-side performance issues due to the
>> accumulation of equality deletes over time.
>> >>>>
>> >>>> From what I can gather, streaming upserts still seems to be an open
>> design area:
>> >>>>
>> >>>> 1. (Please correct me if I’m wrong—this summary is partly based on
>> ChatGPT 5.1.) The book “Apache Iceberg: The Definitive Guide” suggests the
>> two-table pattern we’re currently using in production.
>> >>>> 2.  These threads:
>> https://lists.apache.org/thread/gjjr30txq318qp6pff3x5fx1jmdnr6fv ,
>> https://lists.apache.org/thread/xdkzllzt4p3tvcd3ft4t7jsvyvztr41j discuss
>> the idea of outputting only positional deletes (no equality deletes) by
>> introducing an index. However, this appears to still be under discussion
>> and may be targeted for v4, with no concrete timeline yet.
>> >>>> 3. this thread
>> https://lists.apache.org/thread/6fhpjszsfxd8p0vfzc3k5vw7zmcyv2mq talks
>> about deprecating equality deletes, but I haven’t seen a clearly defined
>> alternative come out of that discussion yet.
>> >>>>
>> >>>> Given all of the above, I’d really appreciate guidance from the
>> community on:
>> >>>>
>> >>>> 1. Recommended patterns for streaming upserts with Flink into
>> Iceberg today (it's good to know the long term possible as well, but my
>> focus is what's possible in near term).
>> >>>> 2. Practical experiences or lessons learned from teams running
>> streaming upserts in production
>> >>>>
>> >>>> Thanks in advance for any insights and corrections.
>> >>>>
>> >>>> Best
>> >>>> Lu
>>
>

Re: Best practices for streaming upserts into Iceberg tables

Reply via email to