Re: Best practices for streaming upserts into Iceberg tables

Steven Wu Tue, 20 Jan 2026 23:27:05 -0800

Lu,

You are correct about the design doc for Flink writing only position
deletes. The original design was highly complex, so we were thinking about
alternatives with a narrower scope. But there has been no progress so far,
and there is no timeline.

IMHO, your setup is good practice today. Ryan wrote a series of blog posts
on this pattern:
https://tabular.medium.com/hello-world-of-cdc-e6f06ddbfcc0
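
For illustration only, here is a minimal plain-Python sketch of what the
periodic merge step in the two-table pattern does logically: keep the newest
CDC event per key from the raw table, then apply it to the current table. The
names `CdcEvent`, `latest_per_key`, and `merge_into_current` and the
`op`/`seq` fields are hypothetical, and the in-memory dicts stand in for
Iceberg tables. A real job would more likely express this as a Spark
`MERGE INTO` statement.

```python
from typing import Dict, Iterable, NamedTuple, Optional


class CdcEvent(NamedTuple):
    key: int              # primary key of the source row
    op: str               # "I" (insert), "U" (update) or "D" (delete)
    seq: int              # increasing change sequence, e.g. a log offset (assumed)
    value: Optional[str]  # new row payload; None for deletes


def latest_per_key(events: Iterable[CdcEvent]) -> Dict[int, CdcEvent]:
    """Keep only the newest event for each key (dedup before the merge)."""
    latest: Dict[int, CdcEvent] = {}
    for ev in events:
        if ev.key not in latest or ev.seq > latest[ev.key].seq:
            latest[ev.key] = ev
    return latest


def merge_into_current(current: Dict[int, str],
                       events: Iterable[CdcEvent]) -> Dict[int, str]:
    """Apply one batch of raw events to a copy of the current table."""
    result = dict(current)
    for key, ev in latest_per_key(events).items():
        if ev.op == "D":
            result.pop(key, None)   # delete wins if it is the latest event
        else:
            result[key] = ev.value  # insert and update are both upserts
    return result
```

Because the merge runs in batches, each run rewrites or deletes rows once per
key regardless of how many intermediate changes arrived in the raw table.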

Some people use the current Flink Iceberg sink for CDC ingestion. But it
produces equality deletes, which require aggressive compaction and add
operational burden. Also, not all engines can read equality deletes.
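
As a rough sketch of why equality deletes are heavier on the read side, the
plain-Python toy below contrasts the two delete styles. It is a simplified
model, not Iceberg's actual reader: all function names are hypothetical, files
are in-memory lists, and sequence numbers are ignored (every data file shown
is assumed to be older than the deletes).

```python
from typing import Dict, List, Set, Tuple

Row = Tuple[int, str]             # (key, value)
DataFiles = Dict[str, List[Row]]  # file name -> rows in file order


def read_with_equality_deletes(files: DataFiles,
                               deleted_keys: Set[int]) -> List[Row]:
    """The reader must compare every row's key against the delete set."""
    return [row
            for rows in files.values()
            for row in rows
            if row[0] not in deleted_keys]


def read_with_position_deletes(files: DataFiles,
                               deleted_pos: Set[Tuple[str, int]]) -> List[Row]:
    """The reader skips rows by (file, position) without inspecting keys."""
    return [row
            for name, rows in files.items()
            for pos, row in enumerate(rows)
            if (name, pos) not in deleted_pos]


def to_position_deletes(files: DataFiles,
                        deleted_keys: Set[int]) -> Set[Tuple[str, int]]:
    """Resolve key-based deletes to positions. This lookup is the work that
    has to happen somewhere: in compaction, or at write time via an index."""
    return {(name, pos)
            for name, rows in files.items()
            for pos, row in enumerate(rows)
            if row[0] in deleted_keys}
```

Both read paths return the same rows; the difference is whether the key
lookup is paid by every reader or once, ahead of time.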

Thanks,
Steven

On Tue, Jan 20, 2026 at 8:44 PM Gang Wu <[email protected]> wrote:

> Hi Lu,
>
> Nice to hear from you here in the Iceberg community :)
>
> We have built an internal service to stream upserts into position deletes,
> which happens to have a lot in common with [1] and [2]. I believe this is a
> viable approach to achieve second-level freshness.
>
> [1]
> https://docs.google.com/document/d/1Jz4Fjt-6jRmwqbgHX_u0ohuyTB9ytDzfslS7lYraIjk
> [2] https://www.mooncake.dev/whitepaper
>
> Best,
> Gang
>
>
>
>
> On Wed, Jan 21, 2026 at 11:05 AM Lu Niu <[email protected]> wrote:
>
>> Hi Iceberg community,
>>
>> What are the current best practices for streaming upserts into an Iceberg
>> table?
>>
>> Today, we have the following setup in production to support CDC:
>>
>> 1. A Flink job that continuously appends CDC events into an append-only
>> raw table
>> 2. A periodically scheduled Spark job that upserts into the `current`
>> table from the `raw` table
>>
>> We are exploring whether it’s feasible to stream upserts directly into an
>> Iceberg table from Flink. This could simplify our architecture and
>> potentially further reduce our data SLA. We’ve experimented with this
>> approach before, but ran into reader-side performance issues due to the
>> accumulation of equality deletes over time.
>>
>> From what I can gather, streaming upserts still seems to be an open
>> design area:
>>
>> 1. (Please correct me if I’m wrong—this summary is partly based on
>> ChatGPT 5.1.) The book “Apache Iceberg: The Definitive Guide” suggests the
>> two-table pattern we’re currently using in production.
>> 2. These threads:
>> https://lists.apache.org/thread/gjjr30txq318qp6pff3x5fx1jmdnr6fv ,
>> https://lists.apache.org/thread/xdkzllzt4p3tvcd3ft4t7jsvyvztr41j discuss
>> the idea of outputting only positional deletes (no equality deletes) by
>> introducing an index. However, this appears to still be under discussion
>> and may be targeted for v4, with no concrete timeline yet.
>> 3. This thread
>> https://lists.apache.org/thread/6fhpjszsfxd8p0vfzc3k5vw7zmcyv2mq talks
>> about deprecating equality deletes, but I haven’t seen a clearly defined
>> alternative come out of that discussion yet.
>>
>> Given all of the above, I’d really appreciate guidance from the community
>> on:
>>
>> 1. Recommended patterns for streaming upserts with Flink into Iceberg
>> today (it's good to know the long-term possibilities as well, but my focus
>> is on what's possible in the near term).
>> 2. Practical experiences or lessons learned from teams running streaming
>> upserts in production
>>
>> Thanks in advance for any insights and corrections.
>>
>> Best
>> Lu
>>
>
