Re: Best practices for streaming upserts into Iceberg tables

Gang Wu Tue, 20 Jan 2026 20:52:47 -0800

Hi Lu,

Nice to hear from you here in the Iceberg community :)


We have built an internal service that converts streaming upserts into
position deletes, which happens to have a lot in common with [1] and [2]. I
believe this is a viable approach to achieving second-level freshness.

[1]
https://docs.google.com/document/d/1Jz4Fjt-6jRmwqbgHX_u0ohuyTB9ytDzfslS7lYraIjk
[2] https://www.mooncake.dev/whitepaper
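
To make the idea concrete, here is a toy in-memory sketch of resolving
upserts to position deletes through a key index. It is not our internal
service and it does not use the Iceberg API; the class and field names
(PositionDeleteWriter, index, data_files) are hypothetical and only
illustrate the general technique.

```python
from dataclasses import dataclass, field


@dataclass
class PositionDeleteWriter:
    """Toy model (hypothetical): turns upserts into (file, position) deletes."""

    # primary key -> (data_file, row_position) of the live row for that key
    index: dict = field(default_factory=dict)
    # data_file -> list of rows, standing in for immutable data files
    data_files: dict = field(default_factory=dict)
    # recorded position deletes as (data_file, row_position)
    position_deletes: list = field(default_factory=list)

    def upsert(self, data_file, key, row):
        # If the key is already live, delete the old row by its position
        # instead of emitting an equality delete on the key.
        old = self.index.get(key)
        if old is not None:
            self.position_deletes.append(old)
        rows = self.data_files.setdefault(data_file, [])
        rows.append(row)
        self.index[key] = (data_file, len(rows) - 1)

    def scan(self):
        # A reader only needs to skip the known (file, position) pairs.
        deleted = set(self.position_deletes)
        return [
            row
            for name, rows in self.data_files.items()
            for pos, row in enumerate(rows)
            if (name, pos) not in deleted
        ]
```

The reader never has to join against key values here, which is the cost
that equality deletes accumulate over time; the price is keeping the key
index available to the writer.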

Best,
Gang




On Wed, Jan 21, 2026 at 11:05 AM Lu Niu <[email protected]> wrote:

> Hi Iceberg community,
>
> What are the current best practices for streaming upserts into an Iceberg
> table?
>
> Today, we have the following setup in production to support CDC:
>
> 1. A Flink job that continuously appends CDC events into an append-only
> raw table
> 2. A periodically scheduled Spark job that upserts into the `current`
> table from the `raw` table
>
> We are exploring whether it’s feasible to stream upserts directly into an
> Iceberg table from Flink. This could simplify our architecture and
> potentially further reduce our data SLA. We’ve experimented with this
> approach before, but ran into reader-side performance issues due to the
> accumulation of equality deletes over time.
>
> From what I can gather, streaming upserts still seems to be an open design
> area:
>
> 1. (Please correct me if I’m wrong—this summary is partly based on ChatGPT
> 5.1.) The book “Apache Iceberg: The Definitive Guide” suggests the
> two-table pattern we’re currently using in production.
> 2. These threads:
> https://lists.apache.org/thread/gjjr30txq318qp6pff3x5fx1jmdnr6fv ,
> https://lists.apache.org/thread/xdkzllzt4p3tvcd3ft4t7jsvyvztr41j discuss
> the idea of outputting only positional deletes (no equality deletes) by
> introducing an index. However, this appears to still be under discussion
> and may be targeted for v4, with no concrete timeline yet.
> 3. This thread
> https://lists.apache.org/thread/6fhpjszsfxd8p0vfzc3k5vw7zmcyv2mq talks
> about deprecating equality deletes, but I haven’t seen a clearly defined
> alternative come out of that discussion yet.
>
> Given all of the above, I’d really appreciate guidance from the community
> on:
>
> 1. Recommended patterns for streaming upserts with Flink into Iceberg
> today (it would be good to know the long-term possibilities as well, but
> my focus is on what's possible in the near term).
> 2. Practical experiences or lessons learned from teams running streaming
> upserts in production
>
> Thanks in advance for any insights and corrections.
>
> Best,
> Lu
>
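
For reference, the two-table pattern in the quoted message (append CDC
events to `raw`, periodically fold them into `current`) can be sketched in
plain Python as below. This is only an illustration under assumptions: the
event fields (seq, key, op, row) and the function name apply_cdc are
hypothetical, and a production job would express the same logic as a Spark
merge rather than in-memory dicts.

```python
def apply_cdc(current, raw_events):
    """Fold a batch of CDC events into a key -> row snapshot (hypothetical)."""
    # Keep only the latest event per key, ordered by a sequence number.
    latest = {}
    for event in sorted(raw_events, key=lambda e: e["seq"]):
        latest[event["key"]] = event

    # Apply the surviving events to a copy of the current snapshot.
    merged = dict(current)
    for key, event in latest.items():
        if event["op"] == "delete":
            merged.pop(key, None)
        else:
            merged[key] = event["row"]
    return merged
```

Deduplicating to the latest event per key before merging keeps the result
independent of the order in which events were appended to `raw`.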
