Hi Iceberg community,

What are the current best practices for streaming upserts into an Iceberg
table?

Today, we have the following setup in production to support CDC:

1. A Flink job that continuously appends CDC events into an append-only raw
table
2. A periodically scheduled Spark job that upserts into the `current`
table from the `raw` table
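
For concreteness, the logic of step 2 is roughly the following. This is a
simplified in-memory Python sketch, not our actual Spark job: the event
shape (key, op, seq, row), the op codes, and the name `apply_cdc_batch` are
assumptions made purely for illustration.

```python
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical CDC event shape: (key, op, sequence number, row payload).
# op is "I" (insert), "U" (update), or "D" (delete); row is None for deletes.
Event = Tuple[str, str, int, Optional[dict]]


def apply_cdc_batch(
    current: Dict[str, dict], raw_events: Iterable[Event]
) -> Dict[str, dict]:
    """Return a new `current` state after applying the latest event per key."""
    # Step 1: keep only the event with the highest sequence number per key,
    # so each key is touched at most once by the upsert.
    latest: Dict[str, Event] = {}
    for event in raw_events:
        key, _, seq, _ = event
        if key not in latest or seq > latest[key][2]:
            latest[key] = event

    # Step 2: upsert. Deletes remove the key; inserts and updates overwrite it.
    result = dict(current)
    for key, (_, op, _, row) in latest.items():
        if op == "D":
            result.pop(key, None)
        else:
            result[key] = row
    return result
```

For example, starting from {"a": {"v": 1}, "b": {"v": 2}} and applying two
updates to "a" (seq 2 and 5), a delete of "b" (seq 3), and an insert of "c"
(seq 4) leaves "a" at its seq-5 value, drops "b", and adds "c".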

We are exploring whether it’s feasible to stream upserts directly into an
Iceberg table from Flink. This could simplify our architecture and
potentially tighten our data freshness SLA further. We’ve experimented with this
approach before, but ran into reader-side performance issues due to the
accumulation of equality deletes over time.

From what I can gather, streaming upserts still seems to be an open design
area:

1. (Please correct me if I’m wrong; this summary is partly based on ChatGPT
5.1.) The book “Apache Iceberg: The Definitive Guide” suggests the
two-table pattern we’re currently using in production.
2. These threads:
https://lists.apache.org/thread/gjjr30txq318qp6pff3x5fx1jmdnr6fv ,
https://lists.apache.org/thread/xdkzllzt4p3tvcd3ft4t7jsvyvztr41j discuss
the idea of outputting only positional deletes (no equality deletes) by
introducing an index. However, this appears to still be under discussion
and may be targeted for v4, with no concrete timeline yet.
3. This thread
https://lists.apache.org/thread/6fhpjszsfxd8p0vfzc3k5vw7zmcyv2mq talks
about deprecating equality deletes, but I haven’t seen a clearly defined
alternative come out of that discussion yet.

Given all of the above, I’d really appreciate guidance from the community
on:

1. Recommended patterns for streaming upserts with Flink into Iceberg today
(the long-term direction is good to know as well, but my focus is on what is
possible in the near term).
2. Practical experiences or lessons learned from teams running streaming
upserts in production.

Thanks in advance for any insights and corrections.

Best,
Lu
