Hi Iceberg community,

What are the current best practices for streaming upserts into an Iceberg table?

Today, we have the following setup in production to support CDC:

1. A Flink job that continuously appends CDC events into an append-only `raw` table.
2. A periodically scheduled Spark job that upserts changes from the `raw` table into the `current` table.

We are exploring whether it's feasible to stream upserts directly into an Iceberg table from Flink. This could simplify our architecture and potentially further reduce our data SLA. We've experimented with this approach before, but ran into reader-side performance issues due to the accumulation of equality deletes over time.

From what I can gather, streaming upserts still seems to be an open design area:

1. (Please correct me if I'm wrong; this summary is partly based on ChatGPT 5.1.) The book "Apache Iceberg: The Definitive Guide" suggests the two-table pattern we're currently using in production.
2. These threads discuss the idea of outputting only positional deletes (no equality deletes) by introducing an index:
   https://lists.apache.org/thread/gjjr30txq318qp6pff3x5fx1jmdnr6fv
   https://lists.apache.org/thread/xdkzllzt4p3tvcd3ft4t7jsvyvztr41j
   However, this appears to still be under discussion and may be targeted for v4, with no concrete timeline yet.
3. This thread talks about deprecating equality deletes, but I haven't seen a clearly defined alternative come out of that discussion yet:
   https://lists.apache.org/thread/6fhpjszsfxd8p0vfzc3k5vw7zmcyv2mq

Given all of the above, I'd really appreciate guidance from the community on:

1. Recommended patterns for streaming upserts with Flink into Iceberg today (it's good to know the long-term possibilities as well, but my focus is on what's possible in the near term).
2. Practical experiences or lessons learned from teams running streaming upserts in production.

Thanks in advance for any insights and corrections.

Best,
Lu
