Flink CDC supports reading change streams (e.g., the MySQL binlog or the PostgreSQL WAL) from databases such as MySQL and PostgreSQL, and writing them to Iceberg, Hudi, and Paimon. https://github.com/apache/flink-cdc/pulls?q=iceberg
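
For illustration, a pipeline along those lines can be written directly in Flink SQL (embedded in Java below). This is a minimal sketch, not a tested configuration: the table names, connection options, and warehouse path are made-up placeholders, and the exact Iceberg connector options may vary by version.

    // Sketch: MySQL binlog -> Iceberg v2 table via the Flink CDC connector.
    // All identifiers and connection options are hypothetical placeholders.
    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.TableEnvironment;

    public class CdcToIceberg {
      public static void main(String[] args) throws Exception {
        TableEnvironment tEnv =
            TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Source: exposes the MySQL binlog as a changelog stream.
        tEnv.executeSql(
            "CREATE TABLE orders_src ("
                + "  id BIGINT,"
                + "  amount DECIMAL(10, 2),"
                + "  PRIMARY KEY (id) NOT ENFORCED"
                + ") WITH ("
                + "  'connector' = 'mysql-cdc',"
                + "  'hostname' = 'mysql-host',"
                + "  'port' = '3306',"
                + "  'username' = 'user',"
                + "  'password' = 'secret',"
                + "  'database-name' = 'shop',"
                + "  'table-name' = 'orders'"
                + ")");

        // Sink: an Iceberg v2 table in upsert mode. Note that upsert mode
        // writes equality deletes, which is exactly the operational caveat
        // raised in the thread below.
        tEnv.executeSql(
            "CREATE TABLE orders_sink ("
                + "  id BIGINT,"
                + "  amount DECIMAL(10, 2),"
                + "  PRIMARY KEY (id) NOT ENFORCED"
                + ") WITH ("
                + "  'connector' = 'iceberg',"
                + "  'catalog-name' = 'demo',"
                + "  'catalog-type' = 'hadoop',"
                + "  'warehouse' = 'file:///tmp/warehouse',"
                + "  'format-version' = '2',"
                + "  'write.upsert.enabled' = 'true'"
                + ")");

        tEnv.executeSql("INSERT INTO orders_sink SELECT * FROM orders_src").await();
      }
    }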
Steven Wu <[email protected]> 于2026年1月21日周三 15:27写道: > Lu, > > you are correct about the design doc for Flink writing position deletes > only. The original design has high complexity. We were thinking about > alternatives with narrower scope. But there isn't any progress and timeline > . > > IMHO, your setup is a good practice today. Ryan wrote a series of blogs > for the pattern: > https://tabular.medium.com/hello-world-of-cdc-e6f06ddbfcc0. > > Some people use the current Flink Iceberg sink for CDC ingestion. But it > would produce equality deletes that would require aggressive compactions > and add operational burden too. Also not all engines can read equality > deletes. > > Thanks, > Steven > > On Tue, Jan 20, 2026 at 8:44 PM Gang Wu <[email protected]> wrote: > >> Hi Lu, >> >> Nice to hear from you here in the Iceberg community :) >> >> We have built an internal service to stream upserts into position deletes >> which happens to have a lot in common with [1] and [2]. I believe this is a >> viable approach to achieve second freshness. >> >> [1] >> https://docs.google.com/document/d/1Jz4Fjt-6jRmwqbgHX_u0ohuyTB9ytDzfslS7lYraIjk >> [2] https://www.mooncake.dev/whitepaper >> >> Best, >> Gang >> >> >> >> >> On Wed, Jan 21, 2026 at 11:05 AM Lu Niu <[email protected]> wrote: >> >>> Hi Iceberg community, >>> >>> What are the current best practices for streaming upserts into an >>> Iceberg table? >>> >>> Today, we have the following setup in production to support CDC: >>> >>> 1. A Flink job that continuously appends CDC events into an append-only >>> raw table >>> 2, A periodically scheduled Spark job that performs upsert the `current` >>> table using `raw` table >>> >>> We are exploring whether it’s feasible to stream upserts directly into >>> an Iceberg table from Flink. This could simplify our architecture and >>> potentially further reduce our data SLA. We’ve experimented with this >>> approach before, but ran into reader-side performance issues due to the >>> accumulation of equality deletes over time. >>> >>> From what I can gather, streaming upserts still seems to be an open >>> design area: >>> >>> 1. (Please correct me if I’m wrong—this summary is partly based on >>> ChatGPT 5.1.) The book “Apache Iceberg: The Definitive Guide” suggests the >>> two-table pattern we’re currently using in production. >>> 2. These threads: >>> https://lists.apache.org/thread/gjjr30txq318qp6pff3x5fx1jmdnr6fv , >>> https://lists.apache.org/thread/xdkzllzt4p3tvcd3ft4t7jsvyvztr41j discuss >>> the idea of outputting only positional deletes (no equality deletes) by >>> introducing an index. However, this appears to still be under discussion >>> and may be targeted for v4, with no concrete timeline yet. >>> 3. this thread >>> https://lists.apache.org/thread/6fhpjszsfxd8p0vfzc3k5vw7zmcyv2mq talks >>> about deprecating equality deletes, but I haven’t seen a clearly defined >>> alternative come out of that discussion yet. >>> >>> Given all of the above, I’d really appreciate guidance from the >>> community on: >>> >>> 1. Recommended patterns for streaming upserts with Flink into Iceberg >>> today (it's good to know the long term possible as well, but my focus is >>> what's possible in near term). >>> 2. Practical experiences or lessons learned from teams running streaming >>> upserts in production >>> >>> Thanks in advance for any insights and corrections. >>> >>> Best >>> Lu >>> >>

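On the equality-delete path Steven describes, the "aggressive compactions" in practice mean scheduling maintenance jobs like the sketch below. The Iceberg Spark procedures shown are real, but the catalog and table names and the threshold value are placeholder assumptions about a typical setup:

    // Sketch: maintenance for a CDC table accumulating delete files.
    // Catalog/table names and option values are hypothetical placeholders.
    import org.apache.spark.sql.SparkSession;

    public class CompactCdcTable {
      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("iceberg-cdc-compaction")
            .getOrCreate();

        // Bin-pack rewrite of data files: rewriting a data file applies its
        // pending deletes, so readers stop paying the equality-delete cost.
        // 'delete-file-threshold' = '1' rewrites any file with >= 1 delete file.
        spark.sql(
            "CALL my_catalog.system.rewrite_data_files("
          + "  table => 'db.orders_current',"
          + "  options => map('delete-file-threshold', '1')"
          + ")");

        // Position delete files left behind can be compacted separately.
        spark.sql(
            "CALL my_catalog.system.rewrite_position_delete_files("
          + "  table => 'db.orders_current'"
          + ")");

        spark.stop();
      }
    }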