> Flink CDC [1] now provides full-DB sync and schema evolution ability as a pipeline job.

Ah! That is very cool.

> Iceberg sink support was suggested before, and we’re trying to implement this in the next few releases. Does it cover the use-cases you mentioned?

Yes! That would be fantastic.

On Sun, May 26, 2024 at 11:56 PM Hang Ruan <ruanhang1...@gmail.com> wrote:

Hi, all.

Flink CDC provides the schema evolution ability to sync an entire database. I think it could satisfy your needs. Flink CDC pipeline sources and sinks are listed in [1]. An Iceberg pipeline connector is not provided yet.

> What is not is the automatic syncing of entire databases, with schema evolution and detection of new (and dropped?) tables. :)

Flink CDC is able to sync an entire database with schema evolution. If a new table is added to the database, the running pipeline job cannot sync it. But we can enable 'scan.newly-added-table.enabled' and restart the job from a savepoint to catch the new tables. This feature for the MySQL pipeline connector has not been released yet, but the PR [2] is already available.

Best,
Hang

[1] https://nightlies.apache.org/flink/flink-cdc-docs-release-3.1/docs/connectors/pipeline-connectors/overview/
[2] https://github.com/apache/flink-cdc/pull/3347
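For context, a Flink CDC pipeline job of the kind Hang describes is defined in YAML. A minimal sketch with the newly-added-table option enabled might look like the following (hostnames and credentials are placeholders, and the Doris sink merely stands in for the Iceberg sink, which does not exist yet):

    source:
      type: mysql
      hostname: localhost
      port: 3306
      username: flink_cdc
      password: "****"
      # Capture every table in app_db.
      tables: app_db.\.*
      server-id: 5400-5404
      # The option Hang mentions: per PR [2], also capture tables created
      # after the job started (requires a restart from a savepoint).
      scan.newly-added-table.enabled: true

    sink:
      # Placeholder sink; an Iceberg pipeline connector is not available yet.
      type: doris
      fenodes: 127.0.0.1:8030
      username: root
      password: "****"

    pipeline:
      name: Sync app_db to Doris
      parallelism: 2

Such a definition is submitted with the flink-cdc.sh wrapper; stopping with a savepoint and resubmitting is what lets the job pick up newly created tables.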
On Mon, May 27, 2024 at 10:04 AM Xiqian YU <kono....@outlook.com> wrote:

Hi Otto,

Flink CDC [1] now provides full-DB sync and schema evolution ability as a pipeline job. Iceberg sink support was suggested before, and we’re trying to implement this in the next few releases. Does it cover the use-cases you mentioned?

[1] https://nightlies.apache.org/flink/flink-cdc-docs-stable/
[2] https://issues.apache.org/jira/browse/FLINK-34840

Regards,
Xiqian

From: Andrew Otto <o...@wikimedia.org>
Date: Friday, May 24, 2024 at 23:06
To: Giannis Polyzos <ipolyzos...@gmail.com>
Cc: Carlos Sanabria Miranda <sanabria.miranda.car...@gmail.com>, Oscar Perez via user <user@flink.apache.org>, Péter Váry <peter.vary.apa...@gmail.com>, mbala...@apache.org <mbala...@apache.org>
Subject: Re: "Self-service ingestion pipelines with evolving schema via Flink and Iceberg" presentation recording from Flink Forward Seattle 2023

Indeed, using Flink CDC to write to Flink sink tables, including Iceberg, is supported.

What is not is the automatic syncing of entire databases, with schema evolution and detection of new (and dropped?) tables. :)

On Fri, May 24, 2024 at 8:58 AM Giannis Polyzos <ipolyzos...@gmail.com> wrote:

https://nightlies.apache.org/flink/flink-cdc-docs-stable/

All these features come from Flink CDC itself. Because Paimon and Flink CDC are projects native to Flink, there is a strong integration between them. (I believe it’s on the roadmap to support Iceberg as well.)

On Fri, May 24, 2024 at 3:52 PM Andrew Otto <o...@wikimedia.org> wrote:

> I’m curious if there is any reason for choosing Iceberg instead of Paimon

No technical reason that I'm aware of. We are using it mostly because of momentum. We looked at Flink Table Store (before it was Paimon), but decided it was too early and the docs were too sparse at the time to really consider it.

> Especially for a use case like CDC that Iceberg struggles to support.

We aren't doing any CDC right now (for many reasons), but I have never seen a feature like Paimon's database sync before. One job to sync and evolve an entire database? That is amazing.

If we could do this with Iceberg, we might be able to make an argument to product managers to push for CDC.
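For comparison, the Paimon database sync Andrew is describing is launched as a single Flink action job. A rough sketch of a whole-database MySQL sync (all paths and connection settings are placeholders) looks like this:

    <FLINK_HOME>/bin/flink run \
        /path/to/paimon-flink-action-<version>.jar \
        mysql_sync_database \
        --warehouse hdfs:///path/to/warehouse \
        --database app_db \
        --mysql_conf hostname=127.0.0.1 \
        --mysql_conf port=3306 \
        --mysql_conf username=root \
        --mysql_conf password='****' \
        --mysql_conf database-name=app_db \
        --table_conf bucket=4

One such job tails the binlog for every table in the source database and applies column additions to the matching Paimon tables as they arrive, which is the schema evolution behavior discussed above.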
On Fri, May 24, 2024 at 8:36 AM Giannis Polyzos <ipolyzos...@gmail.com> wrote:

I’m curious if there is any reason for choosing Iceberg instead of Paimon (other than Iceberg being more popular). Especially for a use case like CDC, which Iceberg struggles to support.

On Fri, May 24, 2024 at 3:22 PM Andrew Otto <o...@wikimedia.org> wrote:

Interesting, thank you!

I asked this in the Paimon users group:

How coupled to Paimon catalogs and tables is the CDC part of Paimon? RichCdcMultiplexRecord <https://github.com/apache/paimon/blob/cc7d308d166a945d8d498231ed8e2fc9c7a27fc5/paimon-flink/paimon-flink-cdc/src/main/java/org/apache/paimon/flink/sink/cdc/RichCdcMultiplexRecord.java> and related code seem incredibly useful even outside of the context of the Paimon table format.

I'm asking because the database sync action <https://paimon.apache.org/docs/master/flink/cdc-ingestion/mysql-cdc/#synchronizing-databases> feature is amazing. At the Wikimedia Foundation, we are on an all-in journey with Iceberg. I'm wondering how hard it would be to extract the CDC logic from Paimon and abstract the Sink bits.

Could the table/database sync with schema evolution (without Flink job restarts!) potentially work with the Iceberg sink?

On Thu, May 23, 2024 at 4:34 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:

If I understand correctly, Paimon is sending `CdcRecord`-s [1] on the wire, which contain not only the data but the schema as well.

With Iceberg we currently only send the row data, and expect to receive the schema on job start. This is more performant than sending the schema all the time, but has the obvious issue that it is not able to handle schema changes. Another part of the dynamic schema synchronization is the update of the Iceberg table schema: the schema should be updated for all of the writers and the committer, but only a single schema change commit is needed (allowed) on the Iceberg table.

This is a very interesting, but non-trivial change.

[1] https://github.com/apache/paimon/blob/master/paimon-flink/paimon-flink-cdc/src/main/java/org/apache/paimon/flink/sink/cdc/CdcRecord.java
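To illustrate the distinction Péter draws, here is a simplified, hypothetical sketch of a schema-carrying change record, loosely modeled on Paimon's CdcRecord and RichCdcMultiplexRecord (it is not the actual Paimon class). Because every record names its table and describes its own fields, a sink can create or evolve target tables on the fly; a wire format whose schema is fixed at job start cannot:

    import java.util.List;
    import java.util.Map;

    // Hypothetical, simplified shape of a schema-carrying CDC record.
    public final class SchemaAwareCdcRecord {
        public enum Kind { INSERT, UPDATE, DELETE }

        private final String database;
        private final String table;
        private final Kind kind;
        // The schema travels with the data: field name -> type name.
        private final Map<String, String> fieldTypes;
        private final List<String> primaryKeys;
        // The row itself: field name -> serialized value.
        private final Map<String, String> fields;

        public SchemaAwareCdcRecord(
                String database, String table, Kind kind,
                Map<String, String> fieldTypes, List<String> primaryKeys,
                Map<String, String> fields) {
            this.database = database;
            this.table = table;
            this.kind = kind;
            this.fieldTypes = fieldTypes;
            this.primaryKeys = primaryKeys;
            this.fields = fields;
        }
        // Getters omitted for brevity.
    }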
On Thu, May 23, 2024 at 21:59 Andrew Otto <o...@wikimedia.org> wrote:

Ah I see, so just auto-restarting to pick up new stuff.

I'd love to understand how Paimon does this. They have a database sync action <https://paimon.apache.org/docs/master/flink/cdc-ingestion/mysql-cdc/#synchronizing-databases> which will sync entire databases, handle schema evolution, and, I'm pretty sure (I think I saw this in my local test), also pick up new tables.

https://github.com/apache/paimon/blob/master/paimon-flink/paimon-flink-cdc/src/main/java/org/apache/paimon/flink/action/cdc/SyncDatabaseActionBase.java#L45

I'm sure the Paimon table format is great, but at the Wikimedia Foundation we are on the Iceberg train. Imagine if there was a flink-cdc full-database sync to the Flink IcebergSink!

On Thu, May 23, 2024 at 3:47 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:

I will ask Marton about the slides.

The solution was something like this in a nutshell:
- Make sure that on job start the latest Iceberg schema is read from the Iceberg table
- Throw a SuppressRestartsException when data arrives with the wrong schema
- Use the Flink Kubernetes Operator to restart failed jobs by setting kubernetes.operator.job.restart.failed

Thanks,
Peter

On Thu, May 23, 2024, 20:29 Andrew Otto <o...@wikimedia.org> wrote:

Wow, I would LOVE to see this talk. If there is no recording, perhaps there are slides somewhere?

On Thu, May 23, 2024 at 11:00 AM Carlos Sanabria Miranda <sanabria.miranda.car...@gmail.com> wrote:

Hi everyone!

I have found on the Flink Forward website the following presentation: "Self-service ingestion pipelines with evolving schema via Flink and Iceberg" <https://www.flink-forward.org/seattle-2023/agenda#self-service-ingestion-pipelines-with-evolving-schema-via-flink-and-iceberg> by Márton Balassi, from the 2023 conference in Seattle, but I cannot find the recording anywhere. I have found the recordings of the other presentations on the Ververica Academy website <https://www.ververica.academy/app>, but not this one.

Does anyone know where I can find it? Or at least the slides?

We are using Flink with the Iceberg sink connector to write streaming events to Iceberg tables, and we are researching how to handle schema evolution properly. I saw that presentation and thought it could be of great help to us.

Thanks in advance!

Carlos
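To make Péter's nutshell recipe above concrete, the failure step might be sketched as follows. This is a hypothetical guard, not code from the talk: SuppressRestartsException is the Flink exception that bypasses the configured restart strategy, and kubernetes.operator.job.restart.failed is the Flink Kubernetes Operator setting that redeploys jobs that reach the FAILED state.

    import org.apache.flink.runtime.execution.SuppressRestartsException;
    import org.apache.iceberg.Schema;

    // Hypothetical check inside a writer operator. tableSchemaAtJobStart was
    // loaded from the Iceberg table when the job started; recordSchema
    // describes an incoming record.
    final class SchemaDriftGuard {
        private final Schema tableSchemaAtJobStart;

        SchemaDriftGuard(Schema tableSchemaAtJobStart) {
            this.tableSchemaAtJobStart = tableSchemaAtJobStart;
        }

        void check(Schema recordSchema) {
            if (!tableSchemaAtJobStart.sameSchema(recordSchema)) {
                // Fail the job without triggering Flink's restart strategy.
                // With kubernetes.operator.job.restart.failed set to true,
                // the operator redeploys the job, which re-reads the (now
                // evolved) Iceberg schema on startup.
                throw new SuppressRestartsException(
                        new IllegalStateException(
                                "Record schema drifted from table schema"));
            }
        }
    }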