https://nightlies.apache.org/flink/flink-cdc-docs-stable/

All these features come from Flink CDC itself. Because Paimon and Flink CDC are both projects native to Flink, there is strong integration between them. (I believe it’s on the roadmap to support Iceberg as well.)
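For example, a Flink CDC 3.x pipeline definition that syncs a whole MySQL database into Paimon looks roughly like this (values are illustrative and option names may differ between versions, so treat it as a sketch rather than a tested config):

    # pipeline.yaml -- submitted with bin/flink-cdc.sh pipeline.yaml
    source:
      type: mysql
      hostname: localhost
      port: 3306
      username: root
      password: "****"
      tables: app_db.\.*   # regex: every table in app_db

    sink:
      type: paimon
      catalog.properties.metastore: filesystem
      catalog.properties.warehouse: /path/to/warehouse

    pipeline:
      name: Sync app_db to Paimon
      parallelism: 2

One job, a whole database, and schema changes propagated without restarts.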
On Fri, 24 May 2024 at 3:52 PM, Andrew Otto <o...@wikimedia.org> wrote:

> > I’m curious if there is any reason for choosing Iceberg instead of Paimon
>
> No technical reason that I'm aware of. We are using it mostly because of momentum. We looked at Flink Table Store (before it was Paimon), but decided it was too early and the docs were too sparse at the time to really consider it.
>
> > Especially for a use case like CDC, which Iceberg struggles to support.
>
> We aren't doing any CDC right now (for many reasons), but I have never seen a feature like Paimon's database sync before. One job to sync and evolve an entire database? That is amazing.
>
> If we could do this with Iceberg, we might be able to make an argument to product managers to push for CDC.
>
> On Fri, May 24, 2024 at 8:36 AM Giannis Polyzos <ipolyzos...@gmail.com> wrote:
>
>> I’m curious if there is any reason for choosing Iceberg instead of Paimon (other than Iceberg being more popular). Especially for a use case like CDC, which Iceberg struggles to support.
>>
>> On Fri, 24 May 2024 at 3:22 PM, Andrew Otto <o...@wikimedia.org> wrote:
>>
>>> Interesting, thank you!
>>>
>>> I asked this in the Paimon users group: how coupled to Paimon catalogs and tables is the CDC part of Paimon? RichCdcMultiplexRecord <https://github.com/apache/paimon/blob/cc7d308d166a945d8d498231ed8e2fc9c7a27fc5/paimon-flink/paimon-flink-cdc/src/main/java/org/apache/paimon/flink/sink/cdc/RichCdcMultiplexRecord.java> and related code seem incredibly useful even outside of the context of the Paimon table format.
>>>
>>> I'm asking because the database sync action <https://paimon.apache.org/docs/master/flink/cdc-ingestion/mysql-cdc/#synchronizing-databases> feature is amazing. At the Wikimedia Foundation we are on an all-in journey with Iceberg, and I'm wondering how hard it would be to extract the CDC logic from Paimon and abstract the sink bits.
>>>
>>> Could the table/database sync with schema evolution (without Flink job restarts!) potentially work with the Iceberg sink?
>>>
>>> On Thu, May 23, 2024 at 4:34 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>
>>>> If I understand correctly, Paimon is sending `CdcRecord`s [1] on the wire, which contain not only the data but the schema as well. With Iceberg we currently only send the row data and expect to receive the schema on job start - this is more performant than sending the schema all the time, but it has the obvious limitation that it cannot handle schema changes. Another part of dynamic schema synchronization is updating the Iceberg table schema itself: the schema should be updated for all of the writers and the committer, but only a single schema-change commit is needed (allowed) on the Iceberg table.
>>>>
>>>> This is a very interesting, but non-trivial change.
>>>>
>>>> [1] https://github.com/apache/paimon/blob/master/paimon-flink/paimon-flink-cdc/src/main/java/org/apache/paimon/flink/sink/cdc/CdcRecord.java
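To make that contrast concrete, the two wire formats differ roughly as below. This is a minimal Java sketch with simplified stand-ins; these are not Paimon's actual CdcRecord or Iceberg's actual RowData classes.

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class WireFormats {

        enum RowKind { INSERT, UPDATE_AFTER, DELETE }

        // Paimon-style: every record carries its field names, so a writer can
        // notice a column it has never seen before and apply schema evolution
        // without being redeployed.
        static class SchemafulCdcRecord {
            final RowKind kind;
            final Map<String, String> fields; // field name -> serialized value

            SchemafulCdcRecord(RowKind kind, Map<String, String> fields) {
                this.kind = kind;
                this.fields = fields;
            }
        }

        // Iceberg-style: positional values only; the meaning of position i is
        // fixed by the schema read at job start, so a newly added column has
        // no way to be represented until the job restarts.
        static class PositionalRow {
            final RowKind kind;
            final Object[] values; // index i maps to schema field i

            PositionalRow(RowKind kind, Object[] values) {
                this.kind = kind;
                this.values = values;
            }
        }

        public static void main(String[] args) {
            Map<String, String> fields = new LinkedHashMap<>();
            fields.put("id", "42");
            fields.put("name", "flink");
            fields.put("new_col", "x"); // column added upstream: self-describing
            SchemafulCdcRecord paimonish = new SchemafulCdcRecord(RowKind.INSERT, fields);

            // The same row positionally: there is simply no slot for new_col
            // until the schema is updated and the job restarted.
            PositionalRow icebergish = new PositionalRow(RowKind.INSERT, new Object[] {42L, "flink"});

            System.out.println(paimonish.fields.keySet() + " vs " + icebergish.values.length + " positional values");
        }
    }

Sending the names (or a schema id) with every record is the price Paimon pays for being able to evolve mid-stream; the positional format is cheaper on the wire but frozen at job start, which is exactly the trade-off Péter describes.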
>>>> On Thu, May 23, 2024 at 21:59, Andrew Otto <o...@wikimedia.org> wrote:
>>>>
>>>>> Ah I see, so just auto-restarting to pick up new stuff.
>>>>>
>>>>> I'd love to understand how Paimon does this. They have a database sync action <https://paimon.apache.org/docs/master/flink/cdc-ingestion/mysql-cdc/#synchronizing-databases> which will sync entire databases, handle schema evolution, and, I'm pretty sure (I think I saw this in my local test), also pick up new tables.
>>>>>
>>>>> https://github.com/apache/paimon/blob/master/paimon-flink/paimon-flink-cdc/src/main/java/org/apache/paimon/flink/action/cdc/SyncDatabaseActionBase.java#L45
>>>>>
>>>>> I'm sure the Paimon table format is great, but at the Wikimedia Foundation we are on the Iceberg train. Imagine if there were a flink-cdc full-database sync to the Flink IcebergSink!
>>>>>
>>>>> On Thu, May 23, 2024 at 3:47 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>>
>>>>>> I will ask Marton about the slides.
>>>>>>
>>>>>> The solution was something like this in a nutshell:
>>>>>> - Make sure that on job start the latest Iceberg schema is read from the Iceberg table
>>>>>> - Throw a SuppressRestartsException when data arrives with the wrong schema
>>>>>> - Use the Flink Kubernetes Operator to restart your failed jobs by setting kubernetes.operator.job.restart.failed
>>>>>>
>>>>>> Thanks, Peter
>>>>>>
>>>>>> On Thu, May 23, 2024, 20:29 Andrew Otto <o...@wikimedia.org> wrote:
>>>>>>
>>>>>>> Wow, I would LOVE to see this talk. If there is no recording, perhaps there are slides somewhere?
>>>>>>>
>>>>>>> On Thu, May 23, 2024 at 11:00 AM Carlos Sanabria Miranda <sanabria.miranda.car...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi everyone!
>>>>>>>>
>>>>>>>> I have found the following presentation on the Flink Forward website: "Self-service ingestion pipelines with evolving schema via Flink and Iceberg <https://www.flink-forward.org/seattle-2023/agenda#self-service-ingestion-pipelines-with-evolving-schema-via-flink-and-iceberg>" by Márton Balassi, from the 2023 conference in Seattle, but I cannot find the recording anywhere. I have found the recordings of the other presentations on the Ververica Academy website <https://www.ververica.academy/app>, but not this one.
>>>>>>>>
>>>>>>>> Does anyone know where I can find it? Or at least the slides?
>>>>>>>>
>>>>>>>> We are using Flink with the Iceberg sink connector to write streaming events to Iceberg tables, and we are researching how to handle schema evolution properly. I saw that presentation and thought it could be of great help to us.
>>>>>>>>
>>>>>>>> Thanks in advance!
>>>>>>>>
>>>>>>>> Carlos
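Coming back to Péter's restart-based recipe above: in code it comes down to roughly the following sketch. The RowWithSchema envelope and the fingerprint check are assumed, illustrative names; SuppressRestartsException, TableLoader, and the kubernetes.operator.job.restart.failed setting are the real pieces.

    import org.apache.flink.api.common.functions.RichMapFunction;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.runtime.execution.SuppressRestartsException;
    import org.apache.iceberg.Schema;
    import org.apache.iceberg.flink.TableLoader;

    // Hypothetical envelope: a row plus a fingerprint of the schema it was
    // produced with (e.g. a hash of the upstream DDL). Not a real Flink type.
    class RowWithSchema {
        final Object row;
        final String schemaFingerprint;

        RowWithSchema(Object row, String schemaFingerprint) {
            this.row = row;
            this.schemaFingerprint = schemaFingerprint;
        }
    }

    // Terminally fails the job when a record does not match the schema that
    // was current at job start. With the Flink Kubernetes Operator running
    // with kubernetes.operator.job.restart.failed: true, the failed job is
    // restarted and re-reads the (by now updated) Iceberg schema in open().
    public class SchemaGuard extends RichMapFunction<RowWithSchema, RowWithSchema> {

        private final String tablePath;
        private transient String expectedFingerprint;

        public SchemaGuard(String tablePath) {
            this.tablePath = tablePath;
        }

        @Override
        public void open(Configuration parameters) throws Exception {
            // Step 1 of the recipe: read the latest Iceberg schema on job start.
            try (TableLoader loader = TableLoader.fromHadoopTable(tablePath)) {
                loader.open();
                Schema schema = loader.loadTable().schema();
                // Illustrative fingerprint; any stable schema digest would do.
                expectedFingerprint = Integer.toHexString(schema.toString().hashCode());
            }
        }

        @Override
        public RowWithSchema map(RowWithSchema value) {
            // Step 2: wrong-schema data fails the job, bypassing Flink's
            // restart strategy so the failure reaches the operator.
            if (!expectedFingerprint.equals(value.schemaFingerprint)) {
                throw new SuppressRestartsException(new IllegalStateException(
                        "Schema changed upstream; failing job for operator restart"));
            }
            return value;
        }
    }

Step 3 is pure operator configuration on the FlinkDeployment, so nothing in the job code itself needs to know about Kubernetes.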