I’m curious whether there is any reason for choosing Iceberg instead of Paimon (other than Iceberg being more popular), especially for a use case like CDC, which Iceberg struggles to support.
On Fri, 24 May 2024 at 3:22 PM, Andrew Otto <o...@wikimedia.org> wrote:

> Interesting, thank you!
>
> I asked this in the Paimon users group:
>
> How coupled to Paimon catalogs and tables is the CDC part of Paimon?
> RichCdcMultiplexRecord
> <https://github.com/apache/paimon/blob/cc7d308d166a945d8d498231ed8e2fc9c7a27fc5/paimon-flink/paimon-flink-cdc/src/main/java/org/apache/paimon/flink/sink/cdc/RichCdcMultiplexRecord.java>
> and related code seem incredibly useful even outside of the context of the
> Paimon table format.
>
> I'm asking because the database sync action
> <https://paimon.apache.org/docs/master/flink/cdc-ingestion/mysql-cdc/#synchronizing-databases>
> feature is amazing. At the Wikimedia Foundation, we are on an all-in journey
> with Iceberg. I'm wondering how hard it would be to extract the CDC logic
> from Paimon and abstract the Sink bits.
>
> Could the table/database sync with schema evolution (without Flink job
> restarts!) potentially work with the Iceberg sink?
>
> On Thu, May 23, 2024 at 4:34 PM Péter Váry <peter.vary.apa...@gmail.com>
> wrote:
>
>> If I understand correctly, Paimon is sending `CdcRecord`-s [1] on the
>> wire, which contain not only the data but the schema as well.
>> With Iceberg we currently only send the row data, and expect to receive
>> the schema on job start. This is more performant than sending the schema
>> all the time, but has the obvious issue that it cannot handle schema
>> changes. Another part of the dynamic schema synchronization is the
>> update of the Iceberg table schema: the schema should be updated for all
>> of the writers and the committer, but only a single schema change commit
>> is needed (allowed) on the Iceberg table.
>>
>> This is a very interesting, but non-trivial change.
>>
>> [1]
>> https://github.com/apache/paimon/blob/master/paimon-flink/paimon-flink-cdc/src/main/java/org/apache/paimon/flink/sink/cdc/CdcRecord.java
>>
>> On Thu, May 23, 2024 at 9:59 PM Andrew Otto <o...@wikimedia.org> wrote:
>>
>>> Ah I see, so just auto-restarting to pick up new stuff.
>>>
>>> I'd love to understand how Paimon does this. They have a database sync
>>> action
>>> <https://paimon.apache.org/docs/master/flink/cdc-ingestion/mysql-cdc/#synchronizing-databases>
>>> which will sync entire databases, handle schema evolution, and I'm pretty
>>> sure (I think I saw this in my local test) also pick up new tables.
>>>
>>> https://github.com/apache/paimon/blob/master/paimon-flink/paimon-flink-cdc/src/main/java/org/apache/paimon/flink/action/cdc/SyncDatabaseActionBase.java#L45
>>>
>>> I'm sure that the Paimon table format is great, but at the Wikimedia
>>> Foundation we are on the Iceberg train. Imagine if there were a flink-cdc
>>> full database sync to the Flink IcebergSink!
>>>
>>> On Thu, May 23, 2024 at 3:47 PM Péter Váry <peter.vary.apa...@gmail.com>
>>> wrote:
>>>
>>>> I will ask Marton about the slides.
>>>>
>>>> The solution was something like this in a nutshell:
>>>> - Make sure that on job start the latest Iceberg schema is read from
>>>>   the Iceberg table
>>>> - Throw a SuppressRestartsException when data arrives with the wrong
>>>>   schema
>>>> - Use the Flink Kubernetes Operator to restart your failed jobs by
>>>>   setting kubernetes.operator.job.restart.failed
>>>>
>>>> Thanks, Peter
>>>>
>>>> On Thu, May 23, 2024, 20:29 Andrew Otto <o...@wikimedia.org> wrote:
>>>>
>>>>> Wow, I would LOVE to see this talk. If there is no recording, perhaps
>>>>> there are slides somewhere?
>>>>>
>>>>> On Thu, May 23, 2024 at 11:00 AM Carlos Sanabria Miranda <
>>>>> sanabria.miranda.car...@gmail.com> wrote:
>>>>>
>>>>>> Hi everyone!
>>>>>>
>>>>>> I have found on the Flink Forward website the following presentation:
>>>>>> "Self-service ingestion pipelines with evolving schema via Flink and
>>>>>> Iceberg
>>>>>> <https://www.flink-forward.org/seattle-2023/agenda#self-service-ingestion-pipelines-with-evolving-schema-via-flink-and-iceberg>"
>>>>>> by Márton Balassi, from the 2023 conference in Seattle, but I cannot
>>>>>> find the recording anywhere. I have found the recordings of the other
>>>>>> presentations on the Ververica Academy website
>>>>>> <https://www.ververica.academy/app>, but not this one.
>>>>>>
>>>>>> Does anyone know where I can find it? Or at least the slides?
>>>>>>
>>>>>> We are using Flink with the Iceberg sink connector to write streaming
>>>>>> events to Iceberg tables, and we are researching how to handle schema
>>>>>> evolution properly. I saw that presentation and thought it could be of
>>>>>> great help to us.
>>>>>>
>>>>>> Thanks in advance!
>>>>>>
>>>>>> Carlos
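The core difference Péter describes earlier in the thread (Paimon ships the schema with every CDC record, while the Iceberg sink fixes the schema at job start) can be sketched in plain Java. This is a minimal, hypothetical model, not Paimon's actual `CdcRecord`/`RichCdcMultiplexRecord` API: the class names, fields, and the `evolveIfNeeded` helper are illustrative only.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SchemaAwareCdc {

    // Hypothetical record type, loosely modeled on the idea behind Paimon's
    // CDC records: the schema travels with the data, so a sink can notice
    // schema drift per record instead of failing against a schema that was
    // fixed when the job started.
    record CdcEvent(String database,
                    String table,
                    LinkedHashMap<String, String> schema,  // field name -> type
                    Map<String, String> fields) {          // field name -> value
    }

    // Sketch of what a schema-evolving sink could do: compare the incoming
    // record's schema with the schema it currently writes with, and widen the
    // sink schema (here: just add the new columns) before writing the row.
    static boolean evolveIfNeeded(LinkedHashMap<String, String> sinkSchema,
                                  CdcEvent event) {
        if (event.schema().equals(sinkSchema)) {
            return false; // schemas match, nothing to do
        }
        event.schema().forEach(sinkSchema::putIfAbsent);
        return true; // a single schema-change commit would happen here
    }

    public static void main(String[] args) {
        LinkedHashMap<String, String> sinkSchema = new LinkedHashMap<>();
        sinkSchema.put("id", "BIGINT");

        LinkedHashMap<String, String> newSchema = new LinkedHashMap<>(sinkSchema);
        newSchema.put("name", "STRING"); // upstream added a column

        CdcEvent event = new CdcEvent("shop", "orders", newSchema,
                Map.of("id", "42", "name", "alice"));

        boolean evolved = evolveIfNeeded(sinkSchema, event);
        System.out.println(evolved + " " + sinkSchema.keySet());
        // prints: true [id, name]
    }
}
```

In a real Iceberg sink this comparison would be the easy part: as Péter notes, the updated schema must reach all writers and the committer, while only a single schema-change commit to the Iceberg table is allowed, so the evolution step needs coordination rather than a local map update.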