> Flink CDC [1] now provides full-DB sync and schema evolution ability as a
pipeline job.

Ah!  That is very cool.


> Iceberg sink support was suggested before, and we’re trying to implement
this in the next few releases. Does it cover the use-cases you mentioned?


Yes!  That would be fantastic.
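For concreteness: a full-database sync of the kind described is defined as a single pipeline YAML file. A minimal sketch (hostnames, credentials, and table patterns are placeholders; the sink here is Doris, since an Iceberg pipeline connector does not exist yet):

```yaml
# Sketch of a Flink CDC pipeline definition (values are placeholders).
source:
  type: mysql
  hostname: localhost
  port: 3306
  username: admin
  password: "***"
  # Sync every table in app_db in one pipeline job
  tables: app_db.\.*

sink:
  type: doris
  fenodes: 127.0.0.1:8030
  username: root
  password: ""

pipeline:
  name: Sync app_db
  parallelism: 2
```

Submitted with `bin/flink-cdc.sh <pipeline>.yaml`; source-side schema changes are propagated to the sink by the pipeline job itself.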




On Sun, May 26, 2024 at 11:56 PM Hang Ruan <ruanhang1...@gmail.com> wrote:

> Hi, all.
>
> Flink CDC provides the schema evolution ability to sync the entire
> database. I think it could satisfy your needs.
> Flink CDC pipeline sources and sinks are listed in [1]. An Iceberg pipeline
> connector is not provided yet.
>
> > What is not is the automatic syncing of entire databases, with schema
> evolution and detection of new (and dropped?) tables. :)
>
> Flink CDC is able to sync an entire database with schema evolution. If a
> new table is added to the database, the running pipeline job cannot sync
> it automatically.
> But we can enable 'scan.newly-added-table.enabled' and restart the job
> from a savepoint to pick up the new tables.
> This feature is not yet released for the MySQL pipeline connector, but the
> PR [2] has already been opened.
>
> Best,
> Hang
>
> [1]
> https://nightlies.apache.org/flink/flink-cdc-docs-release-3.1/docs/connectors/pipeline-connectors/overview/
> [2] https://github.com/apache/flink-cdc/pull/3347
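The restart flow described above can be sketched as follows; the option name comes from the message, while connection details are placeholders and specifics may differ by release:

```yaml
# Source definition with discovery of newly added tables enabled
# (illustrative sketch; see the MySQL pipeline connector docs for details)
source:
  type: mysql
  hostname: localhost
  port: 3306
  username: admin
  password: "***"
  tables: app_db.\.*
  scan.newly-added-table.enabled: true
```

The running job is then stopped with a savepoint (for example `bin/flink stop --savepointPath /tmp/savepoints <job-id>`) and re-submitted from that savepoint, at which point the newly added tables are picked up.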
>
> Xiqian YU <kono....@outlook.com> 于2024年5月27日周一 10:04写道:
>
>> Hi Otto,
>>
>>
>>
>> Flink CDC [1] now provides full-DB sync and schema evolution ability as a
>> pipeline job. Iceberg sink support was suggested before [2], and we’re trying
>> to implement this in the next few releases. Does it cover the use-cases you
>> mentioned?
>>
>>
>>
>> [1] https://nightlies.apache.org/flink/flink-cdc-docs-stable/
>>
>> [2] https://issues.apache.org/jira/browse/FLINK-34840
>>
>>
>>
>> Regards,
>>
>> Xiqian
>>
>>
>>
>>
>>
>> *De : *Andrew Otto <o...@wikimedia.org>
>> *Date : *vendredi, 24 mai 2024 à 23:06
>> *À : *Giannis Polyzos <ipolyzos...@gmail.com>
>> *Cc : *Carlos Sanabria Miranda <sanabria.miranda.car...@gmail.com>,
>> Oscar Perez via user <user@flink.apache.org>, Péter Váry <
>> peter.vary.apa...@gmail.com>, mbala...@apache.org <mbala...@apache.org>
>> *Objet : *Re: "Self-service ingestion pipelines with evolving schema via
>> Flink and Iceberg" presentation recording from Flink Forward Seattle 2023
>>
>> Indeed, using Flink-CDC to write to Flink Sink Tables, including Iceberg,
>> is supported.
>>
>>
>>
>> What is not is the automatic syncing of entire databases, with schema
>> evolution and detection of new (and dropped?) tables.  :)
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On Fri, May 24, 2024 at 8:58 AM Giannis Polyzos <ipolyzos...@gmail.com>
>> wrote:
>>
>> https://nightlies.apache.org/flink/flink-cdc-docs-stable/
>>
>> All these features come from Flink CDC itself. Because Paimon and Flink
>> CDC are projects native to Flink, there is strong integration between them.
>>
>> (I believe it’s on the roadmap to support iceberg as well)
>>
>>
>>
>> On Fri, 24 May 2024 at 3:52 PM, Andrew Otto <o...@wikimedia.org> wrote:
>>
>> > I’m curious if there is any reason for choosing Iceberg instead of
>> Paimon
>>
>>
>> No technical reason that I'm aware of.  We are using it mostly because of
>> momentum.  We looked at Flink Table Store (before it was Paimon), but
>> decided it was too early and the docs were too sparse at the time to really
>> consider it.
>>
>>
>>
>> > Especially for a use case like CDC that iceberg struggles to support.
>>
>>
>>
>> We aren't doing any CDC right now (for many reasons), but I have never
>> seen a feature like Paimon's database sync before.  One job to sync and
>> evolve an entire database?  That is amazing.
>>
>>
>>
>> If we could do this with Iceberg, we might be able to make an argument to
>> product managers to push for CDC.
>>
>>
>>
>>
>>
>>
>>
>> On Fri, May 24, 2024 at 8:36 AM Giannis Polyzos <ipolyzos...@gmail.com>
>> wrote:
>>
>> I’m curious if there is any reason for choosing Iceberg instead of Paimon
>> (other than Iceberg being more popular).
>>
>> Especially for a use case like CDC that iceberg struggles to support.
>>
>>
>>
>> On Fri, 24 May 2024 at 3:22 PM, Andrew Otto <o...@wikimedia.org> wrote:
>>
>> Interesting thank you!
>>
>>
>>
>> I asked this in the Paimon users group:
>>
>>
>>
>> How coupled to Paimon catalogs and tables is the CDC part of Paimon?
>> RichCdcMultiplexRecord
>> <https://github.com/apache/paimon/blob/cc7d308d166a945d8d498231ed8e2fc9c7a27fc5/paimon-flink/paimon-flink-cdc/src/main/java/org/apache/paimon/flink/sink/cdc/RichCdcMultiplexRecord.java>
>>  and
>> related code seem incredibly useful even outside of the context of the
>> Paimon table format.
>>
>>
>>
>> I'm asking because the database sync action
>> <https://paimon.apache.org/docs/master/flink/cdc-ingestion/mysql-cdc/#synchronizing-databases>
>>  feature
>> is amazing.  At the Wikimedia Foundation, we are on an all-in journey with
>> Iceberg.  I'm wondering how hard it would be to extract the CDC logic from
>> Paimon and abstract the Sink bits.
>>
>>
>>
>> Could the table/database sync with schema evolution (without Flink job
>> restarts!) potentially work with the Iceberg sink?
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On Thu, May 23, 2024 at 4:34 PM Péter Váry <peter.vary.apa...@gmail.com>
>> wrote:
>>
>> If I understand correctly, Paimon is sending `CdcRecord`-s [1] on the
>> wire which contain not only the data, but the schema as well.
>>
>> With Iceberg we currently only send the row data, and expect to receive
>> the schema on job start - this is more performant than sending the schema
>> all the time, but has the obvious issue that it cannot handle
>> schema changes. Another part of the dynamic schema synchronization is the
>> update of the Iceberg table schema - the schema should be updated for all
>> of the writers and the committer, but only a single schema change commit
>> is needed (allowed) on the Iceberg table.
>>
>>
>>
>> This is a very interesting, but non-trivial change.
>>
>>
>>
>> [1]
>> https://github.com/apache/paimon/blob/master/paimon-flink/paimon-flink-cdc/src/main/java/org/apache/paimon/flink/sink/cdc/CdcRecord.java
>>
>>
>>
>> Andrew Otto <o...@wikimedia.org> ezt írta (időpont: 2024. máj. 23., Cs,
>> 21:59):
>>
>> Ah I see, so just auto-restarting to pick up new stuff.
>>
>>
>>
>> I'd love to understand how Paimon does this.  They have a database sync
>> action
>> <https://paimon.apache.org/docs/master/flink/cdc-ingestion/mysql-cdc/#synchronizing-databases>
>> which will sync entire databases, handle schema evolution, and I'm pretty
>> sure (I think I saw this in my local test) also pick up new tables.
>>
>>
>>
>>
>> https://github.com/apache/paimon/blob/master/paimon-flink/paimon-flink-cdc/src/main/java/org/apache/paimon/flink/action/cdc/SyncDatabaseActionBase.java#L45
>>
>>
>>
>> I'm sure that Paimon table format is great, but at Wikimedia Foundation
>> we are on the Iceberg train.  Imagine if there was a flink-cdc full
>> database sync to Flink IcebergSink!
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On Thu, May 23, 2024 at 3:47 PM Péter Váry <peter.vary.apa...@gmail.com>
>> wrote:
>>
>> I will ask Marton about the slides.
>>
>>
>>
>> The solution was something like this in a nutshell:
>>
>> - Make sure that on job start the latest Iceberg schema is read from the
>> Iceberg table
>>
>> - Throw a SuppressRestartsException when data arrives with the wrong
>> schema
>>
>> - Use Flink Kubernetes Operator to restart your failed jobs by setting
>>
>> kubernetes.operator.job.restart.failed
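A minimal Java sketch of the guard described above, assuming the Flink runtime and Iceberg API on the classpath; everything other than `SuppressRestartsException`, `Schema#sameSchema`, and the operator config key is an illustrative name:

```java
import org.apache.flink.runtime.execution.SuppressRestartsException;
import org.apache.iceberg.Schema;

// Hypothetical helper: compares the Iceberg schema read at job start
// against the schema of incoming data, failing the job terminally
// on a mismatch.
public class SchemaGuard {
    private final Schema expected; // latest table schema, read on job start

    public SchemaGuard(Schema expected) {
        this.expected = expected;
    }

    public void check(Schema incoming) {
        if (!expected.sameSchema(incoming)) {
            // SuppressRestartsException bypasses Flink's restart strategy,
            // so the job fails for good. With
            // kubernetes.operator.job.restart.failed set to true, the Flink
            // Kubernetes Operator redeploys the job, which then re-reads
            // the latest Iceberg schema on startup.
            throw new SuppressRestartsException(
                new IllegalStateException("Schema changed; restart required"));
        }
    }
}
```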
>>
>>
>>
>> Thanks, Peter
>>
>>
>>
>> On Thu, May 23, 2024, 20:29 Andrew Otto <o...@wikimedia.org> wrote:
>>
>> Wow, I would LOVE to see this talk.  If there is no recording, perhaps
>> there are slides somewhere?
>>
>>
>>
>> On Thu, May 23, 2024 at 11:00 AM Carlos Sanabria Miranda <
>> sanabria.miranda.car...@gmail.com> wrote:
>>
>> Hi everyone!
>>
>>
>>
>> I have found the following presentation on the Flink Forward website:
>> "Self-service ingestion pipelines with evolving schema via Flink and Iceberg
>> <https://www.flink-forward.org/seattle-2023/agenda#self-service-ingestion-pipelines-with-evolving-schema-via-flink-and-iceberg>"
>> by Márton Balassi, from the 2023 conference in Seattle, but I cannot find
>> the recording anywhere. I have found the recordings of the other
>> presentations on the Ververica Academy website
>> <https://www.ververica.academy/app>, but not this one.
>>
>>
>>
>> Does anyone know where I can find it? Or at least the slides?
>>
>>
>>
>> We are using Flink with the Iceberg sink connector to write streaming
>> events to Iceberg tables, and we are researching how to handle schema
>> evolution properly. I saw that presentation and I thought it could be of
>> great help to us.
>>
>>
>>
>> Thanks in advance!
>>
>>
>>
>> Carlos
>>
>>
