Flink CDC 3.0 focuses on data integration scenarios, so you don't need to
pay attention to the framework internals: you just describe the data source
and target in a YAML file to quickly build a data synchronization task with
schema evolution. It also supports a rich set of source and sink connectors.
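A pipeline definition of the kind described above looks roughly like this. This is only a sketch: the connection details are placeholders, and the `iceberg` sink type is hypothetical here, since it is exactly the pipeline connector this thread proposes to add (today's Flink CDC docs list sinks such as Doris and Paimon):

```yaml
# Sketch of a Flink CDC 3.0 pipeline definition.
# Hostnames, credentials, and the iceberg sink type are placeholders.
source:
  type: mysql
  hostname: mysql-host
  port: 3306
  username: flinkcdc
  password: "****"
  tables: app_db.\.*        # capture every table in app_db

sink:
  type: iceberg             # hypothetical pipeline connector (the proposal here)
  warehouse: s3://warehouse/

pipeline:
  name: Sync app_db to Iceberg
  parallelism: 2
```

The framework then wires the captured tables to the sink and propagates schema changes, without the user writing any DataStream code.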

So I think adding a pipeline sink connector for Iceberg would be the better
way to go. We can rely on the Flink CDC 3.0 framework to connect the source
and target easily.

Best,
Zhongqiang Gong


Péter Váry <peter.vary.apa...@gmail.com> wrote on Sat, May 25, 2024, 19:29:

> > Could the table/database sync with schema evolution (without Flink job
> restarts!) potentially work with the Iceberg sink?
>
> Making this work would be a good addition to the Iceberg-Flink connector.
> It is definitely doable, but not a single-PR-sized task. If you want to try
> your hand at it, I will try to find time to review your plans/code, so
> your code could be incorporated into the upcoming releases.
>
> Thanks,
> Peter
>
>
>
> On Fri, May 24, 2024, 17:07 Andrew Otto <o...@wikimedia.org> wrote:
>
>> > What is not is the automatic syncing of entire databases, with schema
>> evolution and detection of new (and dropped?) tables. :)
>> Wait.  Is it?
>>
>> > Flink CDC supports synchronizing all tables of source database
>> instance to downstream in one job by configuring the captured database list
>> and table list.
>>
>>
>> On Fri, May 24, 2024 at 11:04 AM Andrew Otto <o...@wikimedia.org> wrote:
>>
>>> Indeed, using Flink-CDC to write to Flink Sink Tables, including
>>> Iceberg, is supported.
>>>
>>> What is not is the automatic syncing of entire databases, with schema
>>> evolution and detection of new (and dropped?) tables.  :)
>>>
>>>
>>>
>>>
>>> On Fri, May 24, 2024 at 8:58 AM Giannis Polyzos <ipolyzos...@gmail.com>
>>> wrote:
>>>
>>>> https://nightlies.apache.org/flink/flink-cdc-docs-stable/
>>>> All these features come from Flink CDC itself. Because Paimon and Flink
>>>> CDC are projects native to Flink, there is a strong integration between
>>>> them.
>>>> (I believe it’s on the roadmap to support iceberg as well)
>>>>
>>>> On Fri, 24 May 2024 at 3:52 PM, Andrew Otto <o...@wikimedia.org> wrote:
>>>>
>>>>> > I’m curious if there is any reason for choosing Iceberg instead of
>>>>> Paimon
>>>>>
>>>>> No technical reason that I'm aware of.  We are using it mostly because
>>>>> of momentum.  We looked at Flink Table Store (before it was Paimon), but
>>>>> decided it was too early and the docs were too sparse at the time to 
>>>>> really
>>>>> consider it.
>>>>>
>>>>> > Especially for a use case like CDC that iceberg struggles to
>>>>> support.
>>>>>
>>>>> We aren't doing any CDC right now (for many reasons), but I have never
>>>>> seen a feature like Paimon's database sync before.  One job to sync and
>>>>> evolve an entire database?  That is amazing.
>>>>>
>>>>> If we could do this with Iceberg, we might be able to make an argument
>>>>> to product managers to push for CDC.
>>>>>
>>>>>
>>>>>
>>>>> On Fri, May 24, 2024 at 8:36 AM Giannis Polyzos <ipolyzos...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I’m curious if there is any reason for choosing Iceberg instead of
>>>>>> Paimon (other than - iceberg is more popular).
>>>>>> Especially for a use case like CDC that iceberg struggles to support.
>>>>>>
>>>>>> On Fri, 24 May 2024 at 3:22 PM, Andrew Otto <o...@wikimedia.org>
>>>>>> wrote:
>>>>>>
>>>>>>> Interesting thank you!
>>>>>>>
>>>>>>> I asked this in the Paimon users group:
>>>>>>>
>>>>>>> How coupled to Paimon catalogs and tables is the cdc part of
>>>>>>> Paimon?  RichCdcMultiplexRecord
>>>>>>> <https://github.com/apache/paimon/blob/cc7d308d166a945d8d498231ed8e2fc9c7a27fc5/paimon-flink/paimon-flink-cdc/src/main/java/org/apache/paimon/flink/sink/cdc/RichCdcMultiplexRecord.java>
>>>>>>>  and
>>>>>>> related code seem incredibly useful even outside of the context of the
>>>>>>> Paimon table format.
>>>>>>>
>>>>>>> I'm asking because the database sync action
>>>>>>> <https://paimon.apache.org/docs/master/flink/cdc-ingestion/mysql-cdc/#synchronizing-databases>
>>>>>>>  feature
>>>>>>> is amazing.  At the Wikimedia Foundation, we are on an all-in journey 
>>>>>>> with
>>>>>>> Iceberg.  I'm wondering how hard it would be to extract the CDC logic 
>>>>>>> from
>>>>>>> Paimon and abstract the Sink bits.
>>>>>>>
>>>>>>> Could the table/database sync with schema evolution (without Flink
>>>>>>> job restarts!) potentially work with the Iceberg sink?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, May 23, 2024 at 4:34 PM Péter Váry <
>>>>>>> peter.vary.apa...@gmail.com> wrote:
>>>>>>>
>>>>>>>> If I understand correctly, Paimon is sending `CdcRecord`-s [1] on
>>>>>>>> the wire which contain not only the data, but the schema as well.
>>>>>>>> With Iceberg we currently only send the row data, and expect to
>>>>>>>> receive the schema on job start - this is more performant than sending the
>>>>>>>> schema all the time, but has the obvious issue that it cannot
>>>>>>>> handle schema changes. Another part of the dynamic schema
>>>>>>>> synchronization is the update of the Iceberg table schema - the schema
>>>>>>>> should be updated for all of the writers and the committer, but only a
>>>>>>>> single schema-change commit is needed (allowed) on the Iceberg table.
>>>>>>>>
>>>>>>>> This is a very interesting, but non-trivial change.
>>>>>>>>
>>>>>>>> [1]
>>>>>>>> https://github.com/apache/paimon/blob/master/paimon-flink/paimon-flink-cdc/src/main/java/org/apache/paimon/flink/sink/cdc/CdcRecord.java
>>>>>>>>
>>>>>>>> Andrew Otto <o...@wikimedia.org> wrote on Thu, May 23, 2024, 21:59:
>>>>>>>>
>>>>>>>>> Ah I see, so just auto-restarting to pick up new stuff.
>>>>>>>>>
>>>>>>>>> I'd love to understand how Paimon does this.  They have a database
>>>>>>>>> sync action
>>>>>>>>> <https://paimon.apache.org/docs/master/flink/cdc-ingestion/mysql-cdc/#synchronizing-databases>
>>>>>>>>> which will sync entire databases, handle schema evolution, and I'm 
>>>>>>>>> pretty
>>>>>>>>> sure (I think I saw this in my local test) also pick up new tables.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> https://github.com/apache/paimon/blob/master/paimon-flink/paimon-flink-cdc/src/main/java/org/apache/paimon/flink/action/cdc/SyncDatabaseActionBase.java#L45
>>>>>>>>>
>>>>>>>>> I'm sure that Paimon table format is great, but at Wikimedia
>>>>>>>>> Foundation we are on the Iceberg train.  Imagine if there was a 
>>>>>>>>> flink-cdc
>>>>>>>>> full database sync to Flink IcebergSink!
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, May 23, 2024 at 3:47 PM Péter Váry <
>>>>>>>>> peter.vary.apa...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> I will ask Marton about the slides.
>>>>>>>>>>
>>>>>>>>>> The solution was something like this in a nutshell:
>>>>>>>>>> - Make sure that on job start the latest Iceberg schema is read
>>>>>>>>>> from the Iceberg table
>>>>>>>>>> - Throw a SuppressRestartsException when data arrives with the
>>>>>>>>>> wrong schema
>>>>>>>>>> - Use Flink Kubernetes Operator to restart your failed jobs by
>>>>>>>>>> setting
>>>>>>>>>> kubernetes.operator.job.restart.failed
>>>>>>>>>>
>>>>>>>>>> Thanks, Peter
>>>>>>>>>>
>>>>>>>>>> On Thu, May 23, 2024, 20:29 Andrew Otto <o...@wikimedia.org>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Wow, I would LOVE to see this talk.  If there is no recording,
>>>>>>>>>>> perhaps there are slides somewhere?
>>>>>>>>>>>
>>>>>>>>>>> On Thu, May 23, 2024 at 11:00 AM Carlos Sanabria Miranda <
>>>>>>>>>>> sanabria.miranda.car...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi everyone!
>>>>>>>>>>>>
>>>>>>>>>>>> I have found in the Flink Forward website the following
>>>>>>>>>>>> presentation: "Self-service ingestion pipelines with evolving
>>>>>>>>>>>> schema via Flink and Iceberg
>>>>>>>>>>>> <https://www.flink-forward.org/seattle-2023/agenda#self-service-ingestion-pipelines-with-evolving-schema-via-flink-and-iceberg>"
>>>>>>>>>>>> by Márton Balassi from the 2023 conference in Seattle, but I 
>>>>>>>>>>>> cannot find
>>>>>>>>>>>> the recording anywhere. I have found the recordings of the other
>>>>>>>>>>>> presentations in the Ververica Academy website
>>>>>>>>>>>> <https://www.ververica.academy/app>, but not this one.
>>>>>>>>>>>>
>>>>>>>>>>>> Does anyone know where I can find it? Or at least the slides?
>>>>>>>>>>>>
>>>>>>>>>>>> We are using Flink with the Iceberg sink connector to write
>>>>>>>>>>>> streaming events to Iceberg tables, and we are researching how to 
>>>>>>>>>>>> handle
>>>>>>>>>>>> schema evolution properly. I saw that presentation and I thought 
>>>>>>>>>>>> it could
>>>>>>>>>>>> be of great help to us.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks in advance!
>>>>>>>>>>>>
>>>>>>>>>>>> Carlos
>>>>>>>>>>>>
>>>>>>>>>>>
