https://nightlies.apache.org/flink/flink-cdc-docs-stable/

All these features come from Flink CDC itself. Because Paimon and Flink CDC are both projects native to Flink, there is strong integration between them. (I believe it’s on the roadmap to support Iceberg as well.)
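For example, a Flink CDC 3.x pipeline definition that syncs a whole MySQL database into Paimon looks roughly like this (values are illustrative and option names may differ between versions, so treat it as a sketch rather than a tested config):

    # pipeline.yaml -- submitted with bin/flink-cdc.sh pipeline.yaml
    source:
      type: mysql
      hostname: localhost
      port: 3306
      username: root
      password: "****"
      tables: app_db.\.*   # regex: every table in app_db

    sink:
      type: paimon
      catalog.properties.metastore: filesystem
      catalog.properties.warehouse: /path/to/warehouse

    pipeline:
      name: Sync app_db to Paimon
      parallelism: 2

One job, a whole database, and schema changes propagated without restarts.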
On Fri, 24 May 2024 at 3:52 PM, Andrew Otto <o...@wikimedia.org> wrote:

> > I’m curious if there is any reason for choosing Iceberg instead of Paimon
>
> No technical reason that I'm aware of. We are using it mostly because of momentum. We looked at Flink Table Store (before it was Paimon), but decided it was too early and the docs were too sparse at the time to really consider it.
>
> > Especially for a use case like CDC, which Iceberg struggles to support.
>
> We aren't doing any CDC right now (for many reasons), but I have never seen a feature like Paimon's database sync before. One job to sync and evolve an entire database? That is amazing.
>
> If we could do this with Iceberg, we might be able to make an argument to product managers to push for CDC.
>
> On Fri, May 24, 2024 at 8:36 AM Giannis Polyzos <ipolyzos...@gmail.com> wrote:
>
>> I’m curious if there is any reason for choosing Iceberg instead of Paimon (other than Iceberg being more popular). Especially for a use case like CDC, which Iceberg struggles to support.
>>
>> On Fri, 24 May 2024 at 3:22 PM, Andrew Otto <o...@wikimedia.org> wrote:
>>
>>> Interesting, thank you!
>>>
>>> I asked this in the Paimon users group: how coupled to Paimon catalogs and tables is the CDC part of Paimon? RichCdcMultiplexRecord <https://github.com/apache/paimon/blob/cc7d308d166a945d8d498231ed8e2fc9c7a27fc5/paimon-flink/paimon-flink-cdc/src/main/java/org/apache/paimon/flink/sink/cdc/RichCdcMultiplexRecord.java> and related code seem incredibly useful even outside of the context of the Paimon table format.
>>>
>>> I'm asking because the database sync action <https://paimon.apache.org/docs/master/flink/cdc-ingestion/mysql-cdc/#synchronizing-databases> feature is amazing. At the Wikimedia Foundation we are on an all-in journey with Iceberg, and I'm wondering how hard it would be to extract the CDC logic from Paimon and abstract the sink bits.
>>>
>>> Could the table/database sync with schema evolution (without Flink job restarts!) potentially work with the Iceberg sink?
>>>
>>> On Thu, May 23, 2024 at 4:34 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>
>>>> If I understand correctly, Paimon is sending `CdcRecord`s [1] on the wire, which contain not only the data but the schema as well. With Iceberg we currently only send the row data and expect to receive the schema on job start - this is more performant than sending the schema all the time, but it has the obvious limitation that it cannot handle schema changes. Another part of dynamic schema synchronization is updating the Iceberg table schema itself: the schema should be updated for all of the writers and the committer, but only a single schema-change commit is needed (allowed) on the Iceberg table.
>>>>
>>>> This is a very interesting, but non-trivial change.
>>>>
>>>> [1] https://github.com/apache/paimon/blob/master/paimon-flink/paimon-flink-cdc/src/main/java/org/apache/paimon/flink/sink/cdc/CdcRecord.java
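To make that contrast concrete, the two wire formats differ roughly as below. This is a minimal Java sketch with simplified stand-ins; these are not Paimon's actual CdcRecord or Iceberg's actual RowData classes.

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class WireFormats {

        enum RowKind { INSERT, UPDATE_AFTER, DELETE }

        // Paimon-style: every record carries its field names, so a writer can
        // notice a column it has never seen before and apply schema evolution
        // without being redeployed.
        static class SchemafulCdcRecord {
            final RowKind kind;
            final Map<String, String> fields; // field name -> serialized value

            SchemafulCdcRecord(RowKind kind, Map<String, String> fields) {
                this.kind = kind;
                this.fields = fields;
            }
        }

        // Iceberg-style: positional values only; the meaning of position i is
        // fixed by the schema read at job start, so a newly added column has
        // no way to be represented until the job restarts.
        static class PositionalRow {
            final RowKind kind;
            final Object[] values; // index i maps to schema field i

            PositionalRow(RowKind kind, Object[] values) {
                this.kind = kind;
                this.values = values;
            }
        }

        public static void main(String[] args) {
            Map<String, String> fields = new LinkedHashMap<>();
            fields.put("id", "42");
            fields.put("name", "flink");
            fields.put("new_col", "x"); // column added upstream: self-describing
            SchemafulCdcRecord paimonish = new SchemafulCdcRecord(RowKind.INSERT, fields);

            // The same row positionally: there is simply no slot for new_col
            // until the schema is updated and the job restarted.
            PositionalRow icebergish = new PositionalRow(RowKind.INSERT, new Object[] {42L, "flink"});

            System.out.println(paimonish.fields.keySet() + " vs " + icebergish.values.length + " positional values");
        }
    }

Sending the names (or a schema id) with every record is the price Paimon pays for being able to evolve mid-stream; the positional format is cheaper on the wire but frozen at job start, which is exactly the trade-off Péter describes.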
>>>> On Thu, May 23, 2024 at 21:59, Andrew Otto <o...@wikimedia.org> wrote:
>>>>
>>>>> Ah I see, so just auto-restarting to pick up new stuff.
>>>>>
>>>>> I'd love to understand how Paimon does this. They have a database sync action <https://paimon.apache.org/docs/master/flink/cdc-ingestion/mysql-cdc/#synchronizing-databases> which will sync entire databases, handle schema evolution, and, I'm pretty sure (I think I saw this in my local test), also pick up new tables.
>>>>>
>>>>> https://github.com/apache/paimon/blob/master/paimon-flink/paimon-flink-cdc/src/main/java/org/apache/paimon/flink/action/cdc/SyncDatabaseActionBase.java#L45
>>>>>
>>>>> I'm sure the Paimon table format is great, but at the Wikimedia Foundation we are on the Iceberg train. Imagine if there were a flink-cdc full-database sync to the Flink IcebergSink!
>>>>>
>>>>> On Thu, May 23, 2024 at 3:47 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>>
>>>>>> I will ask Marton about the slides.
>>>>>>
>>>>>> The solution was something like this in a nutshell:
>>>>>> - Make sure that on job start the latest Iceberg schema is read from the Iceberg table
>>>>>> - Throw a SuppressRestartsException when data arrives with the wrong schema
>>>>>> - Use the Flink Kubernetes Operator to restart your failed jobs by setting kubernetes.operator.job.restart.failed
>>>>>>
>>>>>> Thanks, Peter
>>>>>>
>>>>>> On Thu, May 23, 2024, 20:29 Andrew Otto <o...@wikimedia.org> wrote:
>>>>>>
>>>>>>> Wow, I would LOVE to see this talk. If there is no recording, perhaps there are slides somewhere?
>>>>>>>
>>>>>>> On Thu, May 23, 2024 at 11:00 AM Carlos Sanabria Miranda <sanabria.miranda.car...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi everyone!
>>>>>>>>
>>>>>>>> I have found the following presentation on the Flink Forward website: "Self-service ingestion pipelines with evolving schema via Flink and Iceberg <https://www.flink-forward.org/seattle-2023/agenda#self-service-ingestion-pipelines-with-evolving-schema-via-flink-and-iceberg>" by Márton Balassi, from the 2023 conference in Seattle, but I cannot find the recording anywhere. I have found the recordings of the other presentations on the Ververica Academy website <https://www.ververica.academy/app>, but not this one.
>>>>>>>>
>>>>>>>> Does anyone know where I can find it? Or at least the slides?
>>>>>>>>
>>>>>>>> We are using Flink with the Iceberg sink connector to write streaming events to Iceberg tables, and we are researching how to handle schema evolution properly. I saw that presentation and thought it could be of great help to us.
>>>>>>>>
>>>>>>>> Thanks in advance!
>>>>>>>>
>>>>>>>> Carlos
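Coming back to Péter's restart-based recipe above: in code it comes down to roughly the following sketch. The RowWithSchema envelope and the fingerprint check are assumed, illustrative names; SuppressRestartsException, TableLoader, and the kubernetes.operator.job.restart.failed setting are the real pieces.

    import org.apache.flink.api.common.functions.RichMapFunction;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.runtime.execution.SuppressRestartsException;
    import org.apache.iceberg.Schema;
    import org.apache.iceberg.flink.TableLoader;

    // Hypothetical envelope: a row plus a fingerprint of the schema it was
    // produced with (e.g. a hash of the upstream DDL). Not a real Flink type.
    class RowWithSchema {
        final Object row;
        final String schemaFingerprint;

        RowWithSchema(Object row, String schemaFingerprint) {
            this.row = row;
            this.schemaFingerprint = schemaFingerprint;
        }
    }

    // Terminally fails the job when a record does not match the schema that
    // was current at job start. With the Flink Kubernetes Operator running
    // with kubernetes.operator.job.restart.failed: true, the failed job is
    // restarted and re-reads the (by now updated) Iceberg schema in open().
    public class SchemaGuard extends RichMapFunction<RowWithSchema, RowWithSchema> {

        private final String tablePath;
        private transient String expectedFingerprint;

        public SchemaGuard(String tablePath) {
            this.tablePath = tablePath;
        }

        @Override
        public void open(Configuration parameters) throws Exception {
            // Step 1 of the recipe: read the latest Iceberg schema on job start.
            try (TableLoader loader = TableLoader.fromHadoopTable(tablePath)) {
                loader.open();
                Schema schema = loader.loadTable().schema();
                // Illustrative fingerprint; any stable schema digest would do.
                expectedFingerprint = Integer.toHexString(schema.toString().hashCode());
            }
        }

        @Override
        public RowWithSchema map(RowWithSchema value) {
            // Step 2: wrong-schema data fails the job, bypassing Flink's
            // restart strategy so the failure reaches the operator.
            if (!expectedFingerprint.equals(value.schemaFingerprint)) {
                throw new SuppressRestartsException(new IllegalStateException(
                        "Schema changed upstream; failing job for operator restart"));
            }
            return value;
        }
    }

Step 3 is pure operator configuration on the FlinkDeployment, so nothing in the job code itself needs to know about Kubernetes.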