Interesting, thank you!

I asked this in the Paimon users group:

How coupled to Paimon catalogs and tables is the CDC part of Paimon?
RichCdcMultiplexRecord
<https://github.com/apache/paimon/blob/cc7d308d166a945d8d498231ed8e2fc9c7a27fc5/paimon-flink/paimon-flink-cdc/src/main/java/org/apache/paimon/flink/sink/cdc/RichCdcMultiplexRecord.java>
and
related code seem incredibly useful even outside of the context of the
Paimon table format.

I'm asking because the database sync action
<https://paimon.apache.org/docs/master/flink/cdc-ingestion/mysql-cdc/#synchronizing-databases>
feature is amazing.  At the Wikimedia Foundation, we are on an all-in
journey with Iceberg.  I'm wondering how hard it would be to extract the
CDC logic from Paimon and abstract the Sink bits.

Could the table/database sync with schema evolution (without Flink job
restarts!) potentially work with the Iceberg sink?
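The reason RichCdcMultiplexRecord and related classes can support schema
evolution without restarts is that the schema travels with each record,
rather than being fixed at job start.  A tiny, hypothetical Java sketch of
that idea (names and shapes invented for illustration; this is not Paimon's
actual API):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, simplified analogue of a schema-carrying CDC record:
// every record ships its own column->type map, so a sink can detect an
// upstream ALTER TABLE by comparing against the schema it is writing with.
public class SchemaAwareCdcRecord {
    public enum Kind { INSERT, UPDATE, DELETE }

    final Kind kind;
    final Map<String, String> fieldTypes; // column name -> type (the inline schema)
    final Map<String, String> row;        // column name -> serialized value

    SchemaAwareCdcRecord(Kind kind, Map<String, String> fieldTypes, Map<String, String> row) {
        this.kind = kind;
        this.fieldTypes = fieldTypes;
        this.row = row;
    }

    // A sink holding `currentSchema` can decide whether the table schema
    // must be evolved before this record can be written.
    boolean requiresSchemaUpdate(Map<String, String> currentSchema) {
        return !currentSchema.keySet().containsAll(fieldTypes.keySet());
    }

    public static void main(String[] args) {
        Map<String, String> current = new LinkedHashMap<>();
        current.put("id", "BIGINT");

        Map<String, String> incoming = new LinkedHashMap<>(current);
        incoming.put("email", "STRING"); // upstream added a column

        Map<String, String> row = new LinkedHashMap<>();
        row.put("id", "1");
        row.put("email", "x@example.org");

        SchemaAwareCdcRecord rec =
            new SchemaAwareCdcRecord(Kind.INSERT, incoming, row);
        System.out.println(rec.requiresSchemaUpdate(current)); // prints true
    }
}
```

A sink that sees `requiresSchemaUpdate(...) == true` could evolve the table
and its writers in place instead of failing the job; the cost, as noted
below in the thread, is carrying the schema on the wire with every record.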




On Thu, May 23, 2024 at 4:34 PM Péter Váry <peter.vary.apa...@gmail.com>
wrote:

> If I understand correctly, Paimon is sending `CdcRecord`-s [1] on the wire,
> which contain not only the data but the schema as well.
> With Iceberg we currently only send the row data and expect to receive
> the schema on job start - this is more performant than sending the schema
> all the time, but has the obvious issue that it cannot handle schema
> changes. Another part of dynamic schema synchronization is updating the
> Iceberg table schema - the schema should be updated for all of the
> writers and the committer, but only a single schema-change commit is
> needed (allowed) on the Iceberg table.
>
> This is a very interesting, but non-trivial change.
>
> [1]
> https://github.com/apache/paimon/blob/master/paimon-flink/paimon-flink-cdc/src/main/java/org/apache/paimon/flink/sink/cdc/CdcRecord.java
>
> Andrew Otto <o...@wikimedia.org> wrote (on Thu, May 23, 2024, at 21:59):
>
>> Ah I see, so just auto-restarting to pick up new stuff.
>>
>> I'd love to understand how Paimon does this.  They have a database sync
>> action
>> <https://paimon.apache.org/docs/master/flink/cdc-ingestion/mysql-cdc/#synchronizing-databases>
>> which will sync entire databases, handle schema evolution, and, I'm pretty
>> sure (I think I saw this in my local test), also pick up new tables.
>>
>>
>> https://github.com/apache/paimon/blob/master/paimon-flink/paimon-flink-cdc/src/main/java/org/apache/paimon/flink/action/cdc/SyncDatabaseActionBase.java#L45
>>
>> I'm sure the Paimon table format is great, but at the Wikimedia Foundation
>> we are on the Iceberg train.  Imagine if there were a flink-cdc full
>> database sync to the Flink IcebergSink!
>>
>>
>>
>>
>> On Thu, May 23, 2024 at 3:47 PM Péter Váry <peter.vary.apa...@gmail.com>
>> wrote:
>>
>>> I will ask Marton about the slides.
>>>
>>> The solution was something like this in a nutshell:
>>> - Make sure that on job start the latest Iceberg schema is read from the
>>> Iceberg table
>>> - Throw a SuppressRestartsException when data arrives with the wrong
>>> schema
>>> - Use Flink Kubernetes Operator to restart your failed jobs by setting
>>> kubernetes.operator.job.restart.failed
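>>>
>>> As a sketch, that last key is an operator setting that can be supplied
>>> with the deployment's Flink configuration (the exact placement below is
>>> an assumption; check the Flink Kubernetes Operator docs for your
>>> version):
>>>
>>> ```yaml
>>> # Hypothetical FlinkDeployment fragment: ask the Flink Kubernetes
>>> # Operator to restart this job when it fails, e.g. after the
>>> # SuppressRestartsException thrown on a schema mismatch.
>>> spec:
>>>   flinkConfiguration:
>>>     kubernetes.operator.job.restart.failed: "true"
>>> ```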
>>>
>>> Thanks, Peter
>>>
>>> On Thu, May 23, 2024, 20:29 Andrew Otto <o...@wikimedia.org> wrote:
>>>
>>>> Wow, I would LOVE to see this talk.  If there is no recording, perhaps
>>>> there are slides somewhere?
>>>>
>>>> On Thu, May 23, 2024 at 11:00 AM Carlos Sanabria Miranda <
>>>> sanabria.miranda.car...@gmail.com> wrote:
>>>>
>>>>> Hi everyone!
>>>>>
>>>>> I have found in the Flink Forward website the following presentation: 
>>>>> "Self-service
>>>>> ingestion pipelines with evolving schema via Flink and Iceberg
>>>>> <https://www.flink-forward.org/seattle-2023/agenda#self-service-ingestion-pipelines-with-evolving-schema-via-flink-and-iceberg>"
>>>>> by Márton Balassi from the 2023 conference in Seattle, but I cannot find
>>>>> the recording anywhere. I have found the recordings of the other
>>>>> presentations in the Ververica Academy website
>>>>> <https://www.ververica.academy/app>, but not this one.
>>>>>
>>>>> Does anyone know where I can find it? Or at least the slides?
>>>>>
>>>>> We are using Flink with the Iceberg sink connector to write streaming
>>>>> events to Iceberg tables, and we are researching how to handle schema
>>>>> evolution properly. I saw that presentation and I thought it could be of
>>>>> great help to us.
>>>>>
>>>>> Thanks in advance!
>>>>>
>>>>> Carlos
>>>>>
>>>>
