Re: "Self-service ingestion pipelines with evolving schema via Flink and Iceberg" presentation recording from Flink Forward Seattle 2023

Péter Váry Thu, 23 May 2024 13:35:23 -0700

If I understand correctly, Paimon is sending `CdcRecord`-s [1] on the wire
which contain not only the data, but the schema as well.
With Iceberg we currently only send the row data, and expect to receive the
schema on job start - this is more performant than sending the schema all
the time, but has the obvious issue that it is not able to handle the
schema changes. Another part of the dynamic schema synchronization is the
update of the Iceberg table schema - the schema should be updated for all
of the writers and the committer / but only a single schema change commit
is needed (allowed) to the Iceberg table.


This is a very interesting, but non-trivial change.

[1]
https://github.com/apache/paimon/blob/master/paimon-flink/paimon-flink-cdc/src/main/java/org/apache/paimon/flink/sink/cdc/CdcRecord.java

Andrew Otto <o...@wikimedia.org> ezt írta (időpont: 2024. máj. 23., Cs,
21:59):

> Ah I see, so just auto-restarting to pick up new stuff.
>
> I'd love to understand how Paimon does this.  They have a database sync
> action
> <https://paimon.apache.org/docs/master/flink/cdc-ingestion/mysql-cdc/#synchronizing-databases>
> which will sync entire databases, handle schema evolution, and I'm pretty
> sure (I think I saw this in my local test) also pick up new tables.
>
>
> https://github.com/apache/paimon/blob/master/paimon-flink/paimon-flink-cdc/src/main/java/org/apache/paimon/flink/action/cdc/SyncDatabaseActionBase.java#L45
>
> I'm sure that Paimon table format is great, but at Wikimedia Foundation we
> are on the Iceberg train.  Imagine if there was a flink-cdc full database
> sync to Flink IcebergSink!
>
>
>
>
> On Thu, May 23, 2024 at 3:47 PM Péter Váry <peter.vary.apa...@gmail.com>
> wrote:
>
>> I will ask Marton about the slides.
>>
>> The solution was something like this in a nutshell:
>> - Make sure that on job start the latest Iceberg schema is read from the
>> Iceberg table
>> - Throw a SuppressRestartsException when data arrives with the wrong
>> schema
>> - Use Flink Kubernetes Operator to restart your failed jobs by setting
>> kubernetes.operator.job.restart.failed
>>
>> Thanks, Peter
>>
>> On Thu, May 23, 2024, 20:29 Andrew Otto <o...@wikimedia.org> wrote:
>>
>>> Wow, I would LOVE to see this talk.  If there is no recording, perhaps
>>> there are slides somewhere?
>>>
>>> On Thu, May 23, 2024 at 11:00 AM Carlos Sanabria Miranda <
>>> sanabria.miranda.car...@gmail.com> wrote:
>>>
>>>> Hi everyone!
>>>>
>>>> I have found in the Flink Forward website the following presentation: 
>>>> "Self-service
>>>> ingestion pipelines with evolving schema via Flink and Iceberg
>>>> <https://www.flink-forward.org/seattle-2023/agenda#self-service-ingestion-pipelines-with-evolving-schema-via-flink-and-iceberg>"
>>>> by Márton Balassi from the 2023 conference in Seattle, but I cannot find
>>>> the recording anywhere. I have found the recordings of the other
>>>> presentations in the Ververica Academy website
>>>> <https://www.ververica.academy/app>, but not this one.
>>>>
>>>> Does anyone know where I can find it? Or at least the slides?
>>>>
>>>> We are using Flink with the Iceberg sink connector to write streaming
>>>> events to Iceberg tables, and we are researching how to handle schema
>>>> evolution properly. I saw that presentation and I thought it could be of
>>>> great help to us.
>>>>
>>>> Thanks in advance!
>>>>
>>>> Carlos
>>>>
>>>

Re: "Self-service ingestion pipelines with evolving schema via Flink and Iceberg" presentation recording from Flink Forward Seattle 2023

Reply via email to