Re: "Self-service ingestion pipelines with evolving schema via Flink and Iceberg" presentation recording from Flink Forward Seattle 2023

2024-05-28 Thread Andrew Otto
> Flink CDC [1] now provides full-DB sync and schema evolution ability as a
pipeline job.

Ah!  That is very cool.


> Iceberg sink support was suggested before, and we’re trying to implement
this in the next few releases. Does it cover the use-cases you mentioned?


Yes!  That would be fantastic.




Re: "Self-service ingestion pipelines with evolving schema via Flink and Iceberg" presentation recording from Flink Forward Seattle 2023

2024-05-26 Thread Hang Ruan
Hi, all.

Flink CDC provides schema evolution while syncing an entire database, so I
think it could satisfy your needs.
The available Flink CDC pipeline sources and sinks are listed in [1]; an
Iceberg pipeline connector is not provided yet.

> What is not is the automatic syncing of entire databases, with schema
evolution and detection of new (and dropped?) tables. :)

Flink CDC is able to sync the entire database with schema evolution. If a
new table is added to the database, the running pipeline job cannot sync
it automatically.
But we can enable 'scan.newly-added-table.enabled' and restart the job
from a savepoint to pick up the new tables.
This feature for the MySQL pipeline connector has not been released yet,
but the PR [2] has been provided.
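
For illustration, this is roughly how that option appears in the source
block of a pipeline definition (a minimal sketch; hostname, credentials
and database name are placeholders):

    source:
      type: mysql
      hostname: mysql.example.internal
      port: 3306
      username: cdc_user
      password: "****"
      tables: app_db.\.*
      server-id: 5400-5404
      # pick up tables created after the job first started; as noted
      # above, this still requires restarting the job from a savepoint
      scan.newly-added-table.enabled: true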

Best,
Hang

[1]
https://nightlies.apache.org/flink/flink-cdc-docs-release-3.1/docs/connectors/pipeline-connectors/overview/
[2] https://github.com/apache/flink-cdc/pull/3347


Re: "Self-service ingestion pipelines with evolving schema via Flink and Iceberg" presentation recording from Flink Forward Seattle 2023

2024-05-26 Thread gongzhongqiang
Flink CDC 3.0 focuses on data integration scenarios, so you don't need to
pay attention to the framework implementation: you just describe the data
source and target in the YAML format to quickly build a data
synchronization task with schema evolution. It also supports a rich set
of sources and sinks.

So I think adding a pipeline sink connector for Iceberg would be the
better way to go. We can rely on the Flink CDC 3.0 framework to connect
source and target easily.
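
To make that concrete, a pipeline definition could look like the sketch
below, submitted as a single job with bin/flink-cdc.sh. The source and
pipeline blocks follow the documented Flink CDC 3.0 YAML shape; the
'iceberg' sink type and its options are hypothetical, since that
connector does not exist yet:

    source:
      type: mysql
      hostname: mysql.example.internal   # placeholder
      port: 3306
      username: cdc_user                 # placeholder
      password: "****"
      tables: app_db.\.*

    sink:
      type: iceberg                      # hypothetical connector type
      catalog.type: hive                 # hypothetical option
      warehouse: s3://warehouse/         # placeholder

    pipeline:
      name: Sync app_db to Iceberg
      parallelism: 2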

Best,
Zhongqiang Gong


Re: "Self-service ingestion pipelines with evolving schema via Flink and Iceberg" presentation recording from Flink Forward Seattle 2023

2024-05-26 Thread Xiqian YU
Hi Otto,

Flink CDC [1] now provides full-DB sync and schema evolution ability as a 
pipeline job. Iceberg sink support was suggested before [2], and we’re trying 
to implement this in the next few releases. Does it cover the use-cases you 
mentioned?

[1] https://nightlies.apache.org/flink/flink-cdc-docs-stable/
[2] https://issues.apache.org/jira/browse/FLINK-34840

Regards,
Xiqian



Re: "Self-service ingestion pipelines with evolving schema via Flink and Iceberg" presentation recording from Flink Forward Seattle 2023

2024-05-25 Thread Péter Váry
> Could the table/database sync with schema evolution (without Flink job
restarts!) potentially work with the Iceberg sink?

Making this work would be a good addition to the Iceberg-Flink connector.
It is definitely doable, but not a single-PR-sized task. If you want to try
your hand at it, I will try to find time to review your plans/code, so
your code could be incorporated into the upcoming releases.

Thanks,
Peter




Re: "Self-service ingestion pipelines with evolving schema via Flink and Iceberg" presentation recording from Flink Forward Seattle 2023

2024-05-24 Thread Andrew Otto
> What is not is the automatic syncing of entire databases, with schema
evolution and detection of new (and dropped?) tables. :)
Wait.  Is it?

> Flink CDC supports synchronizing all tables of source database instance
to downstream in one job by configuring the captured database list and
table list.



Re: "Self-service ingestion pipelines with evolving schema via Flink and Iceberg" presentation recording from Flink Forward Seattle 2023

2024-05-24 Thread Andrew Otto
Indeed, using Flink-CDC to write to Flink Sink Tables, including Iceberg,
is supported.

What is not is the automatic syncing of entire databases, with schema
evolution and detection of new (and dropped?) tables.  :)





Re: "Self-service ingestion pipelines with evolving schema via Flink and Iceberg" presentation recording from Flink Forward Seattle 2023

2024-05-24 Thread Giannis Polyzos
https://nightlies.apache.org/flink/flink-cdc-docs-stable/
All these features come from Flink CDC itself. Because Paimon and Flink CDC
are projects native to Flink, there is strong integration between them.
(I believe it’s on the roadmap to support Iceberg as well.)


Re: "Self-service ingestion pipelines with evolving schema via Flink and Iceberg" presentation recording from Flink Forward Seattle 2023

2024-05-24 Thread Andrew Otto
> I’m curious if there is any reason for choosing Iceberg instead of Paimon

No technical reason that I'm aware of.  We are using it mostly because of
momentum.  We looked at Flink Table Store (before it was Paimon), but
decided it was too early and the docs were too sparse at the time to really
consider it.

> Especially for a use case like CDC that iceberg struggles to support.

We aren't doing any CDC right now (for many reasons), but I have never seen
a feature like Paimon's database sync before.  One job to sync and evolve
an entire database?  That is amazing.

If we could do this with Iceberg, we might be able to make an argument to
product managers to push for CDC.




Re: "Self-service ingestion pipelines with evolving schema via Flink and Iceberg" presentation recording from Flink Forward Seattle 2023

2024-05-24 Thread Giannis Polyzos
I’m curious if there is any reason for choosing Iceberg instead of Paimon
(other than that Iceberg is more popular).
Especially for a use case like CDC, which Iceberg struggles to support.


Re: "Self-service ingestion pipelines with evolving schema via Flink and Iceberg" presentation recording from Flink Forward Seattle 2023

2024-05-24 Thread Andrew Otto
Interesting thank you!

I asked this in the Paimon users group:

How coupled to Paimon catalogs and tables is the CDC part of Paimon?
RichCdcMultiplexRecord
<https://github.com/apache/paimon/blob/cc7d308d166a945d8d498231ed8e2fc9c7a27fc5/paimon-flink/paimon-flink-cdc/src/main/java/org/apache/paimon/flink/sink/cdc/RichCdcMultiplexRecord.java>
and related code seem incredibly useful even outside of the context of the
Paimon table format.

I'm asking because the database sync action
<https://paimon.apache.org/docs/master/flink/cdc-ingestion/mysql-cdc/#synchronizing-databases>
feature is amazing.  At the Wikimedia Foundation, we are on an all-in
journey with Iceberg.  I'm wondering how hard it would be to extract the
CDC logic from Paimon and abstract the Sink bits.

Could the table/database sync with schema evolution (without Flink job
restarts!) potentially work with the Iceberg sink?






Re: "Self-service ingestion pipelines with evolving schema via Flink and Iceberg" presentation recording from Flink Forward Seattle 2023

2024-05-23 Thread Péter Váry
If I understand correctly, Paimon sends `CdcRecord`-s [1] on the wire
which contain not only the data, but the schema as well.
With Iceberg we currently only send the row data, and expect to receive
the schema on job start - this is more performant than sending the schema
all the time, but it has the obvious issue that it cannot handle schema
changes. Another part of the dynamic schema synchronization is the update
of the Iceberg table schema - the schema should be updated for all of the
writers and the committer, but only a single schema-change commit is
needed (allowed) to the Iceberg table.

This is a very interesting, but non-trivial change.

[1]
https://github.com/apache/paimon/blob/master/paimon-flink/paimon-flink-cdc/src/main/java/org/apache/paimon/flink/sink/cdc/CdcRecord.java
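
For illustration, the schema-change commit itself is the simpler half
with Iceberg's public API. A minimal sketch, assuming the table is
already loaded from a catalog and the column stands in for whatever
change was detected; coordinating that update across all the writers and
the committer remains the hard part described above:

    import org.apache.iceberg.Table;
    import org.apache.iceberg.types.Types;

    // Sketch: apply a detected schema change as a single Iceberg
    // metadata commit. Exactly one such commit should be made per change.
    void applySchemaChange(Table table) {
        table.updateSchema()
            .addColumn("new_optional_col", Types.StringType.get())  // placeholder
            .commit();  // one atomic schema-change commit to the table
    }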



Re: "Self-service ingestion pipelines with evolving schema via Flink and Iceberg" presentation recording from Flink Forward Seattle 2023

2024-05-23 Thread Andrew Otto
Ah I see, so just auto-restarting to pick up new stuff.

I'd love to understand how Paimon does this.  They have a database sync
action
<https://paimon.apache.org/docs/master/flink/cdc-ingestion/mysql-cdc/#synchronizing-databases>
which will sync entire databases, handle schema evolution, and I'm pretty
sure (I think I saw this in my local test) also pick up new tables.

https://github.com/apache/paimon/blob/master/paimon-flink/paimon-flink-cdc/src/main/java/org/apache/paimon/flink/action/cdc/SyncDatabaseActionBase.java#L45

I'm sure that Paimon table format is great, but at Wikimedia Foundation we
are on the Iceberg train.  Imagine if there was a flink-cdc full database
sync to Flink IcebergSink!
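
For reference, that action runs as a regular Flink job. A rough sketch
following the Paimon docs (warehouse path, database name and connection
settings are placeholders):

    <FLINK_HOME>/bin/flink run \
        /path/to/paimon-flink-action-<version>.jar \
        mysql_sync_database \
        --warehouse hdfs:///path/to/warehouse \
        --database app_db \
        --mysql_conf hostname=mysql.example.internal \
        --mysql_conf username=cdc_user \
        --mysql_conf password=**** \
        --mysql_conf database-name=app_db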






Re: "Self-service ingestion pipelines with evolving schema via Flink and Iceberg" presentation recording from Flink Forward Seattle 2023

2024-05-23 Thread Péter Váry
I will ask Marton about the slides.

The solution was something like this in a nutshell:
- Make sure that on job start the latest Iceberg schema is read from the
Iceberg table
- Throw a SuppressRestartsException when data arrives with the wrong schema
- Use the Flink Kubernetes Operator to restart your failed jobs by setting
kubernetes.operator.job.restart.failed (a sketch of this follows below)
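
A minimal sketch of that last step as a FlinkDeployment resource (the
name, image and jar path are placeholders; the restart key is the one
named above):

    apiVersion: flink.apache.org/v1beta1
    kind: FlinkDeployment
    metadata:
      name: iceberg-ingestion                            # placeholder
    spec:
      image: flink:1.17                                  # placeholder
      flinkVersion: v1_17
      flinkConfiguration:
        # resubmit the job after the deliberate schema-mismatch failure
        kubernetes.operator.job.restart.failed: "true"
      job:
        jarURI: local:///opt/flink/usrlib/ingestion.jar  # placeholder
        parallelism: 2
        upgradeMode: savepoint   # resume from saved state on restart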

Thanks, Peter



Re: "Self-service ingestion pipelines with evolving schema via Flink and Iceberg" presentation recording from Flink Forward Seattle 2023

2024-05-23 Thread Andrew Otto
Wow, I would LOVE to see this talk.  If there is no recording, perhaps
there are slides somewhere?

On Thu, May 23, 2024 at 11:00 AM Carlos Sanabria Miranda <
sanabria.miranda.car...@gmail.com> wrote:

> Hi everyone!
>
> I have found on the Flink Forward website the following presentation:
> "Self-service ingestion pipelines with evolving schema via Flink and
> Iceberg" by Márton Balassi from the 2023 conference in Seattle, but I
> cannot find the recording anywhere. I have found the recordings of the
> other presentations on the Ververica Academy website, but not this one.
>
> Does anyone know where I can find it? Or at least the slides?
>
> We are using Flink with the Iceberg sink connector to write streaming
> events to Iceberg tables, and we are researching how to handle schema
> evolution properly. I saw that presentation and I thought it could be of
> great help to us.
>
> Thanks in advance!
>
> Carlos
>