Re: [PROPOSAL] Contribute Flink CDC Connectors project to Apache Flink

2023-12-06 Thread Jing Ge
Awesome! +1

Best regards,
Jing

On Thu, Dec 7, 2023 at 8:34 AM Sergey Nuyanzin  wrote:

> thanks for working on this and driving it
>
> +1
>
> On Thu, Dec 7, 2023 at 7:26 AM Feng Jin  wrote:
>
> > This is incredibly exciting news, a big +1 for this.
> >
> > Thank you for the fantastic work on Flink CDC. We have created thousands
> > of real-time integration jobs using Flink CDC connectors.
> >
> >
> > Best,
> > Feng
> >
> > On Thu, Dec 7, 2023 at 1:45 PM gongzhongqiang wrote:
> >
> > > It's very exciting to hear the news.
> > > +1 for adding CDC Connectors to Apache Flink!
> > >
> > >
> > > Best,
> > > Zhongqiang
> > >
> > > Leonard Xu wrote on Thu, Dec 7, 2023 at 11:25:
> > >
> > > > Dear Flink devs,
> > > >
> > > > As you may have heard, we at Alibaba (Ververica) are planning to donate
> > > > CDC Connectors for Apache Flink [1] to the Apache Flink community.
> > > >
> > > > CDC Connectors for Apache Flink comprise a collection of source
> > > > connectors designed specifically for Apache Flink. These connectors [2]
> > > > enable the ingestion of changes from various databases using Change
> > > > Data Capture (CDC); most of them are powered by Debezium [3]. They
> > > > support both the DataStream API and the Table/SQL API, facilitating the
> > > > reading of database snapshots and continuous reading of transaction
> > > > logs with exactly-once processing, even in the event of failures.
> > > >
> > > > Additionally, in the latest version 3.0, we have introduced many
> > > > long-awaited features. Starting from CDC version 3.0, we've built a
> > > > Streaming ELT Framework for streaming data integration. This framework
> > > > allows users to write their data synchronization logic in a simple YAML
> > > > file, which is automatically translated into a Flink DataStream job. It
> > > > emphasizes optimizing the task submission process and offers advanced
> > > > functionality such as whole-database synchronization, merging of
> > > > sharded tables, and schema evolution [4].
> > > >
> > > > I believe this initiative is a perfect match for both sides. For the
> > > > Flink community, it presents an opportunity to enhance Flink's
> > > > competitive advantage in streaming data integration, promoting the
> > > > healthy growth and prosperity of the Apache Flink ecosystem. For the
> > > > CDC Connectors project, becoming a sub-project of Apache Flink means
> > > > being part of a neutral open-source community, which can attract a more
> > > > diverse pool of contributors.
> > > >
> > > > Please note that the points above represent only some of our
> > > > motivations and our vision for this donation. Specific future
> > > > operations need to be discussed further in this thread. For example,
> > > > the sub-project name after the donation: we hope to name it Flink-CDC,
> > > > aiming at streaming data integration through Apache Flink, following
> > > > the naming convention of Flink-ML. Also, this project is managed by a
> > > > total of 8 maintainers, including 3 Flink PMC members and 1 Flink
> > > > Committer. The remaining 4 maintainers are also highly active
> > > > contributors to the Flink community; donating this project to the Flink
> > > > community implies that their permissions might be reduced, so we may
> > > > need to bring up this topic for further discussion within the Flink
> > > > PMC. Additionally, we need to discuss how to migrate existing users and
> > > > documentation: we have a user group of nearly 10,000 people and a
> > > > multi-version documentation site that need to be migrated. We also need
> > > > to plan for the migration of CI/CD processes and other specifics.
> > > >
> > > > While there are many intricate details that require implementation, we
> > > > are committed to progressing and finalizing this donation process.
> > > >
> > > > Besides being Flink's most active ecosystem project (as measured by
> > > > GitHub metrics), it also boasts a significant user base. However, I
> > > > believe it's essential to commence discussions on future operations
> > > > only after the community reaches a consensus on whether they desire
> > > > this donation.
> > > >
> > > > Really looking forward to hearing what you think!
> > > >
> > > > Best,
> > > > Leonard (on behalf of the Flink CDC Connectors project maintainers)
> > > >
> > > > [1] https://github.com/ververica/flink-cdc-connectors
> > > > [2] https://ververica.github.io/flink-cdc-connectors/master/content/overview/cdc-connectors.html
> > > > [3] https://debezium.io
> > > > [4] https://ververica.github.io/flink-cdc-connectors/master/content/overview/cdc-pipeline.html
>
> --
> Best regards,
> Sergey
>
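
For concreteness, a pipeline definition in the Streaming ELT Framework
mentioned in the proposal above might look roughly like the following YAML.
This is a sketch based only on the proposal's description; the exact keys,
connector names, endpoints, and credentials are illustrative assumptions
rather than the framework's authoritative syntax.

    # Sketch: synchronize a whole MySQL database into Doris with one YAML file.
    # Hostnames, ports, credentials, and the table pattern are placeholders.
    source:
      type: mysql
      hostname: localhost
      port: 3306
      username: root
      password: "123456"
      tables: app_db.\.*        # regex matching every table in app_db

    sink:
      type: doris
      fenodes: 127.0.0.1:8030
      username: root
      password: ""

    pipeline:
      name: Sync app_db to Doris
      parallelism: 2

The whole file is submitted at once; per the proposal, the framework
translates it into a single Flink DataStream job, which is what enables
whole-database synchronization and schema evolution without hand-written
connector code.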


Re: [PROPOSAL] Contribute Flink CDC Connectors project to Apache Flink

2023-12-06 Thread Sergey Nuyanzin
thanks for working on this and driving it

+1

On Thu, Dec 7, 2023 at 7:26 AM Feng Jin  wrote:

> This is incredibly exciting news, a big +1 for this.
>
> Thank you for the fantastic work on Flink CDC. We have created thousands of
> real-time integration jobs using Flink CDC connectors.
>
>
> Best,
> Feng
>
> On Thu, Dec 7, 2023 at 1:45 PM gongzhongqiang wrote:
>
> > [...]
>


-- 
Best regards,
Sergey


SQL return type change from 1.17 to 1.18

2023-12-06 Thread Péter Váry
Hi Team,

We are working on upgrading the Iceberg-Flink connector from 1.17 to 1.18
and found that some of our tests are failing. Prabhu Joseph created a jira
[1] to discuss this issue, along with a short code example.

In a nutshell (a minimal sketch follows below):
- Create a table with an 'ARRAY' column
- Run a select which returns this column
- The return type changes:
  - from 'Object[]' in 1.17
  - to 'int[]' in 1.18
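
A minimal sketch of the repro (the jira [1] contains the actual example; the
literal-array query and the NOT NULL element type below are my assumptions,
for illustration only):

    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.TableEnvironment;
    import org.apache.flink.types.Row;
    import org.apache.flink.util.CloseableIterator;

    public class ArrayReturnTypeRepro {
        public static void main(String[] args) throws Exception {
            TableEnvironment tEnv =
                    TableEnvironment.create(EnvironmentSettings.inStreamingMode());

            // ARRAY[1, 2, 3] is typed as ARRAY<INT NOT NULL>.
            try (CloseableIterator<Row> it =
                    tEnv.executeSql("SELECT ARRAY[1, 2, 3] AS arr").collect()) {
                Object arr = it.next().getField("arr");
                // 1.17: arr is a boxed Object[] (so `(Object[]) arr` succeeds).
                // 1.18: arr is a primitive int[]; the same cast throws
                //       ClassCastException.
                System.out.println(arr.getClass().getName());
            }
        }
    }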

The change is introduced by this jira [2].

While I understand the reasoning behind this change, it will break some
users' existing workflows, as evidenced by Xingcan Cui hitting the same
issue independently [3].

What is the opinion of the community about this change?
- Do we want to revert the change?
- Do we ask the owners of the change to make this behavior configurable?
- Do we accept this behavior change in a minor release?

Thanks,
Peter

[1] https://issues.apache.org/jira/browse/FLINK-33523 - DataType ARRAY fails to cast into Object[]
[2] https://issues.apache.org/jira/browse/FLINK-31835 - DataTypeHint don't support Row>
[3] https://issues.apache.org/jira/browse/FLINK-33547 - SQL primitive array type after upgrading to Flink 1.18.0


Re: [PROPOSAL] Contribute Flink CDC Connectors project to Apache Flink

2023-12-06 Thread Feng Jin
This is incredibly exciting news, a big +1 for this.

Thank you for the fantastic work on Flink CDC. We have created thousands of
real-time integration jobs using Flink CDC connectors.


Best,
Feng

On Thu, Dec 7, 2023 at 1:45 PM gongzhongqiang 
wrote:

> It's very exciting to hear the news.
> +1 for adding CDC Connectors to Apache Flink!
>
>
> Best,
> Zhongqiang
>
> Leonard Xu wrote on Thu, Dec 7, 2023 at 11:25:
>
> > [...]


Re: [DISCUSS] FLIP-400: AsyncScalarFunction for asynchronous scalar function support

2023-12-06 Thread Alan Sheinberg
>
> Nicely written and makes sense.  The only feedback I have is around the
> naming of the generalization, e.g. "Specifically, PythonCalcSplitRuleBase
> will be generalized into RemoteCalcSplitRuleBase."  This naming seems to
> imply/suggest that all Async functions are remote.  I wonder if we can find
> another name which doesn't carry that connotation; maybe
> AsyncCalcSplitRuleBase.  (An AsyncCalcSplitRuleBase which handles Python
> and Async functions seems reasonable.)
>
Thanks, that's fair. I agree that "Remote" isn't always accurate. I believe
the Python calls are also done asynchronously, so "Async" might be a
reasonable name, so long as there's no confusion between the base class and
the async child class.
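
For readers skimming the thread, here is a rough sketch of what such a UDF
could look like under the FLIP's proposed API, i.e. an eval method that
completes a CompletableFuture. The class name, the executor setup, and the
fake remote call are illustrative assumptions; see FLIP-400 [1] below for
the actual interface.

    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    import org.apache.flink.table.functions.AsyncScalarFunction;
    import org.apache.flink.table.functions.FunctionContext;

    // Sketch of an asynchronous scalar UDF wrapping a high-latency lookup.
    public class RemoteLookupFunction extends AsyncScalarFunction {

        private transient ExecutorService executor;

        @Override
        public void open(FunctionContext context) {
            executor = Executors.newFixedThreadPool(4);
        }

        // Invoked like a scalar eval, but the result is delivered by
        // completing the future, off the main task thread.
        public void eval(CompletableFuture<String> result, String key) {
            executor.execute(() -> {
                try {
                    result.complete(lookupExternally(key));
                } catch (Exception e) {
                    result.completeExceptionally(e);
                }
            });
        }

        // Stand-in for an HTTP/RPC call to an external service.
        private String lookupExternally(String key) {
            return "value-for-" + key;
        }

        @Override
        public void close() {
            executor.shutdown();
        }
    }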

On Wed, Dec 6, 2023 at 3:48 PM Jim Hughes 
wrote:

> Hi Alan,
>
> [...]
>
> Cheers,
>
> Jim
>
> On Wed, Dec 6, 2023 at 5:45 PM Alan Sheinberg
>  wrote:
>
> > I'd like to start a discussion of FLIP-400: AsyncScalarFunction for
> > asynchronous scalar function support [1]
> >
> > This feature proposes adding a new UDF type AsyncScalarFunction which is
> > invoked just like a normal ScalarFunction, but is implemented with an
> > asynchronous eval method.  I had brought this up including the motivation
> > in a previous discussion thread [2].
> >
> > The purpose is to achieve high throughput scalar function UDFs while
> > allowing that an individual call may have high latency.  It allows
> scaling
> > up the parallelism of just these calls without having to increase the
> > parallelism of the whole query (which could be rather resource
> > inefficient).
> >
> > In practice, it should enable SQL integration with external services and
> > systems, which Flink has limited support for at the moment. It should
> also
> > allow easier integration with existing libraries which use asynchronous
> > APIs.
> >
> > Looking forward to your feedback and suggestions.
> >
> > [1]
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-400%3A+AsyncScalarFunction+for+asynchronous+scalar+function+support
> >
> > [2] https://lists.apache.org/thread/bn153gmcobr41x2nwgodvmltlk810hzs
> >
> > Thanks,
> > Alan
> >
>


Re: [PROPOSAL] Contribute Flink CDC Connectors project to Apache Flink

2023-12-06 Thread gongzhongqiang
It's very exciting to hear the news.
+1 for adding CDC Connectors to Apache Flink!


Best,
Zhongqiang

Leonard Xu wrote on Thu, Dec 7, 2023 at 11:25:

> [...]


Re: [PROPOSAL] Contribute Flink CDC Connectors project to Apache Flink

2023-12-06 Thread Jiabao Sun
Very excited to hear the news, 
+1 for this proposal.

Best,
Jiabao


> On Dec 7, 2023, at 13:08, Rui Fan <1996fan...@gmail.com> wrote:
> 
> It's cool!
> 
> Thanks for the great work of CDC,
> +1 for this donation.
> 
> Best,
> Rui
> 
> On Thu, Dec 7, 2023 at 12:28 PM Hongshun Wang wrote:
>
>> [...]

Re: [PROPOSAL] Contribute Flink CDC Connectors project to Apache Flink

2023-12-06 Thread Rui Fan
It's cool!

Thanks for the great work of CDC,
+1 for this donation.

Best,
Rui

On Thu, Dec 7, 2023 at 12:28 PM Hongshun Wang 
wrote:

> So cool, Big +1 for this exciting work.
>
> Best
> Hongshun
>
> On Thu, Dec 7, 2023 at 12:20 PM Qingsheng Ren wrote:
>
> > [...]

[DISCUSS] FLIP-399: Flink Connector Doris

2023-12-06 Thread wudi


Hi all,

As discussed in the previous email thread [1], we would like to contribute
the Flink Doris Connector to the Flink community.

Apache Doris [2] is a high-performance, real-time analytical database based
on an MPP architecture. For scenarios where Flink is used for data analysis,
processing, or real-time writing to Doris, the Flink Doris Connector is an
effective tool. At the same time, contributing it to the Flink community
will further expand the Flink connector ecosystem.

So I would like to start an official discussion of FLIP-399: Flink Connector
Doris [3].

Looking forward to comments, feedback, and suggestions from the community on
this proposal.
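
As a rough illustration of the SQL usage the FLIP covers, a Flink SQL job
writing into Doris might look like the sketch below. The option names follow
the connector's documented 'doris' table options, while the endpoints,
credentials, and table names are made-up placeholders:

    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.TableEnvironment;

    // Sketch: generate a few rows and stream them into a Doris table.
    public class DorisSinkSketch {
        public static void main(String[] args) {
            TableEnvironment tEnv =
                    TableEnvironment.create(EnvironmentSettings.inStreamingMode());

            // Bounded demo source.
            tEnv.executeSql(
                    "CREATE TABLE src (id INT, name STRING) WITH ("
                            + " 'connector' = 'datagen', 'number-of-rows' = '10')");

            // Doris sink; 'fenodes' points at the Doris FE HTTP port.
            tEnv.executeSql(
                    "CREATE TABLE doris_sink (id INT, name STRING) WITH ("
                            + " 'connector' = 'doris',"
                            + " 'fenodes' = '127.0.0.1:8030',"
                            + " 'table.identifier' = 'demo_db.demo_table',"
                            + " 'username' = 'root',"
                            + " 'password' = '')");

            tEnv.executeSql("INSERT INTO doris_sink SELECT id, name FROM src");
        }
    }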

[1] https://lists.apache.org/thread/lvh8g9o6qj8bt3oh60q81z0o1cv3nn8p
[2] https://doris.apache.org/docs/dev/get-starting/what-is-apache-doris/
[3] https://cwiki.apache.org/confluence/display/FLINK/FLIP-399%3A+Flink+Connector+Doris


Brs,

di.wu


Re: [PROPOSAL] Contribute Flink CDC Connectors project to Apache Flink

2023-12-06 Thread Hongshun Wang
So cool, Big +1 for this exciting work.

Best
Hongshun

On Thu, Dec 7, 2023 at 12:20 PM Qingsheng Ren  wrote:

> Thanks for kicking this off, Leonard!
>
> As one of the contributors of the CDC project, I'm truly honored to be part
> of the community and so excited to hear the news. CDC project was born from
> and developed together with Apache Flink, and we are so proud to be
> accepted by more and more users around the world.
>
> To put my Flink hat on, I believe having the CDC project as a part of
> Apache Flink will broadly expand our ecosystem and the usage scenarios.
> Both projects will benefit from closer cooperation, so +1 from my side.
>
> Best,
> Qingsheng
>
> On Thu, Dec 7, 2023 at 12:06 PM Samrat Deb wrote:
>
> > [...]

Re: [PROPOSAL] Contribute Flink CDC Connectors project to Apache Flink

2023-12-06 Thread Qingsheng Ren
Thanks for kicking this off, Leonard!

As one of the contributors to the CDC project, I'm truly honored to be part
of the community and so excited to hear the news. The CDC project was born
from and developed together with Apache Flink, and we are so proud that it
has been adopted by more and more users around the world.

To put my Flink hat on, I believe having the CDC project as a part of
Apache Flink will broadly expand our ecosystem and the usage scenarios.
Both projects will benefit from closer cooperation, so +1 from my side.

Best,
Qingsheng

On Thu, Dec 7, 2023 at 12:06 PM Samrat Deb  wrote:

> That's really cool :)
> +1 for the great addition
>
> Bests,
> Samrat
>
> On Thu, 7 Dec 2023 at 9:20 AM, Jingsong Li wrote:
>
>> [...]


Re: [PROPOSAL] Contribute Flink CDC Connectors project to Apache Flink

2023-12-06 Thread Jark Wu
+1 for adding this to Apache Flink!

I think this can further extend the capabilities of Apache Flink, and a lot
of users would be interested in trying it out.

Best,
Jark

On Thu, 7 Dec 2023 at 12:06, Samrat Deb  wrote:

> That's really cool :)
> +1 for the great addition
>
> Bests,
> Samrat
>
> On Thu, 7 Dec 2023 at 9:20 AM, Jingsong Li wrote:
>
>> [...]


Re: [PROPOSAL] Contribute Flink CDC Connectors project to Apache Flink

2023-12-06 Thread Xintong Song
A big +1 for this proposal.

Thanks, Leonard, and the Flink CDC community. While there are many further
details to be discussed, I believe this proposal is aligned with the
long-term interests of both communities.

Best,

Xintong



On Thu, Dec 7, 2023 at 12:06 PM Samrat Deb  wrote:

> That's really cool :)
> +1 for the great addition
>
> Bests,
> Samrat
>
> On Thu, 7 Dec 2023 at 9:20 AM, Jingsong Li wrote:
>
> > [...]


Re:Re: [PROPOSAL] Contribute Flink CDC Connectors project to Apache Flink

2023-12-06 Thread Xuyang
Big +1 for this exciting work.

--

Best!
Xuyang

On 2023-12-07 12:06:07, "Samrat Deb" wrote:
>That's really cool :)
>+1 for the great addition
>
>Bests,
>Samrat
>
>On Thu, 7 Dec 2023 at 9:20 AM, Jingsong Li wrote:
>
>> [...]


Re: [PROPOSAL] Contribute Flink CDC Connectors project to Apache Flink

2023-12-06 Thread Samrat Deb
That's really cool :)
+1 for the great addition

Bests,
Samrat

On Thu, 7 Dec 2023 at 9:20 AM, Jingsong Li  wrote:

> Wow, Cool, Nice
>
> CDC is playing an increasingly important role.
>
> +1
>
> Best,
> Jingsong
>
> On Thu, Dec 7, 2023 at 11:25 AM Leonard Xu wrote:
>
> > [...]


Re: [PROPOSAL] Contribute Flink CDC Connectors project to Apache Flink

2023-12-06 Thread Shengkai Fang
Thanks for all Flink CDC maintainers great work! Big +1.

Best,
Shengkai



Re: [PROPOSAL] Contribute Flink CDC Connectors project to Apache Flink

2023-12-06 Thread tison
This is very cool! +1 from my side.

Best,
tison.



Re: [PROPOSAL] Contribute Flink CDC Connectors project to Apache Flink

2023-12-06 Thread Benchao Li
Thank you, Leonard and all the Flink CDC maintainers.

Big big +1 from me. As a heavy user of both Flink and Flink CDC, I've
already taken them as a whole project.


-- 

Best,
Benchao Li


Re: [PROPOSAL] Contribute Flink CDC Connectors project to Apache Flink

2023-12-06 Thread Jingsong Li
Wow, Cool, Nice

CDC is playing an increasingly important role.

+1

Best,
Jingsong



[PROPOSAL] Contribute Flink CDC Connectors project to Apache Flink

2023-12-06 Thread Leonard Xu
Dear Flink devs,

As you may have heard, we at Alibaba (Ververica) are planning to donate CDC 
Connectors for the Apache Flink project[1] to the Apache Flink community.

CDC Connectors for Apache Flink comprise a collection of source connectors 
designed specifically for Apache Flink. These connectors[2] enable the 
ingestion of changes from various databases using Change Data Capture (CDC), 
most of these CDC connectors are powered by Debezium[3]. They support both the 
DataStream API and the Table/SQL API, facilitating the reading of database 
snapshots and continuous reading of transaction logs with exactly-once 
processing, even in the event of failures.


Additionally, in the latest version 3.0, we have introduced many long-awaited 
features. Starting from CDC version 3.0, we've built a Streaming ELT Framework 
available for streaming data integration. This framework allows users to write 
their data synchronization logic in a simple YAML file, which will 
automatically be translated into a Flink DataStreaming job. It emphasizes 
optimizing the task submission process and offers advanced functionalities such 
as whole database synchronization, merging sharded tables, and schema 
evolution[4].


I believe this initiative is a perfect match for both sides. For the Flink 
community, it presents an opportunity to enhance Flink's competitive advantage 
in streaming data integration, promoting the healthy growth and prosperity of 
the Apache Flink ecosystem. For the CDC Connectors project, becoming a 
sub-project of Apache Flink means being part of a neutral open-source 
community, which can attract a more diverse pool of contributors.

Please note that the aforementioned points represent only some of our 
motivations and vision for this donation. Specific future operations need to be 
further discussed in this thread. For example, the sub-project name after the 
donation: we hope to name it Flink-CDC, aiming at streaming data integration 
through Apache Flink and following the naming convention of Flink-ML. This 
project is managed by a total of 8 maintainers, including 3 Flink PMC members 
and 1 Flink Committer. The remaining 4 maintainers are also highly active 
contributors to the Flink community; donating this project to the Flink 
community implies that their permissions might be reduced. Therefore, we may 
need to bring up this topic for further discussion within the Flink PMC. 
Additionally, we need to discuss how to migrate existing users and 
documentation. We have a user group of nearly 10,000 people and a multi-version 
documentation site that need to be migrated. We also need to plan for the 
migration of CI/CD processes and other specifics.


While there are many intricate details that require implementation, we are 
committed to progressing and finalizing this donation process.


Flink CDC is Flink’s most active ecosystem project (as evaluated by GitHub 
metrics) and also boasts a significant user base. However, I believe it's 
essential to commence discussions on future operations only after the community 
reaches a consensus on whether it desires this donation.


Really looking forward to hearing what you think!


Best,
Leonard (on behalf of the Flink CDC Connectors project maintainers)

[1] https://github.com/ververica/flink-cdc-connectors
[2] 
https://ververica.github.io/flink-cdc-connectors/master/content/overview/cdc-connectors.html
[3] https://debezium.io
[4] 
https://ververica.github.io/flink-cdc-connectors/master/content/overview/cdc-pipeline.html

[DISCUSS] FLIP-398: Improve Serialization Configuration And Usage In Flink

2023-12-06 Thread Yong Fang
Hi devs,

I'd like to start a discussion about FLIP-398: Improve Serialization
Configuration And Usage In Flink [1].

Currently, users can register custom data types and serializers in Flink
jobs through various methods, including registration in code,
configuration, and annotations. These methods lead to difficulties when
upgrading Flink jobs, as well as priority issues between the mechanisms.

In Flink 2.0 we would like to manage job data types and serializers through
configuration. This FLIP will introduce a unified option for data types and
serializers, so that users can configure all custom data types and
POJO/Kryo/custom serializers in one place. In addition, this FLIP will add
more built-in serializers for complex data types such as List and Map, and
optimize the management of Avro serializers.
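
As a concrete illustration of the status quo that this FLIP wants to move
away from, registration in code looks roughly like the sketch below (MyPojo
and MyGenericType are placeholder classes invented for this example):

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SerializerRegistrationExample {

    // Placeholder user types, invented for this sketch.
    public static class MyPojo {
        public int id;
        public String name;
    }

    public static class MyGenericType {
        public long value;
    }

    public static void main(String[] args) {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Today these registrations live in job code, so changing a
        // serializer means recompiling and redeploying the job jar.
        env.getConfig().registerPojoType(MyPojo.class);
        env.getConfig().registerKryoType(MyGenericType.class);

        // The FLIP proposes moving such registrations into a single unified
        // configuration option instead, managed outside the job code.
    }
}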

Looking forward to hearing from you, thanks!

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-398%3A+Improve+Serialization+Configuration+And+Usage+In+Flink

Best,
Fang Yong


[jira] [Created] (FLINK-33768) Support dynamic source parallelism inference for batch jobs

2023-12-06 Thread xingbe (Jira)
xingbe created FLINK-33768:
--

 Summary: Support dynamic source parallelism inference for batch 
jobs
 Key: FLINK-33768
 URL: https://issues.apache.org/jira/browse/FLINK-33768
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Coordination
Affects Versions: 1.19.0
Reporter: xingbe


Support dynamic source parallelism inference for batch jobs.





Re: [DISCUSS] FLIP-400: AsyncScalarFunction for asynchronous scalar function support

2023-12-06 Thread Jim Hughes
Hi Alan,

Nicely written and makes sense.  The only feedback I have is around the
naming of the generalization, e.g. "Specifically, PythonCalcSplitRuleBase
will be generalized into RemoteCalcSplitRuleBase."  This naming seems to
imply/suggest that all Async functions are remote.  I wonder if we can find
another name which doesn't carry that connotation; maybe
AsyncCalcSplitRuleBase.  (An AsyncCalcSplitRuleBase which handles Python
and Async functions seems reasonable.)

Cheers,

Jim



[DISCUSS] FLIP-400: AsyncScalarFunction for asynchronous scalar function support

2023-12-06 Thread Alan Sheinberg
I'd like to start a discussion of FLIP-400: AsyncScalarFunction for
asynchronous scalar function support [1]

This feature proposes adding a new UDF type AsyncScalarFunction which is
invoked just like a normal ScalarFunction, but is implemented with an
asynchronous eval method.  I had brought this up including the motivation
in a previous discussion thread [2].

The purpose is to achieve high-throughput scalar function UDFs while
allowing for individual calls with high latency. It allows scaling up the
parallelism of just these calls without having to increase the parallelism
of the whole query (which could be rather resource inefficient).

In practice, it should enable SQL integration with external services and
systems, which Flink has limited support for at the moment. It should also
allow easier integration with existing libraries which use asynchronous
APIs.
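
To make the proposal concrete, here is a rough sketch of what such a UDF
could look like. AsyncScalarFunction does not exist in Flink yet; the class
name and eval signature below follow the FLIP draft and may change, and the
lookup logic is a placeholder:

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.flink.table.functions.AsyncScalarFunction; // proposed in FLIP-400
import org.apache.flink.table.functions.FunctionContext;

public class EnrichUserFunction extends AsyncScalarFunction {

    private transient ExecutorService executor;

    @Override
    public void open(FunctionContext context) {
        executor = Executors.newFixedThreadPool(10);
    }

    // Instead of returning a value, the async eval completes a future, so
    // many slow external calls can be in flight concurrently.
    public void eval(CompletableFuture<String> result, String userId) {
        executor.submit(() -> result.complete(lookupProfile(userId)));
    }

    // Placeholder for a real remote call (HTTP/RPC/database lookup).
    private String lookupProfile(String userId) {
        return "profile-of-" + userId;
    }

    @Override
    public void close() {
        executor.shutdown();
    }
}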

Looking forward to your feedback and suggestions.

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-400%3A+AsyncScalarFunction+for+asynchronous+scalar+function+support


[2] https://lists.apache.org/thread/bn153gmcobr41x2nwgodvmltlk810hzs


Thanks,
Alan


[jira] [Created] (FLINK-33767) Implement restore tests for TemporalJoin node

2023-12-06 Thread Jim Hughes (Jira)
Jim Hughes created FLINK-33767:
--

 Summary: Implement restore tests for TemporalJoin node
 Key: FLINK-33767
 URL: https://issues.apache.org/jira/browse/FLINK-33767
 Project: Flink
  Issue Type: Sub-task
Reporter: Jim Hughes
Assignee: Jim Hughes








[jira] [Created] (FLINK-33766) Support MockEnvironment in BroadcastOperatorTestHarness

2023-12-06 Thread Koala Lam (Jira)
Koala Lam created FLINK-33766:
-

 Summary: Support MockEnvironment in BroadcastOperatorTestHarness
 Key: FLINK-33766
 URL: https://issues.apache.org/jira/browse/FLINK-33766
 Project: Flink
  Issue Type: Improvement
  Components: Test Infrastructure
Affects Versions: 1.17.1
Reporter: Koala Lam


Unlike KeyedOneInputStreamOperatorTestHarness, BroadcastOperatorTestHarness 
has no constructor that accepts a MockEnvironment.
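
For comparison, the keyed harness already accepts a test environment, along
the lines of the sketch below (constructor signatures vary across Flink
versions, so treat this as illustrative rather than exact):

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.runtime.operators.testutils.MockEnvironment;
import org.apache.flink.runtime.operators.testutils.MockEnvironmentBuilder;
import org.apache.flink.streaming.api.operators.StreamMap;
import org.apache.flink.streaming.util.KeyedOneInputStreamOperatorTestHarness;

public class HarnessWithMockEnvironment {

    public static void main(String[] args) throws Exception {
        // Build a MockEnvironment with default settings.
        MockEnvironment environment = new MockEnvironmentBuilder().build();

        // The keyed harness lets tests supply a custom MockEnvironment ...
        KeyedOneInputStreamOperatorTestHarness<String, String, String> harness =
                new KeyedOneInputStreamOperatorTestHarness<>(
                        new StreamMap<>(String::toUpperCase),
                        value -> value,   // key selector
                        Types.STRING,     // key type
                        environment);
        harness.open();

        // ... while BroadcastOperatorTestHarness currently has no equivalent
        // constructor, which is what this issue asks to add.
        harness.close();
    }
}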





[jira] [Created] (FLINK-33765) Flink SQL to support COLLECTLIST

2023-12-06 Thread Zhenzhong Xu (Jira)
Zhenzhong Xu created FLINK-33765:


 Summary: Flink SQL to support COLLECTLIST
 Key: FLINK-33765
 URL: https://issues.apache.org/jira/browse/FLINK-33765
 Project: Flink
  Issue Type: Improvement
  Components: API / DataSet
Reporter: Zhenzhong Xu


Flink SQL currently supports COLLECT, which returns a multiset. However, 
support for casting from multiset to other types (especially array/list) is 
*very* limited (see 
https://github.com/apache/flink/blob/master/docs/content/docs/dev/table/types.md#casting), 
which creates lots of headaches for ease of use.

Can we support COLLECT_LIST as a built-in system function?





Re: [jira] [Created] (FLINK-33753) ContinuousFileReaderOperator consume records as mini batch

2023-12-06 Thread Prabhu Joseph
Based on your comments, I understand the operator wouldn't be causing the
checkpoint issue. I need to debug further, but the symptoms are not showing
up again for my Cx. I will close this as not a problem. Thanks for the
detailed explanations.


[jira] [Created] (FLINK-33764) Incorporate GC / Heap metrics in autoscaler decisions

2023-12-06 Thread Gyula Fora (Jira)
Gyula Fora created FLINK-33764:
--

 Summary: Incorporate GC / Heap metrics in autoscaler decisions
 Key: FLINK-33764
 URL: https://issues.apache.org/jira/browse/FLINK-33764
 Project: Flink
  Issue Type: New Feature
  Components: Autoscaler, Kubernetes Operator
Reporter: Gyula Fora
Assignee: Gyula Fora


The autoscaler currently doesn't use any GC/heap metrics as part of its 
scaling decisions.

While the long-term goal may be to support vertical scaling (increasing TM 
sizes), this is currently out of scope for the autoscaler.

However, it is very important to detect cases where the throughput of certain 
vertices or the entire pipeline is critically affected by long GC pauses. In 
these cases the current autoscaler logic would wrongly assume a low true 
processing rate and scale the pipeline too high, ramping up costs and causing 
further issues.

Using the improved GC metrics introduced in 
https://issues.apache.org/jira/browse/FLINK-33318, we should measure GC 
pauses, block scaling decisions if the pipeline spends too much time garbage 
collecting, and notify the user that increasing memory is required.
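
A minimal sketch of the kind of guard this could add is shown below; the
class, method, and threshold here are hypothetical illustrations, not actual
autoscaler code:

public final class GcPressureGuard {

    // Hypothetical threshold: fraction of wall-clock time spent in GC
    // pauses above which scaling decisions should be blocked.
    private static final double MAX_GC_TIME_RATIO = 0.3;

    // Returns true when observed GC pressure makes the measured "true
    // processing rate" unreliable, so scaling up would only raise costs.
    public static boolean shouldBlockScaling(double gcTimeRatio) {
        return gcTimeRatio > MAX_GC_TIME_RATIO;
    }

    public static void main(String[] args) {
        double observedGcRatio = 0.45; // e.g. 45% of time spent in GC pauses
        if (shouldBlockScaling(observedGcRatio)) {
            System.out.println(
                    "Scaling blocked: heavy GC detected;"
                            + " consider increasing TaskManager memory.");
        }
    }
}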





Re: Flink Operator - Supporting Recovery from Snapshot

2023-12-06 Thread Gyula Fóra
Hi All!

Based on some continuous feedback and experience, we feel that it may be a
good time to introduce this functionality in a way that doesn't
accidentally affect existing users in an unexpected way.

Please see: https://issues.apache.org/jira/browse/FLINK-33763 for details
and review.

Cheers,
Gyula

On Fri, Feb 10, 2023 at 7:27 PM Kevin Lam 
wrote:

> Hey Yaroslav!
>
> Awesome, good to know that approach works well for you. I think our plan as
> of now is to do the same--delete the current FlinkDeployment when deploying
> from a specific snapshot. It'll be a separate workflow from normal
> deployments to take advantage of the operator otherwise.
>
> Thanks!
>
> On Fri, Feb 10, 2023 at 12:23 PM Yaroslav Tkachenko
>  wrote:
>
> > Hi Kevin!
> >
> > In my case, I automated this workflow by first deleting the current Flink
> > deployment and then creating a new one. So, if the initialSavepointPath
> is
> > different it'll use it for recovery.
> >
> > This approach is indeed irreversible, but so far it's been working well.
> >
> > On Fri, Feb 10, 2023 at 8:17 AM Kevin Lam  >
> > wrote:
> >
> > > Thanks for the response Gyula! Those caveats make sense, and I see,
> > there's
> > > a bit of a complexity to consider if the feature is implemented. I do
> > think
> > > it would be useful, so would also love to hear what others think!
> > >
> > > On Wed, Feb 8, 2023 at 3:47 AM Gyula Fóra 
> wrote:
> > >
> > > > Hi Kevin!
> > > >
> > > > Thanks for starting this discussion.
> > > >
> > > > On a high level what you are proposing is quite simple: if the
> initial
> > > > savepoint path changes we use that for the upgrade.
> > > >
> > > > I see a few caveats here that may be important:
> > > >
> > > >  1. To use a new savepoint/checkpoint path for recovery we have to
> stop
> > > the
> > > > job and delete all HA metadata. This means that this operation may
> not
> > be
> > > > "reversible" in some cases because we lose the checkpoint info with
> the
> > > HA
> > > > metadata (unless we force a savepoint on shutdown).
> > > >  2. This will break the current upgrade/checkpoint ownership model in
> > > which
> > > > the operator controls the checkpoints and ensures that you always get
> > the
> > > > latest (or an error). It will also make the reconciliation logic more
> > > > complex
> > > >  3. This could be a breaking change for current users (if for some
> > reason
> > > > they rely on the current behaviour, which is weird but still true)
> > > >  4. The name initialSavepointPath becomes a bit misleading
> > > >
> > > > I agree that it would be nice to make this easier for the user, but
> the
> > > > question is whether what we gain by this is worth the extra
> complexity.
> > > > I think under normal circumstances the user does not really want to
> > > > suddenly redeploy the job starting from a new state. If that happens
> I
> > > > think it makes sense to create a new deployment resource and it's
> not a
> > > > very big overhead.
> > > >
> > > > Currently, the cases when "manual" recovery is needed are those where
> > > > the operator loses track of the latest checkpoint, mostly due to
> > > > "incorrect" error handling on the Flink side that also deletes the HA
> > > > metadata. I think we should strive to improve and eliminate most of
> > > > these cases (as we have already done for many of these problems).
> > > >
> > > > Would be great to hear what others think about this topic!
> > > >
> > > > Cheers,
> > > > Gyula
> > > >
> > > > On Tue, Feb 7, 2023 at 10:43 PM Kevin Lam
> >  > > >
> > > > wrote:
> > > >
> > > > > Hello,
> > > > >
> > > > > I was reading the Flink Kubernetes Operator documentation and
> noticed
> > > > that
> > > > > if you want to redeploy a Flink job from a specific snapshot, you
> > must
> > > > > follow these manual recovery steps. Are there plans to streamline
> > this
> > > > > process? Deploying from a specific snapshot is a relatively common
> > > > > operation and it'd be nice to not need to delete the
> FlinkDeployment
> > > > >
> > > > > I wonder if the Flink Operator could use the initialSavepointPath
> > > similar
> > > > > to the restartNonce and savepointTriggerNonce parameters, where if
> > > > > initialSavepointPath changes, the deployed job is restored from the
> > > > > specified savepoint. Any thoughts?
> > > > >
> > > > > Thanks!
> > > > >
> > > >
> > >
> >
>


[jira] [Created] (FLINK-33763) Support manual savepoint redeploy for jobs and deployments

2023-12-06 Thread Gyula Fora (Jira)
Gyula Fora created FLINK-33763:
--

 Summary: Support manual savepoint redeploy for jobs and deployments
 Key: FLINK-33763
 URL: https://issues.apache.org/jira/browse/FLINK-33763
 Project: Flink
  Issue Type: New Feature
  Components: Kubernetes Operator
Reporter: Gyula Fora


A common request is to support a streamlined, user-friendly way of redeploying 
from a target savepoint.

Previously this was only possible by deleting the CR and recreating it with 
initialSavepointPath. A big downside of this approach is the loss of 
savepoint/checkpoint history in the status, which some platforms may need, 
resulting in savepoints that are never cleaned up, etc.

We suggest introducing a `savepointRedeployNonce` field in the job spec, 
similar to other action trigger nonces.

If the nonce changes to a new non-null value, the job will be redeployed from 
the path specified in initialSavepointPath (or from empty state if the path is 
empty).





Re: [jira] [Created] (FLINK-33753) ContinuousFileReaderOperator consume records as mini batch

2023-12-06 Thread Darin Amos
Just to confirm, I ran a test this morning using a large file with only a
single file split that took over 10 minutes to process. My checkpoints still
executed as expected every minute with minimal latency (sub-second), though
I ran this with 1.15.4.

What evidence do you have that this operator is causing the timeout?
Typically this operator should handle the checkpoint barrier quickly and
proceed with reading (ignoring alignment). In my early days I suspected a
similar issue, but it turned out to be other checkpoint alignment issues.

Darin


[jira] [Created] (FLINK-33762) Versioned release of flink-connector-shared-utils python scripts

2023-12-06 Thread Peter Vary (Jira)
Peter Vary created FLINK-33762:
--

 Summary: Versioned release of flink-connector-shared-utils python 
scripts
 Key: FLINK-33762
 URL: https://issues.apache.org/jira/browse/FLINK-33762
 Project: Flink
  Issue Type: Sub-task
  Components: API / Python, Connectors / Common
Reporter: Peter Vary


We need a versioned release of the scripts stored in the 
flink-connector-shared-utils/python directory. This will allow even 
incompatible changes to these scripts. Connector developers could then choose 
which version of the scripts they depend on.





Re: [DISCUSS] REST API response parsing throws exception on new fields

2023-12-06 Thread Gyula Fóra
Thanks G, I think this is a reasonable proposal which will increase
compatibility between different Flink clients and versions, such as the SQL
Gateway, CLI, Flink operator, etc.

I don't really see any harm in ignoring unknown JSON fields globally, but
this probably warrants a FLIP and a proper vote.

Cheers,
Gyula



[DISCUSS] REST API response parsing throws exception on new fields

2023-12-06 Thread Gabor Somogyi
Hi All,

Since the possible solution can have an effect on all REST response
deserialization, I would like to ask for opinions.

*Problem statement:*

At the moment Flink does not ignore unknown fields when parsing REST
responses. An example of such a class is JobDetailsInfo, but this applies
to all others. It would be good to add this support to increase
compatibility.

The real-life use case is when the Flink k8s operator wants to handle 2
jobs with 2 different Flink versions, where the newer version has added a
new field to some REST response. In such a case the operator basically has 2
options:
* Use the old Flink version -> an exception is thrown because the new field
arrives but is not expected
* Use the new Flink version -> an exception is thrown because the new field
is expected but never arrives

Hacking around this issue requires quite some ugly code in the
operator.

The mentioned issue is tracked here:
https://issues.apache.org/jira/browse/FLINK-33268

*Proposed solution:*

Ignore all unknown fields during REST response JSON deserialization.
Importantly, strict serialization would stay the same as-is.

Actual object mapper configuration can be found here:
https://github.com/apache/flink/blob/3060ccd49cc8d19634b431dbf0f09ac875d0d422/flink-runtime/src/main/java/org/apache/flink/runtime/rest/util/RestMapperUtils.java#L31-L38
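
For illustration, the Jackson switch involved looks like this; this is a
generic Jackson sketch with a made-up JobInfo class, not Flink's actual
mapper setup:

import com.fasterxml.jackson.databind.DeserializationFeature;
import com.fasterxml.jackson.databind.ObjectMapper;

public class LenientRestParsing {

    // Minimal stand-in for a REST response class such as JobDetailsInfo.
    public static class JobInfo {
        public String name;
    }

    public static void main(String[] args) throws Exception {
        // A field added by a newer Flink version, unknown to this client.
        String json = "{\"name\":\"job-1\",\"fieldFromNewerVersion\":42}";

        ObjectMapper strict = new ObjectMapper();
        strict.enable(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES);
        // strict.readValue(json, JobInfo.class) would throw an
        // UnrecognizedPropertyException here.

        ObjectMapper lenient = new ObjectMapper();
        lenient.disable(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES);
        JobInfo info = lenient.readValue(json, JobInfo.class);
        System.out.println(info.name); // prints "job-1"
    }
}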

Please share your opinion on this topic.

BR,
G


Re: [jira] [Created] (FLINK-33753) ContinuousFileReaderOperator consume records as mini batch

2023-12-06 Thread Prabhu Joseph
Thanks Darin for the details.

Below is the problematic StreamTask, which has not processed the priority
event (checkpoint barrier) from the mailbox and has been stuck there
throughout the checkpoint timeout interval. It is reading and collecting all
the records from the split. I am not sure why the executor was considered
idle during that period, even though I could see that the upstream source
task's checkpoint acknowledgement message was received by the JobManager.
The Flink version is 1.16.

at
org.apache.flink.streaming.api.functions.source.ContinuousFileReaderOperator.readAndCollectRecord(ContinuousFileReaderOperator.java:400)

at
org.apache.flink.streaming.api.functions.source.ContinuousFileReaderOperator.processRecord(ContinuousFileReaderOperator.java:364)

at
org.apache.flink.streaming.api.functions.source.ContinuousFileReaderOperator.lambda$new$0(ContinuousFileReaderOperator.java:240)

at
org.apache.flink.streaming.api.functions.source.ContinuousFileReaderOperator$$Lambda$1199/0x7f3d2b07a858.run(Unknown
Source)

at
org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:50)

at org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:90)

at
org.apache.flink.streaming.runtime.tasks.mailbox.MailboxExecutorImpl.yield(MailboxExecutorImpl.java:86)

at
org.apache.flink.streaming.api.functions.source.ContinuousFileReaderOperator.finish(ContinuousFileReaderOperator.java:466)

at
org.apache.flink.streaming.runtime.tasks.StreamOperatorWrapper$$Lambda$1827/0x7f3d2482c058.run(Unknown
Source)

at
org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:50)

at
org.apache.flink.streaming.runtime.tasks.StreamOperatorWrapper.finishOperator(StreamOperatorWrapper.java:239)

at
org.apache.flink.streaming.runtime.tasks.StreamOperatorWrapper.lambda$deferFinishOperatorToMailbox$3(StreamOperatorWrapper.java:213)

at
org.apache.flink.streaming.runtime.tasks.StreamOperatorWrapper$$Lambda$1825/0x7f3d2482a900.run(Unknown
Source)

at
org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:50)

at org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:90)

at
org.apache.flink.streaming.runtime.tasks.mailbox.MailboxExecutorImpl.tryYield(MailboxExecutorImpl.java:97)

at
org.apache.flink.streaming.runtime.tasks.StreamOperatorWrapper.quiesceTimeServiceAndFinishOperator(StreamOperatorWrapper.java:186)

at
org.apache.flink.streaming.runtime.tasks.StreamOperatorWrapper.finish(StreamOperatorWrapper.java:152)

at
org.apache.flink.streaming.runtime.tasks.RegularOperatorChain.finishOperators(RegularOperatorChain.java:115)

at
org.apache.flink.streaming.runtime.tasks.StreamTask.endData(StreamTask.java:600)

at
org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:559)

at
org.apache.flink.streaming.runtime.tasks.StreamTask$$Lambda$765/0x7f3d326ee058.runDefaultAction(Unknown
Source)

at
org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:231)

at
org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:831)

at
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:780)

at
org.apache.flink.runtime.taskmanager.Task$$Lambda$1637/0x7f3d2a9b1470.run(Unknown
Source)

at
org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:935)

at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:914)

at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:728)

at org.apache.flink.runtime.taskmanager.Task.run(Task.java:550)

at java.lang.Thread.run(java.base@11.0.21/Thread.java:829)

On Wed, Dec 6, 2023 at 8:11 AM Darin Amos 
wrote:

> I apologize, I was a little off with my description, it's been a while
> since I have looked at this code but I have refreshed myself.
>
> The line I referred to earlier was correct though. This operator only
> processes records in a file split while the operator is idle, meaning there
> are no more incoming file splits. After every read it checks if there are
> any incoming file splits before continuing to read from the split. If there
> is indeed a new inbound file split, the loop will exit and it will re-queue
> itself to continue processing records later. You can see that here
> <
> https://github.com/apache/flink/blob/master/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/functions/source/ContinuousFileReaderOperator.java#L361
> >
> .
>
> When the loop is interrupted by a checkpoint barrier, snapshotState(...) is
> called and the reader grabs the state from the provided Format. In the
> normal case the state is simply the split offset (current progress
> indicator), in more complex scenarios you can create your own format class
> and provide whatever serializable state you desire. In my case we store
> additional metadata about the progress of the 

Re:Re: Re: [DISCUSS] FLIP-392: Deprecate the Legacy Group Window Aggregation

2023-12-06 Thread Xuyang
Hi, Shengkai. Thanks for sharing your thoughts. Let me answer the related questions.


> Could you give an example of the pass-through columns? A session window may
> contain multiple rows; which value is selected by the window operator?


The table function makes the entire input row available in the output. Taking 
Flink as an example, the output of the window TVF can also access the input 
row with its original schema. The session window always outputs these multiple 
rows with the attached window_start, window_end and window_time columns.


> What's the behavior if all the data in the window have been removed?


This aligns with the behavior of the existing group window agg: no data is 
output when the window is empty.


> Could you explain in more detail how the window processes CDC? For example,
> what's the behavior if the window only gets DELETE data from the upstream
> operator?


This also aligns with the behavior of the existing group window aggregation.
However, there is a bug: the group window aggregation currently produces
different results when consuming only -D records depending on whether
minibatch is enabled. See FLINK-33760 for more details.


> The subtitle is not correct here.


I have updated the doc to fix it.


> It's better if we can introduce a syntax like the `emit` keyword to set the 
> emit strategy for every window.
I agree with you. But I don't recommend the syntax SESSION(data
[PARTITION BY (keycols, ...)], DESCRIPTOR(timecol), gap, emit='') because it
breaks the syntax defined in FLIP-145. I think using a query hint is a better
idea. Anyway, this work should belong to another FLIP.


> I think more work should be mentioned in the FLIP. What's the behavior if the 
> input schema contains a column named `window_start`?
> In the current design, `window_start` is a reserved keyword in the window TVF 
> syntax, but it is legal in the legacy implementation.


This is a good question. Perhaps we can introduce a specific config (named
"table.window-additional-columns.prefix") that adds a prefix to all the
additional window columns to solve this situation. For example, a user could
set the config to "$", and the additional columns from the window would
become "$window_start", "$window_end" and "$window_time". WDYT?


> In the FLIP, you mention that the FLIP should introduce an option to fall
> back to the legacy behavior. Could you tell us the name of the option?
> BTW, I think we should unify the implementation once the window TVF can do
> all the work that the legacy operator can do, and then there is no need to
> introduce a fallback option.


What about using a config named "table.optimizer.window-rewrite-enabled"?
I agree with you that once all features are aligned and everything is stable
with the window TVF, we should also remove this fallback config. But to be on
the safe side, I think we can observe this rewrite for one or two versions and
allow users to roll back when problems arise.
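Purely as a sketch of the proposed switch (the key below is not an existing Flink option and may change):

    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.TableEnvironment;

    public class WindowRewriteFallbackSketch {
        public static void main(String[] args) {
            TableEnvironment tEnv =
                    TableEnvironment.create(EnvironmentSettings.inStreamingMode());

            // Hypothetical option proposed above; name and default may change.
            // false = keep the legacy group window aggregation operator,
            // true  = rewrite legacy group windows to window TVF aggregation.
            tEnv.getConfig().getConfiguration()
                    .setBoolean("table.optimizer.window-rewrite-enabled", false);
        }
    }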


> If we remove the legacy window operator in the future, how will users
> upgrade their jobs? Do you have any plan to support state migration from
> the legacy window to the window TVF?
IIRC, compatibility of SQL jobs across versions is currently not guaranteed.
Should we add constraints on this part?


Looking forward to your feedback!







--

Best!
Xuyang





At 2023-12-05 12:06:27, "liu ron"  wrote:
>Hi, xuyang
>
>Thanks for starting this FLIP discussion. Currently there are two types of
>window aggregation in Flink SQL, namely the legacy group window aggregation
>and the window TVF aggregation. These two types of window aggregation are
>not fully aligned in behavior, which brings a lot of confusion to users, so
>there is a need to unify and align them. I think the final ideal state
>should be that there is only one window TVF aggregation, which supports
>Tumble, Hop, Cumulate and Session windows, supports consuming CDC data
>streams, and also supports configuring EARLY-FIRE and LATE-FIRE.
>
>This FLIP is a continuation of FLIP-145 and also lets the legacy group
>window aggregation migrate seamlessly to the new window TVF aggregation,
>which is very useful, especially for the support of CDC streams, a pain
>point that users often report. Big +1 for this FLIP.
>
>Best,
>Ron
>
>Xuyang wrote on Tue, Dec 5, 2023 at 11:11:
>
>> Hi, Feng and David.
>>
>>
>> Thank you very much for sharing your thoughts.
>>
>>
>> This FLIP does not include officially exposing these experimental configs
>> to users, so there is no detailed description of this part. However, given
>> that some technical users may have added these experimental configs to
>> actual production jobs, the handling of these configs when using the
>> window TVF syntax has been added to this FLIP.
>>
>>
>> Overall, the behavior of these experimental parameters is no different
>> from before, and I think we should preserve compatibility for users of
>> these experimental configs.
>>
>>
>> Looking forward to your thoughts.
>>
>>
>>
>>
>> --
>>
>> Best!
>> Xuyang
>>
>>
>>
>>
>>
>>
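(For context on the experimental emit options referred to in the quoted reply: to the best of my knowledge these are the undocumented keys below, shown only as an illustration of what technical users have set in production jobs; they are experimental and subject to change.)

    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.TableEnvironment;

    public class ExperimentalEmitConfSketch {
        public static void main(String[] args) {
            TableEnvironment tEnv =
                    TableEnvironment.create(EnvironmentSettings.inStreamingMode());

            // Experimental, undocumented emit options for group window aggregation.
            tEnv.getConfig().getConfiguration()
                    .setString("table.exec.emit.early-fire.enabled", "true");
            tEnv.getConfig().getConfiguration()
                    .setString("table.exec.emit.early-fire.delay", "1 min");
            // Late-fire counterparts exist as well:
            //   table.exec.emit.late-fire.enabled / table.exec.emit.late-fire.delay
        }
    }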