[ANNOUNCE] Apache Flink Kubernetes Operator 1.9.0 released

2024-07-03 Thread Gyula Fóra
The Apache Flink community is very happy to announce the release of Apache
Flink Kubernetes Operator 1.9.0.

The Flink Kubernetes Operator allows users to manage their Apache Flink
applications and their lifecycle through native k8s tooling like kubectl.

Release blogpost:
https://flink.apache.org/2024/07/02/apache-flink-kubernetes-operator-1.9.0-release-announcement/

The release is available for download at:
https://flink.apache.org/downloads.html

Maven artifacts for Flink Kubernetes Operator can be found at:
https://search.maven.org/artifact/org.apache.flink/flink-kubernetes-operator

Official Docker image for Flink Kubernetes Operator can be found at:
https://hub.docker.com/r/apache/flink-kubernetes-operator

The full release notes are available in Jira:
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12354417

We would like to thank all contributors of the Apache Flink community who
made this release possible!

Regards,
Gyula Fora


[RESULT] [VOTE] Apache Flink Kubernetes Operator Release 1.9.0, release candidate #1

2024-07-01 Thread Gyula Fóra
I'm happy to announce that we have unanimously approved this release.

There are 5 approving votes, 4 of which are binding:

* Rui Fan (binding)
* Gyula Fora (binding)
* Robert Metzger (binding)
* Thomas Weise (binding)
* Mate Czagany (non-binding)

There are no disapproving votes.

Thanks everyone!
Gyula


Re: [VOTE] Apache Flink Kubernetes Operator Release 1.9.0, release candidate #1

2024-07-01 Thread Gyula Fóra
Thank you all! I am closing this vote now and will post the results in
another thread.

Thank you

On Fri, Jun 28, 2024 at 7:10 PM Thomas Weise  wrote:

> +1 (binding)
>
> - verified source signatures/hashes
> - built from source; tests pass
> - installed from the packaged helm chart
>
> Thanks,
> Thomas
>
>
> On Fri, Jun 28, 2024 at 5:29 AM Robert Metzger 
> wrote:
>
> > +1 (binding)
> >
> > - checked the docker file contents
> > - installed the operator from the helm chart
> > - checked that it can still talk to an existing Flink cluster deployed
> > from v1.8
> >
> > On Tue, Jun 25, 2024 at 9:05 AM Gyula Fóra  wrote:
> >
> > > +1 (binding)
> > >
> > > Verified:
> > >  - Sources/signatures
> > >  - Install 1.9.0 from helm chart
> > >  - Stateful example job basic interactions
> > >  - Operator upgrade from 1.8.0 -> 1.9.0 with running flinkdeployments
> > >  - Flink-web PR looks good
> > >
> > > Cheers,
> > > Gyula
> > >
> > >
> > > On Wed, Jun 19, 2024 at 12:09 PM Gyula Fóra 
> > wrote:
> > >
> > > > Hi,
> > > >
> > > > I have updated the KEYS file and extended the expiration date so that
> > > > should not be an issue. Thanks for pointing that out.
> > > >
> > > > Gyula
> > > >
> > > > On Wed, 19 Jun 2024 at 12:07, Rui Fan <1996fan...@gmail.com> wrote:
> > > >
> > > >> Thanks Gyula and Mate for driving this release!
> > > >>
> > > >> +1 (binding)
> > > >>
> > > >> Except that the key is expired and I left a couple of comments on
> > > >> the flink-web PR, everything looks fine.
> > > >>
> > > >> - Downloaded artifacts from dist ( svn co
> > > >> https://dist.apache.org/repos/dist/dev/flink/flink-kubernetes-operator-1.9.0-rc1/ )
> > > >> - Verified SHA512 checksums ( for i in *.tgz; do echo $i;
> > > >> sha512sum --check $i.sha512; done )
> > > >> - Verified GPG signatures ( for i in *.tgz; do echo $i;
> > > >> gpg --verify $i.asc $i; done )
> > > >> - Built the source with java-11 and java-17 ( mvn -T 20 clean
> > > >> install -DskipTests )
> > > >> - Verified the license headers while building the source
> > > >> - Verified that the chart version and appVersion match the target
> > > >> release (except the index.yaml and Chart.yaml)
> > > >> - Downloaded the Autoscaler standalone jar: wget
> > > >> https://repository.apache.org/content/repositories/orgapacheflink-1740/org/apache/flink/flink-autoscaler-standalone/1.9.0/flink-autoscaler-standalone-1.9.0.jar
> > > >> - Ran the Autoscaler standalone locally; it works well with the rescale API.
> > > >>
> > > >> Best,
> > > >> Rui
> > > >>
> > > >> On Wed, Jun 19, 2024 at 1:50 AM Mate Czagany 
> > > wrote:
> > > >>
> > > >> > Hi,
> > > >> >
> > > >> > +1 (non-binding)
> > > >> >
> > > >> > Note: when using the Apache Flink KEYS file [1] to verify the
> > > >> > signatures, your key seems to be expired, so that file should be
> > > >> > updated as well.
> > > >> >
> > > >> > - Verified checksums and signatures
> > > >> > - Built source distribution
> > > >> > - Verified all pom.xml versions are the same
> > > >> > - Verified install from RC repo
> > > >> > - Verified Chart.yaml and values.yaml contents
> > > >> > - Submitted basic example with 1.17 and 1.19 Flink versions in
> > > >> > native and standalone mode
> > > >> > - Tested operator HA, added new watched namespace dynamically
> > > >> > - Checked operator logs
> > > >> >
> > > >> > Regards,
> > > >> > Mate
> > > >> >
> > > >> > [1] https://dist.apache.org/repos/dist/release/flink/KEYS
> > > >> >
> > > >> > Gyula Fóra  wrote (on 18 Jun 2024, Tue, 8:14):
> > > >> >
> > > >> > > Hi Everyone,
> > > >> > >

Re: [VOTE] Apache Flink Kubernetes Operator Release 1.9.0, release candidate #1

2024-06-25 Thread Gyula Fóra
+1 (binding)

Verified:
 - Sources/signatures
 - Install 1.9.0 from helm chart
 - Stateful example job basic interactions
 - Operator upgrade from 1.8.0 -> 1.9.0 with running flinkdeployments
 - Flink-web PR looks good

Cheers,
Gyula


On Wed, Jun 19, 2024 at 12:09 PM Gyula Fóra  wrote:

> Hi,
>
> I have updated the KEYS file and extended the expiration date so that
> should not be an issue. Thanks for pointing that out.
>
> Gyula
>
> On Wed, 19 Jun 2024 at 12:07, Rui Fan <1996fan...@gmail.com> wrote:
>
>> Thanks Gyula and Mate for driving this release!
>>
>> +1 (binding)
>>
>> Except that the key is expired and I left a couple of comments on the
>> flink-web PR, everything looks fine.
>>
>> - Downloaded artifacts from dist ( svn co
>> https://dist.apache.org/repos/dist/dev/flink/flink-kubernetes-operator-1.9.0-rc1/ )
>> - Verified SHA512 checksums ( for i in *.tgz; do echo $i; sha512sum
>> --check $i.sha512; done )
>> - Verified GPG signatures ( for i in *.tgz; do echo $i; gpg --verify
>> $i.asc $i; done )
>> - Built the source with java-11 and java-17 ( mvn -T 20 clean install
>> -DskipTests )
>> - Verified the license headers while building the source
>> - Verified that the chart version and appVersion match the target release
>> (except the index.yaml and Chart.yaml)
>> - Downloaded the Autoscaler standalone jar: wget
>> https://repository.apache.org/content/repositories/orgapacheflink-1740/org/apache/flink/flink-autoscaler-standalone/1.9.0/flink-autoscaler-standalone-1.9.0.jar
>> - Ran the Autoscaler standalone locally; it works well with the rescale API.
>>
>> Best,
>> Rui
>>
>> On Wed, Jun 19, 2024 at 1:50 AM Mate Czagany  wrote:
>>
>> > Hi,
>> >
>> > +1 (non-binding)
>> >
>> > Note: when using the Apache Flink KEYS file [1] to verify the
>> > signatures, your key seems to be expired, so that file should be updated
>> > as well.
>> >
>> > - Verified checksums and signatures
>> > - Built source distribution
>> > - Verified all pom.xml versions are the same
>> > - Verified install from RC repo
>> > - Verified Chart.yaml and values.yaml contents
>> > - Submitted basic example with 1.17 and 1.19 Flink versions in native
>> > and standalone mode
>> > - Tested operator HA, added new watched namespace dynamically
>> > - Checked operator logs
>> >
>> > Regards,
>> > Mate
>> >
>> > [1] https://dist.apache.org/repos/dist/release/flink/KEYS
>> >
>> > Gyula Fóra  wrote (on 18 Jun 2024, Tue, 8:14):
>> >
>> > > Hi Everyone,
>> > >
>> > > Please review and vote on the release candidate #1 for the version
>> 1.9.0
>> > of
>> > > Apache Flink Kubernetes Operator,
>> > > as follows:
>> > > [ ] +1, Approve the release
>> > > [ ] -1, Do not approve the release (please provide specific comments)
>> > >
>> > > **Release Overview**
>> > >
>> > > As an overview, the release consists of the following:
>> > > a) Kubernetes Operator canonical source distribution (including the
>> > > Dockerfile), to be deployed to the release repository at
>> dist.apache.org
>> > > b) Kubernetes Operator Helm Chart to be deployed to the release
>> > repository
>> > > at dist.apache.org
>> > > c) Maven artifacts to be deployed to the Maven Central Repository
>> > > d) Docker image to be pushed to dockerhub
>> > >
>> > > **Staging Areas to Review**
>> > >
>> > > The staging areas containing the above mentioned artifacts are as
>> > follows,
>> > > for your review:
>> > > * All artifacts for a,b) can be found in the corresponding dev
>> repository
>> > > at dist.apache.org [1]
>> > > * All artifacts for c) can be found at the Apache Nexus Repository [2]
>> > > * The docker image for d) is staged on github [3]
>> > >
>> > > All artifacts are signed with the key 21F06303B87DAFF1 [4]
>> > >
>> > > Other links for your review:
>> > > * JIRA release notes [5]
>> > > * source code tag "release-1.9.0-rc1" [6]
>> > > * PR to update the website Downloads page to
>> > > include Kubernetes Operator links [7]
>> > >
>> > > **Vote Duration**
>> > >
>> > > The voting time will run for at least 72 hours.
>> > > It is adopted by majority approval, with at least 3 PMC affirmative
>> > > votes.

Re: [VOTE] Apache Flink Kubernetes Operator Release 1.9.0, release candidate #1

2024-06-19 Thread Gyula Fóra
Hi,

I have updated the KEYS file and extended the expiration date so that
should not be an issue. Thanks for pointing that out.

Gyula

On Wed, 19 Jun 2024 at 12:07, Rui Fan <1996fan...@gmail.com> wrote:

> Thanks Gyula and Mate for driving this release!
>
> +1 (binding)
>
> Except that the key is expired and I left a couple of comments on the
> flink-web PR, everything looks fine.
>
> - Downloaded artifacts from dist ( svn co
> https://dist.apache.org/repos/dist/dev/flink/flink-kubernetes-operator-1.9.0-rc1/ )
> - Verified SHA512 checksums ( for i in *.tgz; do echo $i; sha512sum
> --check $i.sha512; done )
> - Verified GPG signatures ( for i in *.tgz; do echo $i; gpg --verify
> $i.asc $i; done )
> - Built the source with java-11 and java-17 ( mvn -T 20 clean install
> -DskipTests )
> - Verified the license headers while building the source
> - Verified that the chart version and appVersion match the target release
> (except the index.yaml and Chart.yaml)
> - Downloaded the Autoscaler standalone jar: wget
> https://repository.apache.org/content/repositories/orgapacheflink-1740/org/apache/flink/flink-autoscaler-standalone/1.9.0/flink-autoscaler-standalone-1.9.0.jar
> - Ran the Autoscaler standalone locally; it works well with the rescale API.
>
> Best,
> Rui
>
> On Wed, Jun 19, 2024 at 1:50 AM Mate Czagany  wrote:
>
> > Hi,
> >
> > +1 (non-binding)
> >
> > Note: when using the Apache Flink KEYS file [1] to verify the signatures,
> > your key seems to be expired, so that file should be updated as well.
> >
> > - Verified checksums and signatures
> > - Built source distribution
> > - Verified all pom.xml versions are the same
> > - Verified install from RC repo
> > - Verified Chart.yaml and values.yaml contents
> > - Submitted basic example with 1.17 and 1.19 Flink versions in native and
> > standalone mode
> > - Tested operator HA, added new watched namespace dynamically
> > - Checked operator logs
> >
> > Regards,
> > Mate
> >
> > [1] https://dist.apache.org/repos/dist/release/flink/KEYS
> >
> > Gyula Fóra  wrote (on 18 Jun 2024, Tue, 8:14):
> >
> > > Hi Everyone,
> > >
> > > Please review and vote on the release candidate #1 for the version
> 1.9.0
> > of
> > > Apache Flink Kubernetes Operator,
> > > as follows:
> > > [ ] +1, Approve the release
> > > [ ] -1, Do not approve the release (please provide specific comments)
> > >
> > > **Release Overview**
> > >
> > > As an overview, the release consists of the following:
> > > a) Kubernetes Operator canonical source distribution (including the
> > > Dockerfile), to be deployed to the release repository at
> dist.apache.org
> > > b) Kubernetes Operator Helm Chart to be deployed to the release
> > repository
> > > at dist.apache.org
> > > c) Maven artifacts to be deployed to the Maven Central Repository
> > > d) Docker image to be pushed to dockerhub
> > >
> > > **Staging Areas to Review**
> > >
> > > The staging areas containing the above mentioned artifacts are as
> > follows,
> > > for your review:
> > > * All artifacts for a,b) can be found in the corresponding dev
> repository
> > > at dist.apache.org [1]
> > > * All artifacts for c) can be found at the Apache Nexus Repository [2]
> > > * The docker image for d) is staged on github [3]
> > >
> > > All artifacts are signed with the key 21F06303B87DAFF1 [4]
> > >
> > > Other links for your review:
> > > * JIRA release notes [5]
> > > * source code tag "release-1.9.0-rc1" [6]
> > > * PR to update the website Downloads page to
> > > include Kubernetes Operator links [7]
> > >
> > > **Vote Duration**
> > >
> > > The voting time will run for at least 72 hours.
> > > It is adopted by majority approval, with at least 3 PMC affirmative
> > votes.
> > >
> > > **Note on Verification**
> > >
> > > You can follow the basic verification guide here [8].
> > > Note that you don't need to verify everything yourself, but please make
> > > note of what you have tested together with your +1/-1 vote.
> > >
> > > Cheers!
> > > Gyula Fora
> > >
> > > [1]
> > >
> > >
> >
> https://dist.apache.org/repos/dist/dev/flink/flink-kubernetes-operator-1.9.0-rc1/
> > > [2]
> > >
> https://repository.apache.org/content/repositories/orgapacheflink-1740/
> > > [3]  ghcr.io/apache/flink-kubernetes-operator:17129ff
> > > [4] https://dist.apache.org/repos/dist/release/flink/KEYS
> > > [5]
> > >
> > >
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12354417
> > > [6]
> > >
> >
> https://github.com/apache/flink-kubernetes-operator/tree/release-1.9.0-rc1
> > > [7] https://github.com/apache/flink-web/pull/747
> > > [8]
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/Verifying+a+Flink+Kubernetes+Operator+Release
> > >
> >
>


[VOTE] Apache Flink Kubernetes Operator Release 1.9.0, release candidate #1

2024-06-18 Thread Gyula Fóra
Hi Everyone,

Please review and vote on the release candidate #1 for the version 1.9.0 of
Apache Flink Kubernetes Operator,
as follows:
[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific comments)

**Release Overview**

As an overview, the release consists of the following:
a) Kubernetes Operator canonical source distribution (including the
Dockerfile), to be deployed to the release repository at dist.apache.org
b) Kubernetes Operator Helm Chart to be deployed to the release repository
at dist.apache.org
c) Maven artifacts to be deployed to the Maven Central Repository
d) Docker image to be pushed to dockerhub

**Staging Areas to Review**

The staging areas containing the above mentioned artifacts are as follows,
for your review:
* All artifacts for a,b) can be found in the corresponding dev repository
at dist.apache.org [1]
* All artifacts for c) can be found at the Apache Nexus Repository [2]
* The docker image for d) is staged on github [3]

All artifacts are signed with the key 21F06303B87DAFF1 [4]

Other links for your review:
* JIRA release notes [5]
* source code tag "release-1.9.0-rc1" [6]
* PR to update the website Downloads page to
include Kubernetes Operator links [7]

**Vote Duration**

The voting time will run for at least 72 hours.
It is adopted by majority approval, with at least 3 PMC affirmative votes.

**Note on Verification**

You can follow the basic verification guide here [8].
Note that you don't need to verify everything yourself, but please make
note of what you have tested together with your +1/-1 vote.

Cheers!
Gyula Fora

[1]
https://dist.apache.org/repos/dist/dev/flink/flink-kubernetes-operator-1.9.0-rc1/
[2] https://repository.apache.org/content/repositories/orgapacheflink-1740/
[3]  ghcr.io/apache/flink-kubernetes-operator:17129ff
[4] https://dist.apache.org/repos/dist/release/flink/KEYS
[5]
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12354417
[6]
https://github.com/apache/flink-kubernetes-operator/tree/release-1.9.0-rc1
[7] https://github.com/apache/flink-web/pull/747
[8]
https://cwiki.apache.org/confluence/display/FLINK/Verifying+a+Flink+Kubernetes+Operator+Release
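For reference, the checksum step the voters quote in their replies ( for i in *.tgz; do echo $i; sha512sum --check $i.sha512; done ) can be sketched end-to-end. The artifact below is a locally generated stand-in, not a real release file; in an actual review you would run the loop inside the rc1 directory checked out from dist.apache.org [1].

```shell
# Stand-in demo of the SHA512 verification loop from the vote replies.
# The .tgz here is generated locally for illustration only.
set -eu
echo "stand-in release content" > flink-kubernetes-operator-1.9.0.tgz
sha512sum flink-kubernetes-operator-1.9.0.tgz \
  > flink-kubernetes-operator-1.9.0.tgz.sha512
# The loop as quoted by the voters:
for i in *.tgz; do echo "$i"; sha512sum --check "$i.sha512"; done
```

A mismatching artifact makes `sha512sum --check` exit non-zero, which is exactly what a -1 vote would report. The GPG step is analogous with `gpg --verify $i.asc $i` after importing the KEYS file [4].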


Re: [VOTE] FLIP-461: Synchronize rescaling with checkpoint creation to minimize reprocessing for the AdaptiveScheduler

2024-06-17 Thread Gyula Fóra
+1 (binding)

Gyula

On Mon, Jun 17, 2024 at 11:29 AM Zakelly Lan  wrote:

> +1 (binding)
>
>
> Best,
> Zakelly
>
> On Mon, Jun 17, 2024 at 5:10 PM Rui Fan <1996fan...@gmail.com> wrote:
>
> > +1 (binding)
> >
> > Best,
> > Rui
> >
> > On Mon, Jun 17, 2024 at 4:58 PM David Morávek 
> > wrote:
> >
> > > +1 (binding)
> > >
> > > On Mon, Jun 17, 2024 at 10:24 AM Matthias Pohl 
> > wrote:
> > >
> > > > Hi everyone,
> > > > the discussion in [1] about FLIP-461 [2] is kind of concluded. I am
> > > > starting a vote on this one here.
> > > >
> > > > The vote will be open for at least 72 hours (i.e. until June 20,
> 2024;
> > > > 8:30am UTC) unless there are any objections. The FLIP will be
> > considered
> > > > accepted if 3 binding votes (from active committers according to the
> > > Flink
> > > > bylaws [3]) are gathered by the community.
> > > >
> > > > Best,
> > > > Matthias
> > > >
> > > > [1] https://lists.apache.org/thread/nnkonmsv8xlk0go2sgtwnphkhrr5oc3y
> > > > [2]
> > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-461%3A+Synchronize+rescaling+with+checkpoint+creation+to+minimize+reprocessing+for+the+AdaptiveScheduler
> > > > [3]
> > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/Flink+Bylaws#FlinkBylaws-Approvals
> > > >
> > >
> >
>


Flink Kubernetes Operator 1.9.0 release planning

2024-06-10 Thread Gyula Fóra
Hi all!

I want to kick off the discussion / release process for the Flink
Kubernetes Operator 1.9.0 version.

The last version, 1.8.0, was released in March, and since then we have had a
number of important fixes. Furthermore, there are some bigger pieces of
outstanding work in the form of open PRs, such as the Savepoint CRD work,
which should only be merged for 1.10.0 to gain more exposure/stability.

I suggest we cut the release branch this week, after merging the current
outstanding smaller PRs.

I volunteer as the release manager but if someone else would like to do it,
I would also be happy to assist.

Please let me know what you think.

Cheers,
Gyula


Savepoints not considered during failover

2024-06-07 Thread Gyula Fóra
Hey Devs!

What is the reason / rationale for savepoints being ignored during failover
scenarios?

I see they are not even recorded as the last valid checkpoint in the HA
metadata (only the checkpoint id counter is bumped), so if the JM fails
after a manually triggered savepoint, the job will still fall back to the
previous checkpoint instead.

I am sure there must have been some discussion around it, but I can't find it.

Thank you!
Gyula


Re: [DISCUSS] FLIP-XXX Add K8S conditions to Flink CRD

2024-05-30 Thread Gyula Fóra
David,

The problem is exactly that ResourceLifecycleStates do not correspond to
specific Job statuses (JobReady condition) in most cases. Let me give you a
concrete example:

ResourceLifecycleState.STABLE means that the app/job defined in the spec has
been successfully deployed and was observed running, and this spec is now
considered to be stable (it won't be rolled back). Once a resource
(FlinkDeployment) has reached the STABLE state, it won't change unless the
user changes the spec. At the same time, this doesn't really say anything
about job health/readiness at any given future time. Ten minutes later the
job can go into an unrecoverable failure loop and never reach a running
status, yet the ResourceLifecycleState will remain STABLE.

This is actually not a problem with the ResourceLifecycleState but more
with the understanding of it. It's called ResourceLifecycleState and not
JobState exactly because it refers to the upgrade/rollback/suspend etc
lifecycle of the FlinkDeployment/FlinkSessionJob resource and not the
underlying flink job itself.

But this is a crucial detail here that we need to consider otherwise the
"Ready" condition that we may create will be practically useless.

This is the reason why @morh...@apache.org  and
I suggest separating this into at least 2 independent conditions. One could
be UpgradeCompleted/ReconciliationCompleted, or something along these
lines, computed based on the LifecycleState (as described in your proposal
but with a different name). The other should be JobReady, which could
initially work based on the JobStatus.state field but ideally would be a
user-configurable ready condition (e.g., the job has been running for at
least 10 minutes, or is running and has taken checkpoints).

These 2 conditions should be enough to start with and would actually
provide tangible value to users. On second thought, we can probably leave
out ClusterReady.
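As a rough, hypothetical illustration of this two-condition split (the status snippet and field names below are simplified stand-ins, not the exact CRD schema; a real status would come from `kubectl get flinkdeployment <name> -o yaml`):

```shell
# Hypothetical status snippet: resource is STABLE but the job is not running,
# which is exactly the case where a single "Ready" condition would mislead.
set -eu
cat > status-sample.yaml <<'EOF'
lifecycleState: STABLE
jobStatus:
  state: RECONCILING
EOF
lifecycle=$(sed -n 's/^lifecycleState: *//p' status-sample.yaml)
job_state=$(sed -n 's/^ *state: *//p' status-sample.yaml)
# UpgradeCompleted tracks only the resource lifecycle:
if [ "$lifecycle" = STABLE ] || [ "$lifecycle" = ROLLED_BACK ]; then
  echo "UpgradeCompleted=True"
else
  echo "UpgradeCompleted=False"
fi
# JobReady tracks the Flink job itself, so the two can disagree:
if [ "$job_state" = RUNNING ]; then
  echo "JobReady=True"
else
  echo "JobReady=False"
fi
```

Here the resource reports UpgradeCompleted=True while JobReady=False, which a single lifecycle-derived "Ready" condition could not express.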

Cheers,
Gyula


On Wed, May 29, 2024 at 5:16 PM David Radley 
wrote:

> Hi Gyula,
> Thank you for the quick response and confirmation we need a Flip. I am not
> an expert at K8s, Lajith will answer in more detail. Some questions I had
> anyway:
>
> I assume each of the ResourceLifecycleStates has a corresponding jobReady
> status. You point out some mistakes in the table, for example that STABLE
> should be NotReady; thank you. If we put a reason mentioning the stable
> state, this would help us understand the jobStatus.
>
> I guess jobReady is one perspective that we know is useful (with corrected
> mappings from ResourceLifecycleState and with reasons). Can I check that
> the 2 proposed conditions would also be useful additions? I assume that in
> your proposal, when jobReady is true, the UpgradeCompleted condition would
> not be present and ClusterReady would always be true? I know conditions do
> not need to be orthogonal, but I wanted to check what your thoughts are.
>
> Kind regards, David.
>
>
>
>
> From: Gyula Fóra 
> Date: Wednesday, 29 May 2024 at 15:28
> To: dev@flink.apache.org 
> Cc: morh...@apache.org 
> Subject: [EXTERNAL] Re: [DISCUSS] FLIP-XXX Add K8S conditions to Flink CRD
> Hi David!
>
> This change definitely warrants a FLIP even if the code change is not huge,
> there are quite some implications going forward.
>
> Looping in @morh...@apache.org  for this discussion.
>
> I have some questions / suggestions regarding the condition's meaning and
> naming.
>
> In your proposal you have:
>  - Ready (True/False) -> This condition is intended for resources which are
> fully ready and operational
> - Error (True) -> This condition can be used in scenarios where any
> exception/error occurs during the resource reconcile process
>
> The problem with the above is that the implementation does not reflect this
> well. ResourceLifecycleState STABLE/ROLLED_BACK does not actually mean the
> job is running, it just means that the resource is fully reconciled and it
> will not be rolled back (so the current pending upgrade is completed). This
> is mainly a fault of the ResourceLifecycleState as it doesn't capture the
> job status but one could argue that it was "designed" this way.
>
> I think we should probably have more condition types to capture the
> difference:
>  - JobReady (True/False) -> Flink job is running (Basically job status but
> with transition time)
>  - ClusterReady (True/False) -> Session / Application cluster is deployed
> (Basically JM deployment status but with transition time)
> -  UpgradeCompleted (True/False) -> Similar to what you call Ready now
> which should correspond to the STABLE/ROLLED_BACK states and mostly tracks
> in-progress CR updates
>
> This is my best idea at the moment; not great, as it feels a little
> redundant with the current status fields. But maybe that's not a problem,
> or it is a way to eliminate the old fields later?

Re: [DISCUSS] FLIP-XXX Add K8S conditions to Flink CRD

2024-05-29 Thread Gyula Fóra
Hi David!

This change definitely warrants a FLIP even if the code change is not huge,
there are quite some implications going forward.

Looping in @morh...@apache.org  for this discussion.

I have some questions / suggestions regarding the condition's meaning and
naming.

In your proposal you have:
 - Ready (True/False) -> This condition is intended for resources which are
fully ready and operational
 - Error (True) -> This condition can be used in scenarios where any
exception/error occurs during the resource reconcile process

The problem with the above is that the implementation does not reflect this
well. ResourceLifecycleState STABLE/ROLLED_BACK does not actually mean the
job is running, it just means that the resource is fully reconciled and it
will not be rolled back (so the current pending upgrade is completed). This
is mainly a fault of the ResourceLifecycleState as it doesn't capture the
job status but one could argue that it was "designed" this way.

I think we should probably have more condition types to capture the
difference:
 - JobReady (True/False) -> Flink job is running (Basically job status but
with transition time)
 - ClusterReady (True/False) -> Session / Application cluster is deployed
(Basically JM deployment status but with transition time)
-  UpgradeCompleted (True/False) -> Similar to what you call Ready now
which should correspond to the STABLE/ROLLED_BACK states and mostly tracks
in-progress CR updates

This is my best idea at the moment; not great, as it feels a little
redundant with the current status fields. But maybe that's not a problem,
or it is a way to eliminate the old fields later?

I am not so sure of the Error status and what this means in practice. Why
do we want to track the last error in 2 places? It's already in the status.

What do you think?
Gyula

On Wed, May 29, 2024 at 3:55 PM David Radley 
wrote:

> Hi,
> Thanks Lajith for raising this discussion thread under the Flip title.
>
> To summarise the concerns from the other discussion thread.
>
> “
> - I echo Gyula that including some examples and further explanations might
> ease reader's work. With the current version, the FLIP is a bit hard to
> follow. - Will the usage of Conditions be enabled by default? Or will there
> be any disadvantages for Flink users? If Conditions with the same type
> already exist in the Status Conditions
>
> - Do you think we should have clear rules about handling rules for how
> these Conditions should be managed, especially when multiple Conditions of
> the same type are present? For example, resource has multiple causes for
> the same condition (e.g., Error due to network and Error due to I/O). Then,
> overriding the old condition with the new one is not the best approach no?
> Please correct me if I misunderstood.
> “
>
> I see the Google doc link has been reformatted to match the Flip template.
>
> To explicitly answer the questions from Jeyhun and Gyula:
> - “Will the usage of Conditions be enabled by default?” Yes, but this is
> just making the status content useful, whereas before it was not useful.
> - in terms of examples, I am not sure what you would like to see, the
> table Lajith provided shows the status for various ResourceLifecycleStates.
> How the operator gets into these states is the current behaviour. The
> change just shows the appropriate corresponding high level status – that
> could be shown on the User Interfaces.
> - “will there be any disadvantages for Flink users?” None; there is just
> more information in the status, without this it is more difficult to work
> out the status of the job.
> - Multiple conditions question. The status shows whether the job is ready
> or not, so as long as the last condition is the one that is shown, all is
> as expected. I don’t think this needs rules for precedence and the like.
> - The condition’s Reason is going to be more specific.
>
> Gyula and Jeyhun, is the google doc clear enough for you now? Do you feel
> your feedback has been addressed? Lajith and I are happy to provide more
> details.
>
> I wonder whether this change really warrants a FLIP, as it is so small. We
> could do this in an issue. WDYT?
>
> Kind regards, David.
>
>
> From: Lajith Koova 
> Date: Wednesday, 29 May 2024 at 13:41
> To: dev@flink.apache.org 
> Subject: [EXTERNAL] [DISCUSS] FLIP-XXX Add K8S conditions to Flink CRD
> Hello ,
>
>
> Discussion thread here:
> https://lists.apache.org/thread/dvy8w17pyjv68c3t962w49frl9odoz4z  to
> discuss a proposal to add Conditions field in the CR status of Flink
> Deployment and FlinkSessionJob.
>
>
> Note: Starting this new thread as the discussion thread title has been
> modified to follow the FLIP process.
>
>
> Thank you.
>
> Unless otherwise stated above:
>
> IBM United Kingdom Limited
> Registered in England and Wales with number 741598
> Registered office: PO Box 41, North Harbour, Portsmouth, Hants. PO6 3AU
>


Re: Discussion: Condition field in the CR status

2024-05-03 Thread Gyula Fóra
Hi Lajith!

Can you please include some examples in the document to help reviewers?
Just some examples with the status and the proposed conditions.

Cheers,
Gyula

On Wed, May 1, 2024 at 9:06 AM Lajith Koova  wrote:

> Hello,
>
>
> Starting discussion thread here to discuss a proposal to add Conditions
> field in the CR status of Flink Deployment and FlinkSessionJob.
>
>
> Here is the google doc with details. Please provide your thoughts/inputs.
>
>
>
> https://docs.google.com/document/d/12wlJCL_Vq2KZnABzK7OR7gAd1jZMmo0MxgXQXqtWODs/edit?usp=sharing
>
>
> Thanks
> Lajith
>


Re: [VOTE] FLIP-446: Kubernetes Operator State Snapshot CRD

2024-04-25 Thread Gyula Fóra
That's my fault, @Robert Metzger . With the new FLIP
process and the lack of Confluence access for non-committers, it is a bit
tricky to keep these in sync :)

Gyula

On Thu, Apr 25, 2024 at 4:17 PM Robert Metzger  wrote:

> In principle I'm +1 on the proposal, but I think the FLIP in the wiki is
> not in sync with the Google doc.
> For example in the Wiki FlinkStateSnapshotSpec.backoffLimit is missing.
>
> On Thu, Apr 25, 2024 at 3:27 PM Thomas Weise  wrote:
>
> > +1 (binding)
> >
> >
> > On Wed, Apr 24, 2024 at 5:14 AM Yuepeng Pan 
> wrote:
> >
> > > +1(non-binding)
> > >
> > >
> > > Best,
> > > Yuepeng Pan
> > >
> > > At 2024-04-24 16:05:07, "Rui Fan" <1996fan...@gmail.com> wrote:
> > > >+1(binding)
> > > >
> > > >Best,
> > > >Rui
> > > >
> > > >On Wed, Apr 24, 2024 at 4:03 PM Mate Czagany 
> > wrote:
> > > >
> > > >> Hi everyone,
> > > >>
> > > >> I'd like to start a vote on the FLIP-446: Kubernetes Operator State
> > > >> Snapshot CRD [1]. The discussion thread is here [2].
> > > >>
> > > >> The vote will be open for at least 72 hours unless there is an
> > > objection or
> > > >> insufficient votes.
> > > >>
> > > >> [1]
> > > >>
> > > >>
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-446%3A+Kubernetes+Operator+State+Snapshot+CRD
> > > >> [2]
> https://lists.apache.org/thread/q5dzjwj0qk34rbg2sczyypfhokxoc3q7
> > > >>
> > > >> Regards,
> > > >> Mate
> > > >>
> > >
> >
>


Re: [VOTE] FLIP-446: Kubernetes Operator State Snapshot CRD

2024-04-24 Thread Gyula Fóra
+1 (binding)

Gyula

On Wed, Apr 24, 2024 at 10:07 AM Ferenc Csaky 
wrote:

> +1 (non-binding), looking forward to this!
>
> Best,
> Ferenc
>
>
>
>
> On Wednesday, April 24th, 2024 at 10:03, Mate Czagany 
> wrote:
>
> >
> >
> > Hi everyone,
> >
> > I'd like to start a vote on the FLIP-446: Kubernetes Operator State
> > Snapshot CRD [1]. The discussion thread is here [2].
> >
> > The vote will be open for at least 72 hours unless there is an objection
> or
> > insufficient votes.
> >
> > [1]
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-446%3A+Kubernetes+Operator+State+Snapshot+CRD
> > [2] https://lists.apache.org/thread/q5dzjwj0qk34rbg2sczyypfhokxoc3q7
> >
> > Regards,
> > Mate
>


Re: [DISCUSS] FLIP-447: Upgrade FRocksDB from 6.20.3 to 8.10.0

2024-04-23 Thread Gyula Fóra
Thank you for driving this effort

+1

Cheers
Gyula

On Wed, 24 Apr 2024 at 06:25, Yuan Mei  wrote:

> Hey Yue,
>
> Thanks for all the great efforts on significantly improving rescaling and
> upgrading RocksDB.
>
> +1 for this.
>
> Best
> Yuan
>
> On Wed, Apr 24, 2024 at 10:46 AM Zakelly Lan 
> wrote:
>
> > Hi Yue,
> >
> > Thanks for this proposal!
> >
> > Given the great improvement we could have, the slight regression in write
> > performance is a worthwhile trade-off, particularly as the mem-table
> > operations contribute only a minor part to the overall overhead. So +1
> for
> > this.
> >
> >
> > Best,
> > Zakelly
> >
> > On Tue, Apr 23, 2024 at 12:53 PM Yun Tang  wrote:
> >
> > > Hi Yue,
> > >
> > > Thanks for driving this work.
> > >
> > > It has been three years since the last major upgrade of FRocksDB, and
> > > this upgrade would be a great improvement to Flink's state backend. +1
> > > for this work.
> > >
> > >
> > > Best
> > > Yun Tang
> > > 
> > > From: Yanfei Lei 
> > > Sent: Tuesday, April 23, 2024 12:50
> > > To: dev@flink.apache.org 
> > > Subject: Re: [DISCUSS] FLIP-447: Upgrade FRocksDB from 6.20.3 to 8.10.0
> > >
> > > Hi Yue & Roman,
> > >
> > > Thanks for initiating this FLIP and all the efforts for the upgrade.
> > >
> > > 8.10.0 introduces some new features, making it possible for Flink to
> > > implement some new exciting features, and the upgrade also makes
> > > FRocksDB easier to maintain, +1 for upgrading.
> > >
> > > I read the FLIP and have a minor comment, it would be better to add
> > > some description about the environment/configuration of the nexmark's
> > > result.
> > >
> > > Roman Khachatryan wrote on Tue, Apr 23, 2024, at 12:07:
> > >
> > > >
> > > > Hi,
> > > >
> > > > Thanks for writing the proposal and preparing the upgrade.
> > > >
> > > > FRocksDB definitely needs to be kept in sync with the upstream and
> the
> > > new
> > > > APIs are necessary for faster rescaling.
> > > > We're already using a similar version internally.
> > > >
> > > > I reviewed the FLIP and it looks good to me (disclaimer: I took part
> in
> > > > some steps of this effort).
> > > >
> > > >
> > > > Regards,
> > > > Roman
> > > >
> > > > On Mon, Apr 22, 2024, 08:11 yue ma  wrote:
> > > >
> > > > > Hi Flink devs,
> > > > >
> > > > > I would like to start a discussion on FLIP-447: Upgrade FRocksDB
> from
> > > > > 6.20.3 to 8.10.0
> > > > >
> > > > >
> > > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-447%3A+Upgrade+FRocksDB+from+6.20.3++to+8.10.0
> > > > >
> > > > > This FLIP proposes upgrading the version of FRocksDB in the Flink
> > > Project
> > > > > from 6.20.3 to 8.10.0.
> > > > > The FLIP mainly introduces the main benefits of upgrading FRocksDB,
> > > > > including the use of IngestDB which can improve Rescaling
> performance
> > > by
> > > > > more than 10 times in certain scenarios, as well as other potential
> > > > > optimization points such as async_io, blob db, and tiered storage.
> > > > > The FLIP also presented test results based on RocksDB 8.10, including
> > > > > StateBenchmark and Nexmark tests.
> > > > > Overall, upgrading FRocksDB may result in a small regression in write
> > > > > performance (which is a very small part of the overall overhead), but
> > > > > it can bring many important performance benefits.
> > > > > So we hope to upgrade the version of FRocksDB through this FLIP.
> > > > >
> > > > > Looking forward to everyone's feedback and suggestions. Thank you!
> > > > > --
> > > > > Best regards,
> > > > > Yue
> > > > >
> > >
> > >
> > >
> > > --
> > > Best,
> > > Yanfei
> > >
> >
>


Re: [DISCUSS] FLIP-446: Kubernetes Operator State Snapshot CRD

2024-04-19 Thread Gyula Fóra
user onboards a new Flink job to the operator.
> But I may not be thinking this through, so please let me know if you
> disagree.
>
> Thank you very much for your questions and suggestions!
>
> [1]
> https://kubernetes.io/docs/reference/using-api/api-concepts/#efficient-detection-of-changes
> [2]
> https://kubernetes.io/docs/reference/using-api/api-concepts/#resource-versions
>
> Regards,
> Mate
>
> Thomas Weise wrote (on Fri, Apr 19, 2024, 11:31):
>
>> Thanks for the proposal.
>>
>> How do you see potential effects on API server performance wrt. number of
>> objects vs mutations? Is the proposal more or less neutral in that regard?
>>
>> Thanks for the thorough feedback Robert.
>>
>> Couple more questions below.
>>
>> -->
>>
>> On Fri, Apr 19, 2024 at 5:07 AM Robert Metzger 
>> wrote:
>>
>> > Hi Mate,
>> > thanks for proposing this, I'm really excited about your FLIP. I hope my
>> > questions make sense to you:
>> >
>> > 1. I would like to discuss the "FlinkStateSnapshot" name and the fact
>> that
>> > users have to use either the savepoint or checkpoint spec inside the
>> > FlinkStateSnapshot.
>> > Wouldn't it be more intuitive to introduce two CRs:
>> > FlinkSavepoint and FlinkCheckpoint
>> > Ideally they can internally share a lot of code paths, but from a users
>> > perspective, the abstraction is much clearer.
>> >
>>
>> There are probably pros and cons either way. For example it is desirable
>> to
>> have a single list of state snapshots when looking for the initial
>> savepoint for a new deployment etc.
>>
>>
>> >
>> > 2. I also would like to discuss SavepointSpec.completed, as this name is
>> > not intuitive to me. How about "ignoreExisting"?
>> >
>> > 3. The FLIP proposal seems to leave error handling to the user, e.g.
>> when
>> > you create a FlinkStateSnapshot, it will just move to status FAILED.
>> > Typically in K8s with the control loop stuff, resources are tried to get
>> > created until success. I think it would be really nice if the
>> > FlinkStateSnapshot or FlinkSavepoint resource would retry based on a
>> > property in the resource. A "FlinkStateSnapshot.retries" number would
>> > indicate how often the user wants the operator to retry creating a
>> > savepoint, "retries = -1" means retry forever. In addition, we could
>> > consider a timeout as well, however, I haven't seen such a concept in
>> K8s
>> > CRs yet.
>> > The benefit of this is that other tools relying on the K8s operator
>> > wouldn't have to implement this retry loop (which is quite natural for
>> > K8s), they would just have to wait for the CR they've created to
>> transition
>> > into COMPLETED:
>> >
>> > 4. FlinkStateSnapshotStatus.error will only show the last error. What
>> > about using Events, so that we can show multiple errors and use the
>> > FlinkStateSnapshotState to report errors?
>> >
>> > 5. I wonder if it makes sense to use something like Pod Conditions (e.g.
>> > Savepoint Conditions):
>> >
>> https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-conditions
>> > to track the completion status. We could have the following conditions:
>> > - Triggered
>> > - Completed
>> > - Failed
>> > The only benefit of this proposal that I see is that it would tell the
>> > user how long it took to create the savepoint.
>> >
>> > 6. You mention that "JobSpec.initialSavepointPath" will be deprecated. I
>> > assume we will introduce a new field for referencing a
>> FlinkStateSnapshot
>> > CR? I think it would be good to cover this in the FLIP.
>> >
>> > Does that mean one would have to create a FlinkStateSnapshot CR when
>> starting a new deployment from savepoint? If so, that's rather
>> complicated.
>> I would prefer something simpler and more concise, and would rather
>> keep initialSavepointPath.
>>
>>
>> >
>> > One minor comment:
>> >
>> > "/** Dispose the savepoints upon CRD deletion. */"
>> >
>> > I think this should be "upon CR deletion", not "CRD deletion".
>> >
>> > Thanks again for this great FLIP!
>> >
>> > Best,
>> > Robert
>> >
>> >
>> > On Fri, Apr 19, 2024 at 9:01 AM Gyula Fóra 
>> wrote:
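The retry behaviour Robert proposes above (a `retries` field where -1 means retry forever and a non-negative value bounds the re-attempts before the resource moves to FAILED) could be sketched roughly as follows. This is an illustrative model only; `SnapshotRetryPolicy` and `runWithRetries` are made-up names, not part of any proposed operator API, and a real operator would requeue the reconciliation instead of looping in place.

```java
import java.util.function.Supplier;

// Illustrative sketch of the proposed retry semantics for snapshot resources:
// retries = -1 retries forever, retries = n allows n additional attempts after
// the first failure before giving up (the CR would then move to FAILED).
final class SnapshotRetryPolicy {

    static <T> T runWithRetries(Supplier<T> triggerSnapshot, int retries) {
        int failures = 0;
        while (true) {
            try {
                return triggerSnapshot.get();
            } catch (RuntimeException e) {
                failures++;
                if (retries >= 0 && failures > retries) {
                    // retries exhausted: in the operator this would mark the
                    // FlinkStateSnapshot as FAILED instead of rethrowing
                    throw e;
                }
                // otherwise retry; a real operator would requeue the
                // reconciliation rather than spin in a loop
            }
        }
    }

    public static void main(String[] args) {
        int[] attempts = {0};
        // fails twice, succeeds on the third attempt; retries = 3 tolerates that
        String path = runWithRetries(() -> {
            if (++attempts[0] < 3) {
                throw new RuntimeException("savepoint trigger failed");
            }
            return "s3://bucket/savepoints/savepoint-1"; // hypothetical path
        }, 3);
        System.out.println(path + " after " + attempts[0] + " attempts");
    }
}
```

With `retries = 2` and a trigger that always fails, the first attempt plus two retries run before the exception propagates, matching the "transition into FAILED" behaviour described above.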

Re: [DISCUSS] FLIP-446: Kubernetes Operator State Snapshot CRD

2024-04-19 Thread Gyula Fóra
Cc'ing some folks who gave positive feedback on this idea in the past.

I would love to hear your thoughts on the proposed design

Gyula

On Tue, Apr 16, 2024 at 6:31 PM Őrhidi Mátyás 
wrote:

> +1 Looking forward to it
>
> On Tue, Apr 16, 2024 at 8:56 AM Mate Czagany  wrote:
>
> > Thank you Gyula!
> >
> > I think that is a great idea. I have updated the Google doc to only have
> 1
> > new configuration option of boolean type, which can be used to signal the
> > Operator to use the old mode.
> >
> > Also described in the configuration description, the Operator will
> fallback
> > to the old mode if the FlinkStateSnapshot CRD cannot be found on the
> > Kubernetes cluster.
> >
> > Regards,
> > Mate
> >
> > Gyula Fóra wrote (on Tue, Apr 16, 2024, 17:01):
> >
> > > Thanks Mate, this is great stuff.
> > >
> > > Mate, I think the new configs should probably default to the new mode
> and
> > > they should only be useful for users to fall back to the old behaviour.
> > > We could by default use the new Snapshot CRD if the CRD is installed,
> > > otherwise use the old mode by default and log a warning on startup.
> > >
> > > So I am suggesting a "dynamic" default behaviour based on whether the
> new
> > > CRD was installed or not because we don't want to break operator
> startup.
> > >
> > > Gyula
> > >
> > > On Tue, Apr 16, 2024 at 4:48 PM Mate Czagany 
> wrote:
> > >
> > > > Hi Ferenc,
> > > >
> > > > Thank you for your comments, I have updated the Google docs with a
> new
> > > > section for the new configs.
> > > > All of the newly added config keys will have defaults set, and by
> > default
> > > > all the savepoint/checkpoint operations will use the old system:
> write
> > > > their results to the FlinkDeployment/FlinkSessionJob status field.
> > > >
> > > > I have also added a default for the checkpoint type to be FULL (which
> > is
> > > > also the default currently). That was an oversight on my part to miss
> > > that.
> > > >
> > > > Regards,
> > > > Mate
> > > >
> > > > > Ferenc Csaky wrote (on Tue, Apr 16, 2024, 16:10):
> > > >
> > > > > Thank you Mate for initiating this discussion. +1 for this idea.
> > > > > Some Qs:
> > > > >
> > > > > Can you specify the newly introduced configurations in more
> > > > > details? Currently, it is not fully clear to me what are the
> > > > > possible values of `kubernetes.operator.periodic.savepoint.mode`,
> > > > > is it optional, has a default value?
> > > > >
> > > > > I see that `SavepointSpec.formatType` has a default, although
> > > > > `CheckpointSpec.checkpointType` does not. Are we inferring that from
> > > > > the config? My point is that, in general, it would be good to
> > > > > handle the two snapshot types in a similar way when it makes sense,
> > > > > to minimize any kind of confusion.
> > > > >
> > > > > Best,
> > > > > Ferenc
> > > > >
> > > > >
> > > > >
> > > > > On Tuesday, April 16th, 2024 at 11:34, Mate Czagany <
> > > czmat...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > >
> > > > > >
> > > > > > Hi Everyone,
> > > > > >
> > > > > > I would like to start a discussion on FLIP-446: Kubernetes
> Operator
> > > > State
> > > > > > Snapshot CRD.
> > > > > >
> > > > > > This FLIP adds a new custom resource for Operator users to create
> > and
> > > > > > manage their savepoints and checkpoints. I have also developed an
> > > > initial
> > > > > > POC to prove that this approach is feasible, you can find the
> link
> > > for
> > > > > that
> > > > > > in the FLIP.
> > > > > >
> > > > > > There is a Confluence page [1] and a Google Docs page [2] as I do
> > not
> > > > > have
> > > > > > a Confluence account yet.
> > > > > >
> > > > > > [1]
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-446%3A+Kubernetes+Operator+State+Snapshot+CRD
> > > > > > [2]
> > > > > >
> > > > >
> > > >
> > >
> >
> https://docs.google.com/document/d/1VdfLFaE4i6ESbCQ38CH7TKOiPQVvXeOxNV2FeSMnOTg
> > > > > >
> > > > > >
> > > > > > Regards,
> > > > > > Mate
> > > > >
> > > >
> > >
> >
>


Re: [ DISCUSS ] FLIP-XXX : [Plugin] Enhancing Flink Failure Management in Kubernetes with Dynamic Termination Log Integration

2024-04-18 Thread Gyula Fóra
Hi Swathi!

Thank you for creating this proposal. I really like the general idea of
increasing the K8s native observability of Flink job errors.

I took a quick look at your reference PR, the termination log related logic
is contained completely in the ClusterEntrypoint. What type of errors will
this actually cover?

To me this seems to cover only:
 - Job main class errors (ie startup errors)
 - JobManager failures

Would regular job errors (that cause only job failover but not JM errors)
be reported somehow with this plugin?

Thanks
Gyula

On Tue, Apr 16, 2024 at 8:21 AM Swathi C  wrote:

> Hi All,
>
> I would like to start a discussion on FLIP-XXX : [Plugin] Enhancing Flink
> Failure Management in Kubernetes with Dynamic Termination Log Integration.
>
>
> https://docs.google.com/document/d/1tWR0Fi3w7VQeD_9VUORh8EEOva3q-V0XhymTkNaXHOc/edit?usp=sharing
>
>
> This FLIP proposes an improvement plugin focused mainly on Flink on K8s,
> but it can also be used as a generic plugin and extended with further
> enhancements.
>
> Looking forward to everyone's feedback and suggestions. Thank you !!
>
> Best Regards,
> Swathi Chandrashekar
>


Re: [DISCUSS] FLIP-446: Kubernetes Operator State Snapshot CRD

2024-04-16 Thread Gyula Fóra
Thanks Mate, this is great stuff.

Mate, I think the new configs should probably default to the new mode and
they should only be useful for users to fall back to the old behaviour.
We could by default use the new Snapshot CRD if the CRD is installed,
otherwise use the old mode by default and log a warning on startup.

So I am suggesting a "dynamic" default behaviour based on whether the new
CRD was installed or not because we don't want to break operator startup.

Gyula

On Tue, Apr 16, 2024 at 4:48 PM Mate Czagany  wrote:

> Hi Ferenc,
>
> Thank you for your comments, I have updated the Google docs with a new
> section for the new configs.
> All of the newly added config keys will have defaults set, and by default
> all the savepoint/checkpoint operations will use the old system: write
> their results to the FlinkDeployment/FlinkSessionJob status field.
>
> I have also added a default for the checkpoint type to be FULL (which is
> also the default currently). That was an oversight on my part to miss that.
>
> Regards,
> Mate
>
> Ferenc Csaky wrote (on Tue, Apr 16, 2024, 16:10):
>
> > Thank you Mate for initiating this discussion. +1 for this idea.
> > Some Qs:
> >
> > Can you specify the newly introduced configurations in more
> > details? Currently, it is not fully clear to me what are the
> > possible values of `kubernetes.operator.periodic.savepoint.mode`,
> > is it optional, has a default value?
> >
> > I see that `SavepointSpec.formatType` has a default, although
> > `CheckpointSpec.checkpointType` does not. Are we inferring that from
> > the config? My point is that, in general, it would be good to
> > handle the two snapshot types in a similar way when it makes sense,
> > to minimize any kind of confusion.
> >
> > Best,
> > Ferenc
> >
> >
> >
> > On Tuesday, April 16th, 2024 at 11:34, Mate Czagany 
> > wrote:
> >
> > >
> > >
> > > Hi Everyone,
> > >
> > > I would like to start a discussion on FLIP-446: Kubernetes Operator
> State
> > > Snapshot CRD.
> > >
> > > This FLIP adds a new custom resource for Operator users to create and
> > > manage their savepoints and checkpoints. I have also developed an
> initial
> > > POC to prove that this approach is feasible, you can find the link for
> > that
> > > in the FLIP.
> > >
> > > There is a Confluence page [1] and a Google Docs page [2] as I do not
> > have
> > > a Confluence account yet.
> > >
> > > [1]
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-446%3A+Kubernetes+Operator+State+Snapshot+CRD
> > > [2]
> > >
> >
> https://docs.google.com/document/d/1VdfLFaE4i6ESbCQ38CH7TKOiPQVvXeOxNV2FeSMnOTg
> > >
> > >
> > > Regards,
> > > Mate
> >
>
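The dynamic fallback Gyula describes above (use the new Snapshot CRD when it is installed, otherwise warn and keep the old status-field behaviour so operator startup never breaks) could be sketched as below. The class, method, and mode names are illustrative only, not the operator's actual API.

```java
// Illustrative sketch of the dynamic default: prefer the new FlinkStateSnapshot
// CRD when it is installed, otherwise fall back to the legacy behaviour of
// writing snapshot results into the FlinkDeployment/FlinkSessionJob status.
final class SnapshotModeResolver {

    enum Mode { SNAPSHOT_CR, LEGACY_STATUS_FIELD }

    static Mode resolve(boolean legacyModeConfigured, boolean snapshotCrdInstalled) {
        if (legacyModeConfigured) {
            // the user explicitly opted out of the new CRD-based flow
            return Mode.LEGACY_STATUS_FIELD;
        }
        if (!snapshotCrdInstalled) {
            // don't break operator startup: warn and keep the old behaviour
            System.err.println(
                    "WARN: FlinkStateSnapshot CRD not found, "
                            + "falling back to legacy snapshot handling");
            return Mode.LEGACY_STATUS_FIELD;
        }
        return Mode.SNAPSHOT_CR;
    }

    public static void main(String[] args) {
        System.out.println(resolve(false, true));  // SNAPSHOT_CR
        System.out.println(resolve(false, false)); // LEGACY_STATUS_FIELD
        System.out.println(resolve(true, true));   // LEGACY_STATUS_FIELD
    }
}
```

The single boolean opt-out flag mirrors the simplification Mate mentions: one configuration option to force the old mode, with the CRD check deciding the default otherwise.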


Re: [DISCUSS] Flink Website Menu Adjustment

2024-03-25 Thread Gyula Fóra
+1 for the proposal

Gyula

On Mon, Mar 25, 2024 at 12:49 PM Leonard Xu  wrote:

> Thanks Zhongqiang for starting this discussion, updating documentation
> menus according to sub-projects' activities makes sense to me.
>
> +1 for the proposed menus:
>
> > After:
> >
> > With Flink
> > With Flink Kubernetes Operator
> > With Flink CDC
> > With Flink ML
> > With Flink Stateful Functions
> > Training Course
>
>
>
> Best,
> Leonard
>
> > On Mar 25, 2024, at 3:48 PM, gongzhongqiang wrote:
> >
> > Hi everyone,
> >
> > I'd like to start a discussion on adjusting the Flink website [1] menu to
> > improve accuracy and usability. While migrating the Flink CDC documentation
> > to the website, I found outdated links, and we need to review and update the
> > menus to surface the most relevant information for our users.
> >
> >
> > Proposal:
> >
> > - Remove Paimon [2] from the "Getting Started" and "Documentation" menus:
> > Paimon [2] is now an independent top-level project of the ASF. CC: Jingsong Lee
> >
> > - Sort the projects in the subdirectory by the activity of the projects.
> > Here I list the number of releases for each project in the past year.
> >
> > Flink Kubernetes Operator : 7
> > Flink CDC : 5
> > Flink ML  : 2
> > Flink Stateful Functions : 1
> >
> >
> > Expected Outcome :
> >
> > - Menu "Getting Started"
> >
> > Before:
> >
> > With Flink
> >
> > With Flink Stateful Functions
> >
> > With Flink ML
> >
> > With Flink Kubernetes Operator
> >
> > With Paimon(incubating) (formerly Flink Table Store)
> >
> > With Flink CDC
> >
> > Training Course
> >
> >
> > After:
> >
> > With Flink
> > With Flink Kubernetes Operator
> >
> > With Flink CDC
> >
> > With Flink ML
> >
> > With Flink Stateful Functions
> >
> > Training Course
> >
> >
> > - Menu "Documentation" will same with "Getting Started"
> >
> >
> > I look forward to hearing your thoughts and suggestions on this proposal.
> >
> > [1] https://flink.apache.org/
> > [2] https://github.com/apache/incubator-paimon
> > [3] https://github.com/apache/flink-statefun
> >
> >
> >
> > Best regards,
> >
> > Zhongqiang Gong
>
>


Re: Flink Kubernetes Operator Failing Over FlinkDeployments to a New Cluster

2024-03-22 Thread Gyula Fóra
I agree, we would need some FLIPs to cover this. Actually there is already
some work on this topic initiated by Matthias Pohl (cc'd).
Please see this:
https://cwiki.apache.org/confluence/display/FLINK/FLIP-360%3A+Merging+the+ExecutionGraphInfoStore+and+the+JobResultStore+into+a+single+component+CompletedJobStore

This FLIP actually covers some of these limitations already and other
outstanding issues in the operator.

Cheers,
Gyula


Re: [VOTE] FLIP-433: State Access on DataStream API V2

2024-03-21 Thread Gyula Fóra
+1 (binding)

Gyula

On Thu, Mar 21, 2024 at 3:33 AM Rui Fan <1996fan...@gmail.com> wrote:

> +1(binding)
>
> Thanks to Weijie for driving this proposal, which solves the problem that I
> raised in FLIP-359.
>
> Best,
> Rui
>
> On Thu, Mar 21, 2024 at 10:10 AM Hangxiang Yu  wrote:
>
> > +1 (binding)
> >
> > On Thu, Mar 21, 2024 at 10:04 AM Xintong Song 
> > wrote:
> >
> > > +1 (binding)
> > >
> > > Best,
> > >
> > > Xintong
> > >
> > >
> > >
> > > On Wed, Mar 20, 2024 at 8:30 PM weijie guo 
> > > wrote:
> > >
> > > > Hi everyone,
> > > >
> > > >
> > > > Thanks for all the feedback about the FLIP-433: State Access on
> > > > DataStream API V2 [1]. The discussion thread is here [2].
> > > >
> > > >
> > > > The vote will be open for at least 72 hours unless there is an
> > > > objection or insufficient votes.
> > > >
> > > >
> > > > Best regards,
> > > >
> > > > Weijie
> > > >
> > > >
> > > > [1]
> > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-433%3A+State+Access+on+DataStream+API+V2
> > > >
> > > > [2] https://lists.apache.org/thread/8ghzqkvt99p4k7hjqxzwhqny7zb7xrwo
> > > >
> > >
> >
> >
> > --
> > Best,
> > Hangxiang.
> >
>


Re: Flink Kubernetes Operator Failing Over FlinkDeployments to a New Cluster

2024-03-20 Thread Gyula Fóra
Sorry for the late reply Kevin.

I think what you are suggesting makes sense, it would be basically a
`last-state` startup mode. This would also help in cases where the current
last-state mechanism fails to locate HA metadata (and the state).

This is somewhat of a tricky feature to implement:
 1. The operator will need FS plugins and access to the different user envs
(this will not work in many prod environments unfortunately)
 2. Flink doesn't expose a good way to detect the latest checkpoint just by
looking at the FS so we need to figure out something here. Probably some
changes are necessary on Flink core side as well

Gyula
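Gyula's second point, detecting the latest checkpoint purely from the file system, could naively be sketched like this, relying on the conventional chk-<n> subdirectory layout with a _metadata file per completed checkpoint. This is a simplified, assumption-laden sketch, not a proposed implementation; as noted above, doing this reliably would likely need support from Flink core.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.Optional;

// Naive sketch: pick the newest checkpoint of a job by scanning its checkpoint
// directory for chk-<n> subdirectories that contain a _metadata file. Fragile
// by design: it says nothing about retained vs. subsumed checkpoints, partial
// writes, or concurrent cleanup, which is exactly why FS scanning alone is not
// a good detection mechanism.
final class LatestCheckpointLocator {

    static Optional<Path> latestCheckpoint(Path checkpointDir) throws IOException {
        try (var entries = Files.list(checkpointDir)) {
            return entries
                    .filter(Files::isDirectory)
                    .filter(p -> p.getFileName().toString().startsWith("chk-"))
                    // only consider checkpoints whose metadata file exists
                    .filter(p -> Files.exists(p.resolve("_metadata")))
                    .max(Comparator.comparingLong(
                            p -> Long.parseLong(p.getFileName().toString().substring(4))));
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("ckpts");
        Files.createDirectories(dir.resolve("chk-3"));
        Files.createFile(dir.resolve("chk-3").resolve("_metadata"));
        Files.createDirectories(dir.resolve("chk-7")); // no _metadata: skipped
        System.out.println(latestCheckpoint(dir)); // chk-3 wins
    }
}
```

A `last-state` startup mode would additionally need the FS plugins and credentials of each user environment, per point 1 above, which is why this stays a sketch.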


Re: [VOTE] FLIP-439: Externalize Kudu Connector from Bahir

2024-03-20 Thread Gyula Fóra
+1 (binding)

Thanks!
Gyula

On Wed, Mar 20, 2024 at 3:36 PM Mate Czagany  wrote:

> +1 (non-binding)
>
> Thank you,
> Mate
>
> Ferenc Csaky wrote (on Wed, Mar 20, 2024, 15:11):
>
> > Hello devs,
> >
> > I would like to start a vote about FLIP-439 [1]. The FLIP proposes
> > externalizing the Kudu connector from the recently retired Apache Bahir
> > project [2] to keep it maintainable and up to date. Discussion thread [3].
> >
> > The vote will be open for at least 72 hours (until 2024 March 23 14:03
> > UTC) unless there
> > are any objections or insufficient votes.
> >
> > Thanks,
> > Ferenc
> >
> > [1]
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-439%3A+Externalize+Kudu+Connector+from+Bahir
> > [2] https://attic.apache.org/projects/bahir.html
> > [3] https://lists.apache.org/thread/oydhcfkco2kqp4hdd1glzy5vkw131rkz
>


Re: [VOTE] Apache Flink Kubernetes Operator Release 1.8.0, release candidate #1

2024-03-18 Thread Gyula Fóra
Hi Max!

+1 (binding)

 - Verified source release, helm chart + checksums / signatures
 - Helm points to correct image
 - Deployed operator, stateful example and executed upgrade + savepoint
redeploy
 - Verified logs
 - Flink web PR looks good +1

A minor correction is that [3] in the email should point to
ghcr.io/apache/flink-kubernetes-operator:91d67d9. But the helm chart and
everything else is correct; it's a typo in the vote email.

Thank you for preparing the release!

Cheers,
Gyula

On Mon, Mar 18, 2024 at 8:26 AM Rui Fan <1996fan...@gmail.com> wrote:

> Thanks Max for driving this release!
>
> +1(non-binding)
>
> - Downloaded artifacts from dist ( svn co
>
> https://dist.apache.org/repos/dist/dev/flink/flink-kubernetes-operator-1.8.0-rc1/
> )
> - Verified SHA512 checksums : ( for i in *.tgz; do echo $i; sha512sum
> --check $i.sha512; done )
> - Verified GPG signatures : ( $ for i in *.tgz; do echo $i; gpg --verify
> $i.asc $i )
> - Build the source with java-11 and java-17 ( mvn -T 20 clean install
> -DskipTests )
> - Verified the license header during build the source
> - Verified that chart and appVersion matches the target release (less the
> index.yaml and Chart.yaml )
> - RC repo works as Helm repo( helm repo add flink-operator-repo-1.8.0-rc1
>
> https://dist.apache.org/repos/dist/dev/flink/flink-kubernetes-operator-1.8.0-rc1/
> )
> - Verified Helm chart can be installed  ( helm install
> flink-kubernetes-operator
> flink-operator-repo-1.8.0-rc1/flink-kubernetes-operator --set
> webhook.create=false )
> - Submitted the autoscaling demo, the autoscaler works well with memory
> tuning (kubectl apply -f autoscaling.yaml)
>   - job.autoscaler.memory.tuning.enabled: "true"
> - Download Autoscaler standalone: wget
>
> https://repository.apache.org/content/repositories/orgapacheflink-1710/org/apache/flink/flink-autoscaler-standalone/1.8.0/flink-autoscaler-standalone-1.8.0.jar
> - Ran Autoscaler standalone locally, it works well with rescale api and
> JDBC state store/event handler
>
> Best,
> Rui
>
> On Fri, Mar 15, 2024 at 1:45 AM Maximilian Michels  wrote:
>
> > Hi everyone,
> >
> > Please review and vote on the release candidate #1 for the version
> > 1.8.0 of the Apache Flink Kubernetes Operator, as follows:
> >
> > [ ] +1, Approve the release
> > [ ] -1, Do not approve the release (please provide specific comments)
> >
> > **Release Overview**
> >
> > As an overview, the release consists of the following:
> > a) Kubernetes Operator canonical source distribution (including the
> > Dockerfile), to be deployed to the release repository at
> > dist.apache.org
> > b) Kubernetes Operator Helm Chart to be deployed to the release
> > repository at dist.apache.org
> > c) Maven artifacts to be deployed to the Maven Central Repository
> > d) Docker image to be pushed to Dockerhub
> >
> > **Staging Areas to Review**
> >
> > The staging areas containing the above mentioned artifacts are as
> > follows, for your review:
> > * All artifacts for (a), (b) can be found in the corresponding dev
> > repository at dist.apache.org [1]
> > * All artifacts for (c) can be found at the Apache Nexus Repository [2]
> > * The docker image for (d) is staged on github [3]
> >
> > All artifacts are signed with the key
> > DA359CBFCEB13FC302A8793FB655E6F7693D5FDE [4]
> >
> > Other links for your review:
> > * JIRA release notes [5]
> > * source code tag "release-1.8.0-rc1" [6]
> > * PR to update the website Downloads page to include Kubernetes
> > Operator links [7]
> >
> > **Vote Duration**
> >
> > The voting time will run for at least 72 hours. It is adopted by
> > majority approval, with at least 3 PMC affirmative votes.
> >
> > **Note on Verification**
> >
> > You can follow the basic verification guide here [8]. Note that you
> > don't need to verify everything yourself, but please make note of what
> > you have tested together with your +- vote.
> >
> > Thanks,
> > Max
> >
> > [1]
> >
> https://dist.apache.org/repos/dist/dev/flink/flink-kubernetes-operator-1.8.0-rc1/
> > [2]
> > https://repository.apache.org/content/repositories/orgapacheflink-1710/
> > [3] ghcr.io/apache/flink-kubernetes-operator:8938658
> > [4] https://dist.apache.org/repos/dist/release/flink/KEYS
> > [5]
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12353866&projectId=12315522
> > [6]
> >
> https://github.com/apache/flink-kubernetes-operator/tree/release-1.8.0-rc1
> > [7] https://github.com/apache/flink-web/pull/726
> > [8]
> >
> https://cwiki.apache.org/confluence/display/FLINK/Verifying+a+Flink+Kubernetes+Operator+Release
> >
>


Re: Unaligned checkpoint blocked by long Async operation

2024-03-15 Thread Gyula Fóra
Posting this to dev as well...

Thanks Zakelly,
Sounds like a solution could be to add a new variant of yield that would
also yield to the checkpoint barrier. That way, operator implementations
could decide whether any state modification may or may not have happened,
and optionally allow a checkpoint to be taken in the "middle of record
processing".

Gyula
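The mailbox-priority issue discussed in this thread can be illustrated with a toy model. This is a simplified, hypothetical sketch, not Flink's actual MailboxExecutor API (`tryYield` here loosely mirrors `MailboxExecutor#yield`): a yield that only accepts mails at or above a minimum priority will never pick up a barrier mail enqueued at priority -1.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model of priority-based mail processing: tryYield(minPriority) only runs
// mails whose priority is at least minPriority, so a checkpoint-barrier mail
// enqueued at priority -1 is never picked up by the default-priority yield that
// AsyncWaitOperator loops on while its work queue is full.
final class MailboxModel {

    record Mail(int priority, Runnable action) {}

    private final Deque<Mail> mails = new ArrayDeque<>();

    void put(int priority, Runnable action) {
        mails.add(new Mail(priority, action));
    }

    /** Run the first mail with priority >= minPriority; report whether one ran. */
    boolean tryYield(int minPriority) {
        for (Mail mail : mails) {
            if (mail.priority() >= minPriority) {
                mails.remove(mail);
                mail.action().run();
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        MailboxModel mailbox = new MailboxModel();
        boolean[] barrierProcessed = {false};
        // the unaligned checkpoint barrier arrives as a priority -1 mail
        mailbox.put(-1, () -> barrierProcessed[0] = true);

        mailbox.tryYield(0); // default-priority yield skips the barrier
        System.out.println("after tryYield(0): " + barrierProcessed[0]);  // false

        mailbox.tryYield(-1); // a yield variant accepting the barrier unblocks it
        System.out.println("after tryYield(-1): " + barrierProcessed[0]); // true
    }
}
```

The second call models the "new variant of yield" suggested above: lowering the accepted priority lets the blocked operator process the barrier without finishing the current record first.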

On Fri, Mar 15, 2024 at 3:49 AM Zakelly Lan  wrote:

> Hi Gyula,
>
> Processing checkpoint halfway through `processElement` is problematic. The
> current element will not be included in the input in-flight data, and we
> cannot assume it has taken effect on the state by user code. So the best
> way is to treat `processElement` as an 'atomic' operation. I guess that's
> why the priority of the cp barrier is set low.
> However, the AsyncWaitOperator is a special case where we know the element
> blocked at `addToWorkQueue` has not started triggering the userFunction.
> Thus I'd suggest putting the element in the queue when the cp barrier
> comes, and taking a snapshot of the whole queue afterwards. The problem
> will be solved. But this approach also involves some code modifications on
> the mailbox executor.
>
>
> Best,
> Zakelly
>
> On Thu, Mar 14, 2024 at 9:15 PM Gyula Fóra  wrote:
>
>> Thank you for the detailed analysis Zakelly.
>>
>> I think we should consider whether yield should process checkpoint
>> barriers because this puts quite a serious limitation on the unaligned
>> checkpoints in these cases.
>> Do you know what is the reason behind the current priority setting? Is
>> there a problem with processing the barrier here?
>>
>> Gyula
>>
>> On Thu, Mar 14, 2024 at 1:22 PM Zakelly Lan 
>> wrote:
>>
>>> Hi Gyula,
>>>
>>> Well, I tried your example in a local mini-cluster, and it seems the source
>>> can take checkpoints but it will block in the following AsyncWaitOperator.
>>> IIUC, the unaligned checkpoint barrier should wait until the current
>>> `processElement` finishes its execution. In your example, the element queue
>>> of `AsyncWaitOperator` will end up full and `processElement` will be
>>> blocked at `addToWorkQueue`. Even though it calls
>>> `mailboxExecutor.yield()`, the checkpoint barrier is still left
>>> unprocessed, since the barrier's priority is -1, lower than what
>>> `yield()` will handle. I verified this using single-step debugging.
>>>
>>> And if one element could finish its async io, the cp barrier can be
>>> processed afterwards. For example:
>>> ```
>>> env.getCheckpointConfig().enableUnalignedCheckpoints();
>>> env.getCheckpointConfig().setCheckpointInterval(10000); // 10s interval
>>> env.getConfig().setParallelism(1);
>>> AsyncDataStream.orderedWait(
>>>         env.fromSequence(Long.MIN_VALUE, Long.MAX_VALUE).shuffle(),
>>>         new AsyncFunction<Long, Long>() {
>>>             boolean first = true;
>>>
>>>             @Override
>>>             public void asyncInvoke(Long aLong, ResultFuture<Long> resultFuture) {
>>>                 if (first) {
>>>                     Executors.newSingleThreadExecutor().execute(() -> {
>>>                         try {
>>>                             // process after 20s, only for the first one
>>>                             Thread.sleep(20000);
>>>                         } catch (Throwable e) {
>>>                         }
>>>                         LOG.info("Complete one");
>>>                         resultFuture.complete(Collections.singleton(1L));
>>>                     });
>>>                     first = false;
>>>                 }
>>>             }
>>>         },
>>>         24,
>>>         TimeUnit.HOURS,
>>>         1)
>>>     .print();
>>> ```
>>> The checkpoint 1 can be normally finished after the "Complete one" log
>>> print.
>>>
>>> I guess the users have no means to solve this problem, we might optimize
>>> this later.
>>>
>>>
>>> Best,
>>> Zakelly
>>>
>>> On Thu, Mar 14, 2024 at 5:41 PM Gyula Fóra  wrote:
>>>
>>>> Hey all!
>>>>
>>>> I encountered a strange and unexpected behaviour when trying to use
>>>> unaligned checkpoints with AsyncIO.
>>>>
>>>> If the async operation queue is full and backpressures the pipeline
>>>> 

Re: Flink Kubernetes Operator Failing Over FlinkDeployments to a New Cluster

2024-03-13 Thread Gyula Fóra
Hey Kevin!

The general mismatch I see here is that operators and resources are pretty
cluster dependent. The operator itself is running in the same cluster so it
feels out of scope to submit resources to different clusters, this doesn't
really sound like what any Kubernetes Operator should do in general.

To me this sounds more like a typical control plane feature that sits above
different environments and operator instances. There are a lot of features
like this, blue/green deployments also fall into this category in my head,
but there are of course many many others.

There may come a time when the Flink community decides to take on such a
scope but it feels a bit too much at this point to try to standardize this.

Cheers,
Gyula

On Wed, Mar 13, 2024 at 9:18 PM Kevin Lam 
wrote:

> Hi Max,
>
> It feels a bit hacky to need to back-up the resources directly from the
> cluster, as opposed to being able to redeploy our checked-in k8s manifests
> such that they failover correctly, but that makes sense to me and we can
> look into this approach. Thanks for the suggestion!
>
> I'd still be interested in hearing the community's thoughts on if we can
> support this in a more first-class way as part of the Apache Flink
> Kubernetes Operator.
>
> Thanks,
> Kevin
>
> On Wed, Mar 13, 2024 at 9:41 AM Maximilian Michels  wrote:
>
> > Hi Kevin,
> >
> > Theoretically, as long as you move over all k8s resources, failover
> > should work fine on the Flink and Flink Operator side. The tricky part
> > is the handover. You will need to backup all resources from the old
> > cluster, shutdown the old cluster, then re-create them on the new
> > cluster. The operator deployment and the Flink cluster should then
> > recover fine (assuming that high availability has been configured and
> > checkpointing is done to persistent storage available in the new
> > cluster). The operator state / Flink state is actually kept in
> > ConfigMaps which would be part of the resource dump.
> >
> > This method has proven to work in case of Kubernetes cluster upgrades.
> > Moving to an entirely new cluster is a bit more involved but exporting
> > all resource definitions and re-importing them into the new cluster
> > should yield the same result as long as the checkpoint paths do not
> > change.
> >
> > Probably something worth trying :)
> >
> > -Max
> >
> >
> >
> > On Wed, Mar 6, 2024 at 9:09 PM Kevin Lam 
> > wrote:
> > >
> > > Another thought could be modifying the operator to have a behaviour
> where
> > > upon first deploy, it optionally (flag/param enabled) finds the most
> > recent
> > > snapshot and uses that as the initialSavepointPath to restore and run
> the
> > > Flink job.
> > >
> > > On Wed, Mar 6, 2024 at 2:07 PM Kevin Lam 
> wrote:
> > >
> > > > Hi there,
> > > >
> > > > We use the Flink Kubernetes Operator, and I am investigating how we
> can
> > > > easily support failing over a FlinkDeployment from one Kubernetes
> > Cluster
> > > > to another in the case of an outage that requires us to migrate a
> large
> > > > number of FlinkDeployments from one K8s cluster to another.
> > > >
> > > > I understand one way to do this is to set `initialSavepoint` on all
> the
> > > > FlinkDeployments to the most recent/appropriate snapshot so the jobs
> > > > continue from where they left off, but for a large number of jobs,
> this
> > > > would be quite a bit of manual labor.
> > > >
> > > > Do others have an approach they are using? Any advice?
> > > >
> > > > Could this be something addressed in a future FLIP? Perhaps we could
> > store
> > > > some kind of metadata in object storage so that the Flink Kubernetes
> > > > Operator can restore a FlinkDeployment from where it left off, even
> if
> > the
> > > > job is shifted to another Kubernetes Cluster.
> > > >
> > > > Looking forward to hearing folks' thoughts!
> > > >
> >
>
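Kevin's idea above — have the operator (or a control-plane script) find the most recent checkpoint and feed it back as the initial savepoint path — can be sketched roughly as follows. This is illustrative only, not operator code: the `chk-<N>` directory naming and the `spec.job.initialSavepointPath` field follow the conventions mentioned in the thread, but verify both against your operator version before relying on them.

```python
# Illustrative sketch: pick the most recent retained checkpoint from a
# job's checkpoint directory listing and patch it into a FlinkDeployment
# manifest before re-applying it on the new cluster.

def latest_checkpoint(paths):
    """Return the checkpoint path with the highest chk-<N> suffix."""
    def chk_id(path):
        name = path.rstrip("/").rsplit("/", 1)[-1]  # e.g. "chk-42"
        return int(name.split("-")[1])
    return max(paths, key=chk_id)

def patch_manifest(manifest, checkpoint_path):
    """Set spec.job.initialSavepointPath so the job resumes from it."""
    job = manifest.setdefault("spec", {}).setdefault("job", {})
    job["initialSavepointPath"] = checkpoint_path
    return manifest

if __name__ == "__main__":
    paths = [
        "s3://bucket/ckpts/job-a/chk-17",
        "s3://bucket/ckpts/job-a/chk-42",
        "s3://bucket/ckpts/job-a/chk-9",
    ]
    deployment = {"spec": {"job": {"jarURI": "local:///opt/job.jar"}}}
    patched = patch_manifest(deployment, latest_checkpoint(paths))
    print(patched["spec"]["job"]["initialSavepointPath"])
    # -> s3://bucket/ckpts/job-a/chk-42
```

In practice the path listing would come from the checkpoint storage (S3/GCS listing) and the patched manifest would be applied with kubectl on the target cluster.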


Re: [DISCUSS] Manual savepoint triggering in flink-kubernetes-operator

2024-03-12 Thread Gyula Fóra
That would be great Mate! If you could draw up a FLIP for this that would
be nice as this is a rather large change that will have a significant
impact for existing users.

If possible it would be good to provide some backward compatibility /
transition period while we preserve the current content of the status so
it's easy to migrate to the new savepoint CRs.

Cheers,
Gyula

On Tue, Mar 12, 2024 at 9:22 PM Mate Czagany  wrote:

> Hi,
>
> I really like this idea as well, I think it would be a great improvement
> compared to how manual savepoints currently work, and suits Kubernetes
> workflows a lot better.
>
> If there are no objections, I can investigate it during the next few weeks
> and see how this could be implemented in the current code.
>
> Cheers,
> Mate
>
> Gyula Fóra  ezt írta (időpont: 2024. márc. 12., K,
> 16:01):
>
> > That's definitely a good improvement Robert and we should add it at some
> > point. At the point in time when this was implemented we went with the
> > current simpler / more lightweight approach.
> > However if anyone is interested in working on this / contributing this
> > improvement I would personally support it.
> >
> > Gyula
> >
> > On Tue, Mar 12, 2024 at 3:53 PM Robert Metzger 
> > wrote:
> >
> > > Have you guys considered making savepoints a first class citizen in the
> > > Kubernetes operator?
> > > E.g. to trigger a savepoint, you create a "FlinkSavepoint" CR, the K8s
> > > operator picks up that resource and tries to create a savepoint
> > > indefinitely until the savepoint has been successfully created. We
> report
> > > the savepoint status and location in the "status" field.
> > >
> > > We could even add an (optional) finalizer to delete the physical
> > savepoint
> > > from the savepoint storage once the "FlinkSavepoint" CR has been
> deleted.
> > > optional: the savepoint spec could contain a field "retain
> > > physical savepoint" or something, that controls the delete behavior.
> > >
> > >
> > > On Thu, Mar 3, 2022 at 4:02 AM Yang Wang 
> wrote:
> > >
> > > > I agree that we could start with the annotation approach and collect
> > the
> > > > feedback at the same time.
> > > >
> > > > Best,
> > > > Yang
> > > >
> > > > Őrhidi Mátyás  于2022年3月2日周三 20:06写道:
> > > >
> > > > > Thank you for your feedback!
> > > > >
> > > > > The annotation on the
> > > > >
> > > > > @ControllerConfiguration(generationAwareEventProcessing = false)
> > > > > FlinkDeploymentController
> > > > >
> > > > > already enables the event triggering based on metadata changes. It
> > was
> > > > set
> > > > > earlier to support some failure scenarios. (It can be used for
> > example
> > > to
> > > > > manually reenable the reconcile loop when it got stuck in an error
> > > phase)
> > > > >
> > > > > I will go ahead and propose a PR using annotations then.
> > > > >
> > > > > Cheers,
> > > > > Matyas
> > > > >
> > > > > On Wed, Mar 2, 2022 at 12:47 PM Yang Wang 
> > > wrote:
> > > > >
> > > > > > I also like the annotation approach since it is more natural.
> > > > > > But I am not sure about whether the meta data change will trigger
> > an
> > > > > event
> > > > > > in java-operator-sdk.
> > > > > >
> > > > > >
> > > > > > Best,
> > > > > > Yang
> > > > > >
> > > > > > Gyula Fóra  于2022年3月2日周三 16:29写道:
> > > > > >
> > > > > > > Thanks Matyas,
> > > > > > >
> > > > > > > From a user perspective I think the annotation is pretty nice
> and
> > > > user
> > > > > > > friendly so I personally prefer that approach.
> > > > > > >
> > > > > > > You said:
> > > > > > >  "It seems, the java-operator-sdk handles the changes of the
> > > > .metadata
> > > > > > and
> > > > > > > .spec fields of custom resources differently."
> > > > > > >
> > > > > > > What implications does this have on the above mentioned 2
> > > approaches?

Re: [DISCUSS] Manual savepoint triggering in flink-kubernetes-operator

2024-03-12 Thread Gyula Fóra
That's definitely a good improvement Robert and we should add it at some
point. At the point in time when this was implemented we went with the
current simpler / more lightweight approach.
However if anyone is interested in working on this / contributing this
improvement I would personally support it.

Gyula

On Tue, Mar 12, 2024 at 3:53 PM Robert Metzger  wrote:

> Have you guys considered making savepoints a first class citizen in the
> Kubernetes operator?
> E.g. to trigger a savepoint, you create a "FlinkSavepoint" CR, the K8s
> operator picks up that resource and tries to create a savepoint
> indefinitely until the savepoint has been successfully created. We report
> the savepoint status and location in the "status" field.
>
> We could even add an (optional) finalizer to delete the physical savepoint
> from the savepoint storage once the "FlinkSavepoint" CR has been deleted.
> optional: the savepoint spec could contain a field "retain
> physical savepoint" or something, that controls the delete behavior.
>
>
> On Thu, Mar 3, 2022 at 4:02 AM Yang Wang  wrote:
>
> > I agree that we could start with the annotation approach and collect the
> > feedback at the same time.
> >
> > Best,
> > Yang
> >
> > Őrhidi Mátyás  于2022年3月2日周三 20:06写道:
> >
> > > Thank you for your feedback!
> > >
> > > The annotation on the
> > >
> > > @ControllerConfiguration(generationAwareEventProcessing = false)
> > > FlinkDeploymentController
> > >
> > > already enables the event triggering based on metadata changes. It was
> > set
> > > earlier to support some failure scenarios. (It can be used for example
> to
> > > manually reenable the reconcile loop when it got stuck in an error
> phase)
> > >
> > > I will go ahead and propose a PR using annotations then.
> > >
> > > Cheers,
> > > Matyas
> > >
> > > On Wed, Mar 2, 2022 at 12:47 PM Yang Wang 
> wrote:
> > >
> > > > I also like the annotation approach since it is more natural.
> > > > But I am not sure about whether the meta data change will trigger an
> > > event
> > > > in java-operator-sdk.
> > > >
> > > >
> > > > Best,
> > > > Yang
> > > >
> > > > Gyula Fóra  于2022年3月2日周三 16:29写道:
> > > >
> > > > > Thanks Matyas,
> > > > >
> > > > > From a user perspective I think the annotation is pretty nice and
> > user
> > > > > friendly so I personally prefer that approach.
> > > > >
> > > > > You said:
> > > > >  "It seems, the java-operator-sdk handles the changes of the
> > .metadata
> > > > and
> > > > > .spec fields of custom resources differently."
> > > > >
> > > > > What implications does this have on the above mentioned 2
> approaches?
> > > > Does
> > > > > it make one more difficult than the other?
> > > > >
> > > > > Cheers
> > > > > Gyula
> > > > >
> > > > >
> > > > >
> > > > > On Tue, Mar 1, 2022 at 1:52 PM Őrhidi Mátyás <
> > matyas.orh...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hi All!
> > > > > >
> > > > > > I'd like to start a quick discussion about the way we allow users
> > to
> > > > > > trigger savepoints manually in the operator [FLINK-26181]
> > > > > > <https://issues.apache.org/jira/browse/FLINK-26181>. There are
> > > > existing
> > > > > > solutions already for this functionality in other operators, for
> > > > example:
> > > > > > - counter based
> > > > > > <
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/spotify/flink-on-k8s-operator/blob/master/docs/savepoints_guide.md#2-taking-savepoints-by-updating-the-flinkcluster-custom-resource
> > > > > > >
> > > > > > - annotation based
> > > > > > <
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/spotify/flink-on-k8s-operator/blob/master/docs/savepoints_guide.md#3-taking-savepoints-by-attaching-annotation-to-the-flinkcluster-custom-resource
> > > > > > >
> > > > > >
> > > > > > We could implement any of these or both or come up with our own
> > > > approach.
> > > > > > It seems, the java-operator-sdk handles the changes of the
> > .metadata
> > > > and
> > > > > > .spec fields of custom resources differently. For further info
> see
> > > the
> > > > > > chapter Generation Awareness and Event Filtering in the docs
> > > > > > <https://javaoperatorsdk.io/docs/features>.
> > > > > >
> > > > > > Let me know what you think.
> > > > > >
> > > > > > Cheers,
> > > > > > Matyas
> > > > > >
> > > > >
> > > >
> > >
> >
>
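The annotation-based triggering discussed in this thread boils down to a simple reconcile rule: fire a savepoint whenever the annotation value differs from the last value recorded in status. A rough sketch under assumed names (the annotation key, status fields, and trigger callback are all made up for illustration; this is not the operator's actual implementation):

```python
# Conceptual sketch of annotation-triggered savepoints: the operation is
# idempotent per annotation value, so repeated reconcile loops with the
# same value do not trigger duplicate savepoints.

def reconcile_savepoint(resource, trigger_savepoint):
    ann = resource.get("metadata", {}).get("annotations", {})
    requested = ann.get("flink/savepoint-trigger")
    status = resource.setdefault("status", {})
    if requested is not None and requested != status.get("lastSavepointTrigger"):
        location = trigger_savepoint()  # would call Flink's REST API in reality
        status["lastSavepointTrigger"] = requested
        status["lastSavepointLocation"] = location
        return True   # savepoint was triggered this loop
    return False      # nothing to do

if __name__ == "__main__":
    cr = {"metadata": {"annotations": {"flink/savepoint-trigger": "1"}}}
    fired = reconcile_savepoint(cr, lambda: "s3://bucket/savepoints/sp-1")
    print(fired, cr["status"]["lastSavepointLocation"])
    # Re-running with the same annotation value is a no-op:
    print(reconcile_savepoint(cr, lambda: "unused"))
```

Robert's FlinkSavepoint-CR proposal moves the same "requested vs. last-observed" comparison into a dedicated resource, which additionally gives each savepoint its own status and lifecycle (including an optional deletion finalizer).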


Re: [DISCUSS] FLIP-425: Asynchronous Execution Model

2024-03-10 Thread Gyula Fóra
Hey!

I may have missed this in the FLIP but did not find it spelled out
explicitly.

Can the user use the entire record processing context in their async
execution?
In other words can they access all functionality from the callback? For
example can they simply reference the keyed context and register a new
timer, get the timestamp etc?

If yes, does this mean that the context itself becomes "immutable" or the
context is switched in the background?

One example use-case would be that you get something from the state and
based on the value you register a timer.

Thanks,
Gyula

On Mon, Mar 11, 2024 at 3:58 AM Yanfei Lei  wrote:

> Hi Jing,
>
> Thanks for your thoughtful feedback!
>
> > does it mean that there will be three mails for Read, Update, and Output
> ?
>
> With this example, there are two mails. The Read is processed by
> `mailboxDefaultAction`[1], and the Update and Output are encapsulated
> as mail.
>
> > does it make sense to encapsulate one mail instead of 3 mails with more
> overhead?
>
> How many mails are encapsulated depends on how the user writes the
> code. The statements in a `then()` will be wrapped into a mail.
> StateFuture is a restricted version of CompletableFuture, their basic
> semantics are consistent.
>
> > Would you like to add more description for cases when exceptions
> happened? E.g. when reading or/and updating State throws IOExceptions.
>
> We describe this in the "Error handling"[2] section. This FLIP also
> adopts the design from FLIP-368, ensuring that all state interfaces
> throw unchecked exceptions and, consequently, do not declare any
> exceptions in their signatures. In cases where an exception occurs
> while accessing the state, the job should fail.
>
> > Is it correct to understand that AEC is stateless?
>
> Great perspective, yes, it can be understood that way.
> AEC is a task-level component. When the job fails or is restarted, the
> runtime status in AEC will be reset.
> In fact, we have considered taking a snapshot of the status in AEC and
> storing it in a checkpoint like "unaligned checkpoint", but since
> callback cannot be serialized, this idea is not feasible for the time
> being.
>
> > would you like to add Pseudo-code for the inFlightRecordNum decrement
> to help us understand the logic better?
>
> This part of the code is a bit scattered, we will try to abstract a
> pseudo-code. You can first refer to the RecordContext-related code [3]
> in the PoC to understand it.
>
> [1]
> https://github.com/apache/flink/blob/master/flink-streaming-java/src/main/java/org/apache/flink/streaming/runtime/tasks/mailbox/MailboxProcessor.java#L81
> [2]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-425%3A+Asynchronous+Execution+Model#FLIP425:AsynchronousExecutionModel-ErrorHandling
> [3]
> https://github.com/ververica/flink-poc/blob/disagg-poc-2/flink-runtime/src/main/java/org/apache/flink/runtime/state/async/RecordContext.java#L77
>
>
> Best,
> Yanfei
>
> Jing Ge  于2024年3月10日周日 23:47写道:
> >
> > Hi Yanfei,
> >
> > Thanks for your proposal! The FLIP contains a lot of great new ideas. I'd
> > like to ask some questions to make sure we are on the same page.
> >
> > > For the asynchronous interface, Record A should run with Read, Update
> and
> > Output, while Record B should stay at the Blocking buffer.
> >
> > 1. With this example, does it mean that there will be three mails for
> Read,
> > Update, and Output ?
> > 2. If yes, since the Read, Update, and Output have to be executed before
> > Record B, does it make sense to encapsulate one mail instead of 3 mails
> > with more overhead? There must be some thoughts behind the design. Look
> > forward to it.
> >
> > > The challenge arises in determining when all the processing logic
> > associated with Record A is fully executed. To address this, we have
> > adopted a reference counting mechanism that tracks ongoing operations
> > (either processing input or executing a callback) related to a single
> > record.
> >
> > The design reminds me of the JVM reference counting for GC. Would you
> like
> > to add more description for cases when exceptions happened? E.g. when
> > reading or/and updating State throws IOExceptions.
> >
> > > In more detail, AEC uses an inFlightRecordNum variable to track the
> > current number of records in progress. Every time the AEC receives a new
> > record, the inFlightRecordNum increases by 1; when all processing and
> > callback for this record have completed, the inFlightRecordNum
> decreases
> > by 1. When processing one checkpoint mail, the current task thread will
> > give up the time slice through the yield() method of the mailbox
> executor,
> > so that the ongoing state request’s callback and the blocking state
> > requests will be drained first until inFlightRecordNum reduces to 0.
> >
> > 1. Speaking of draining, is it correct to understand that AEC is
> stateless?
> > E.g. AEC could be easily scaled out if it became a bottleneck.
> > 2. There are Pseudo-code for the 
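The reference-counting and drain mechanism Yanfei describes can be sketched in a few lines. This is a conceptual illustration only (names are borrowed from the FLIP discussion, not from Flink's code): each record increments inFlightRecordNum on arrival and decrements it once all of its callbacks have run, and a checkpoint "drains" by running pending mails until the counter reaches zero.

```python
# Minimal sketch of the AEC reference-counting idea: queued callbacks
# stand in for mailbox mails; drain() mimics the task thread yielding
# until all in-flight work is finished before the checkpoint proceeds.
from collections import deque

class AsyncExecutionController:
    def __init__(self):
        self.in_flight_record_num = 0
        self.pending_callbacks = deque()

    def on_record(self, callbacks):
        self.in_flight_record_num += 1
        # The record's remaining work is queued as mails; the final one
        # accounts for the record's completion.
        def finish():
            self.in_flight_record_num -= 1
        self.pending_callbacks.extend(callbacks + [finish])

    def drain(self):
        # Run queued mails until no record is in flight any more.
        while self.in_flight_record_num > 0:
            self.pending_callbacks.popleft()()

if __name__ == "__main__":
    aec = AsyncExecutionController()
    log = []
    aec.on_record([lambda: log.append("read"), lambda: log.append("update")])
    aec.on_record([lambda: log.append("output")])
    assert aec.in_flight_record_num == 2
    aec.drain()  # checkpoint barrier: process everything that is queued
    print(aec.in_flight_record_num, log)
    # -> 0 ['read', 'update', 'output']
```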

Re: [DISCUSS] FLIP-423 ~FLIP-428: Introduce Disaggregated State Storage and Management in Flink 2.0

2024-03-06 Thread Gyula Fóra
Hey all!

This is a massive improvement / work. I just started going through the
Flips and have a more or less meta comment.

While it's good to keep the overall architecture discussion here, I think
we should still have separate discussions for each FLIP where we can
discuss interface details etc. With so much content if we start adding
minor comments here that will lead to nowhere but those discussions are
still important and we should have them in separate threads (one for each
FLIP)

What do you think?
Gyula

On Wed, Mar 6, 2024 at 8:50 AM Yanfei Lei  wrote:

> Hi team,
>
> Thanks for your discussion. Regarding FLIP-425, we have supplemented
> several updates to answer high-frequency questions:
>
> 1. We captured a flame graph of the Hashmap state backend in
> "Synchronous execution with asynchronous APIs"[1], which reveals that
> the framework overhead (including reference counting, future-related
> code and so on) consumes about 9% of the keyed operator CPU time.
> 2. We added a set of comparative experiments for watermark processing,
> the performance of Out-Of-Order mode is 70% better than
> strictly-ordered mode under ~140MB state size. Instructions on how to
> run this test have also been added[2].
> 3. Regarding the order of StreamRecord, whether it has state access or
> not. We supplemented a new *Strict order of 'processElement'*[3].
>
> [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-425%3A+Asynchronous+Execution+Model#FLIP425:AsynchronousExecutionModel-SynchronousexecutionwithasynchronousAPIs
> [2]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-425%3A+Asynchronous+Execution+Model#FLIP425:AsynchronousExecutionModel-Strictly-orderedmodevs.Out-of-ordermodeforwatermarkprocessing
> [3]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-425%3A+Asynchronous+Execution+Model#FLIP425:AsynchronousExecutionModel-ElementOrder
>
>
> Best regards,
> Yanfei
>
> Yunfeng Zhou  于2024年3月5日周二 09:25写道:
> >
> > Hi Zakelly,
> >
> > > 5. I'm not very sure ... revisiting this later since it is not
> important.
> >
> > It seems that we still have some details to confirm about this
> > question. Let's postpone this to after the critical parts of the
> > design are settled.
> >
> > > 8. Yes, we had considered ... metrics should be like afterwards.
> >
> > Oh sorry I missed FLIP-431. I'm fine with discussing this topic in
> milestone 2.
> >
> > Looking forward to the detailed design about the strict mode between
> > same-key records and the benchmark results about the epoch mechanism.
> >
> > Best regards,
> > Yunfeng
> >
> > On Mon, Mar 4, 2024 at 7:59 PM Zakelly Lan 
> wrote:
> > >
> > > Hi Yunfeng,
> > >
> > > For 1:
> > > I had a discussion with Lincoln Lee, and I realize it is a common case
> the same-key record should be blocked before the `processElement`. It is
> easier for users to understand. Thus I will introduce a strict mode for
> this and make it default. My rough idea is just like yours, by invoking
> some method of AEC instance before `processElement`. The detailed design
> will be described in FLIP later.
> > >
> > > For 2:
> > > I agree with you. We could throw exceptions for now and optimize this
> later.
> > >
> > > For 5:
> > >>
> > >> It might be better to move the default values to the Proposed Changes
> > >> section instead of making them public for now, as there will be
> > >> compatibility issues once we want to dynamically adjust the thresholds
> > >> and timeouts in future.
> > >
> > > Agreed. The whole framework is under experiment until we think it is
> complete in 2.0 or later. The default value should be better determined
> with more testing results and production experience.
> > >
> > >> The configuration execution.async-state.enabled seems unnecessary, as
> > >> the infrastructure may automatically get this information from the
> > >> detailed state backend configurations. We may revisit this part after
> > >> the core designs have reached an agreement. WDYT?
> > >
> > >
> > > I'm not very sure if there is any use case where users write their
> code using async APIs but run their job in a synchronous way. The first two
> scenarios that come to me are for benchmarking or for a small state, while
> they don't want to rewrite their code. Actually it is easy to support, so
> I'd suggest providing it. But I'm fine with revisiting this later since it
> is not important. WDYT?
> > >
> > > For 8:
> > > Yes, we had considered the I/O metrics group especially the
> back-pressure, idle and task busy per second. In the current plan we can do
> state access during back-pressure, meaning that those metrics for input
> would better be redefined. I suggest we discuss these existing metrics as
> well as some new metrics that should be introduced in FLIP-431 later in
> milestone 2, since we have basically finished the framework thus we will
> have a better view of what metrics should be like afterwards. WDYT?
> > >
> > >
> > > Best,
> > > Zakelly
> > >
> > > On Mon, Mar 4, 
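The "strict mode" Zakelly mentions — blocking a same-key record before `processElement` until the earlier record with that key has finished — can be sketched as a small per-key accounting structure. Names and structure here are illustrative assumptions, not the design that ended up in the FLIP:

```python
# Rough sketch of same-key blocking: a record is processed immediately
# unless another record with the same key is still in flight, in which
# case it waits in a per-key buffer and is released on completion,
# preserving per-key order.
from collections import defaultdict, deque

class KeyAccounting:
    def __init__(self, process):
        self.process = process             # stands in for processElement
        self.blocked = defaultdict(deque)  # key -> waiting records
        self.in_flight = set()             # keys currently occupied

    def submit(self, key, record):
        if key in self.in_flight:
            self.blocked[key].append(record)  # strict order per key
        else:
            self.in_flight.add(key)
            self.process(key, record)

    def complete(self, key):
        if self.blocked[key]:
            self.process(key, self.blocked[key].popleft())
        else:
            self.in_flight.discard(key)

if __name__ == "__main__":
    processed = []
    ka = KeyAccounting(lambda k, r: processed.append((k, r)))
    ka.submit("a", 1)   # runs immediately
    ka.submit("a", 2)   # blocked: key "a" is in flight
    ka.submit("b", 3)   # different key, runs immediately
    ka.complete("a")    # releases record 2
    print(processed)
    # -> [('a', 1), ('b', 3), ('a', 2)]
```

Records with different keys (here "b") are never held up, which is what allows out-of-order execution across keys while keeping per-key semantics strict.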

Re: [DISCUSS] FLIP-433: State Access on DataStream API V2

2024-03-06 Thread Gyula Fóra
Hi Weijie!

Thank you for the proposal.

I have some initial questions to start the discussion:

1. What is the semantics of the usesStates method? When is it called? Can
the used state change dynamically at runtime? Can the logic depend on
something computed in open(..) for example?

Currently state access is pretty dynamic in Flink and I would assume many
jobs create states on the fly based on some required logic. Are we planning
to address these use-cases?

Are we planning to support deleting/dropping states that are not required
anymore?

2. Get state now returns an optional, but you mention that:
" If you want to get a state that is not declared or has no access,
Option#empty is returned."

I think if a state is not declared or otherwise cannot be accessed, an
exception must be thrown. We cannot confuse an empty value with something
inaccessible.

3. The RedistributionMode enum sounds a bit strange to me, as it doesn't
actually specify a mode of redistribution. It feels more like a flag. Can
we simply have an Optional instead?

4. BroadcastStates are currently very limited by only Map-like states, and
the new interface also enforces that.
Can we remove this limitation? If not, should broadcastState declaration
extend mapstate declaration?

Cheers,
Gyula

On Wed, Mar 6, 2024 at 11:18 AM weijie guo 
wrote:

> Hi devs,
>
> I'd like to start a discussion about FLIP-433: State Access on
> DataStream API V2
> [1]. This is the third sub-FLIP of DataStream API V2.
>
>
> After FLIP-410 [2], we can already write a simple stateless job using the
> DataStream V2 API.  But as we all know, stateful computing is Flink's trump
> card. In this FLIP, we will discuss how to declare and access state on
> DataStream API V2 and we manage to avoid some of the shortcomings of V1 in
> this regard.
>
> You can find more details in this FLIP. Its relationship with other
> sub-FLIPs can be found in the umbrella FLIP
> [3]. Looking forward to hearing from you, thanks!
>
>
> Best regards,
>
> Weijie
>
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-433%3A+State+Access+on+DataStream+API+V2
>
> [2]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-410%3A++Config%2C+Context+and+Processing+Timer+Service+of+DataStream+API+V2
>
> [3]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-408%3A+%5BUmbrella%5D+Introduce+DataStream+API+V2
>


Re: Temporal join on rolling aggregate

2024-03-05 Thread Gyula Fóra
Hi Everyone!

I have discussed this with Sébastien Chevalley, he is going to prepare and
drive the FLIP while I will assist him along the way.

Thanks
Gyula

On Tue, Mar 5, 2024 at 9:57 AM  wrote:

> I do agree with Ron Liu.
> This would definitely need a FLIP as it would impact SQL and extend it
> with the equivalent of TimestampAssigners in the Java API.
>
> Is there any existing JIRA here, or is anybody willing to drive a FLIP?
> On Feb 26, 2024 at 02:36 +0100, Ron liu , wrote:
>
> +1,
> But I think this should be a more general requirement, that is, support for
> declaring watermarks in query, which can be declared for any type of
> source, such as table, view. Similar to databricks provided [1], this needs
> a FLIP.
>
> [1]
>
> https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-qry-select-watermark.html
>
> Best,
> Ron
>
>


Re: [VOTE] FLIP-314: Support Customized Job Lineage Listener

2024-02-28 Thread Gyula Fóra
+1 (binding)

Gyula

On Wed, Feb 28, 2024 at 11:10 AM Maciej Obuchowski 
wrote:

> +1 (non-binding)
>
> Best,
> Maciej Obuchowski
>
> śr., 28 lut 2024 o 10:29 Zhanghao Chen 
> napisał(a):
>
> > +1 (non-binding)
> >
> > Best,
> > Zhanghao Chen
> > 
> > From: Yong Fang 
> > Sent: Wednesday, February 28, 2024 10:12
> > To: dev 
> > Subject: [VOTE] FLIP-314: Support Customized Job Lineage Listener
> >
> > Hi devs,
> >
> > I would like to restart a vote about FLIP-314: Support Customized Job
> > Lineage Listener[1].
> >
> > Previously, we added lineage related interfaces in FLIP-314. Before the
> > interfaces were developed and merged into the master, @Maciej and
> > @Zhenqiu provided valuable suggestions for the interface from the
> > perspective of the lineage system. So we updated the interfaces of
> FLIP-314
> > and discussed them again in the discussion thread [2].
> >
> > So I am here to initiate a new vote on FLIP-314, the vote will be open
> for
> > at least 72 hours unless there is an objection or insufficient votes
> >
> > [1]
> >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-314%3A+Support+Customized+Job+Lineage+Listener
> > [2] https://lists.apache.org/thread/wopprvp3ww243mtw23nj59p57cghh7mc
> >
> > Best,
> > Fang Yong
> >
>


Re: Temporal join on rolling aggregate

2024-02-22 Thread Gyula Fóra
Posting this to dev as well as it potentially has some implications on
development effort.

What seems to be the problem here is that we cannot control/override
Timestamps/Watermarks/Primary key on VIEWs. It's understandable that you
cannot create a PRIMARY KEY on the view but I think the temporal join also
should not require the PK; should we remove this limitation?

The general problem is the inflexibility of the timestamp/watermark
handling on query outputs, which makes this again impossible.

The workaround here can be to write the rolling aggregate to Kafka, read it
back again and join with that. The fact that this workaround is possible
actually highlights the need for more flexibility on the query/view side in
my opinion.

Has anyone else run into this issue and considered the proper solution to
the problem? Feels like it must be pretty common :)

Cheers,
Gyula




On Wed, Feb 21, 2024 at 10:29 PM Sébastien Chevalley 
wrote:

> Hi,
>
> I have been trying to write a temporal join in SQL done on a rolling
> aggregate view. However it does not work and throws :
>
> org.apache.flink.table.api.ValidationException: Event-Time Temporal Table
> Join requires both primary key and row time attribute in versioned table,
> but no row time attribute can be found.
>
> It seems that after the aggregation, the table loses the watermark and
> it's not possible to add one with the SQL API as it's a view.
>
> CREATE TABLE orders (
> order_id INT,
> price DECIMAL(6, 2),
> currency_id INT,
> order_time AS NOW(),
> WATERMARK FOR order_time AS order_time - INTERVAL '2' SECOND
> )
> WITH (
> 'connector' = 'datagen',
> 'rows-per-second' = '10',
> 'fields.order_id.kind' = 'sequence',
> 'fields.order_id.start' = '1',
> 'fields.order_id.end' = '10',
> 'fields.currency_id.min' = '1',
> 'fields.currency_id.max' = '20'
> );
>
> CREATE TABLE currency_rates (
> currency_id INT,
> conversion_rate DECIMAL(4, 3),
> PRIMARY KEY (currency_id) NOT ENFORCED
> )
> WITH (
> 'connector' = 'datagen',
> 'rows-per-second' = '10',
> 'fields.currency_id.min' = '1',
> 'fields.currency_id.max' = '20'
> );
>
> CREATE TEMPORARY VIEW max_rates AS (
> SELECT
> currency_id,
> MAX(conversion_rate) AS max_rate
> FROM currency_rates
> GROUP BY currency_id
> );
>
> CREATE TEMPORARY VIEW temporal_join AS (
> SELECT
> order_id,
> max_rates.max_rate
> FROM orders
>  LEFT JOIN max_rates FOR SYSTEM_TIME AS OF orders.order_time
>  ON orders.currency_id = max_rates.currency_id
> );
>
> SELECT * FROM temporal_join;
>
> Am I missing something? What would be a good starting point to address
> this?
>
> Thanks in advance,
> Sébastien Chevalley


Re: FLINK-21672

2024-02-17 Thread Gyula Fóra
Makes sense to defer this to the next release

On Fri, 16 Feb 2024 at 17:52, Martijn Visser 
wrote:

> Well I wouldn't be too comfortable with merging a change about how
> Flink works on different JDKs at this stage, especially since we
> already have to test for 4 different JDKs at this point.
>
> On Fri, Feb 16, 2024 at 5:00 PM Gyula Fóra  wrote:
> >
> > Depending on the scope of this change, it may not be considered a feature
> > right Martijn?
> > If it's a test improvement, can it still be part of the release?
> >
> > Gyula
> >
> > On Fri, Feb 16, 2024 at 4:45 PM Martijn Visser  >
> > wrote:
> >
> > > Hi David,
> > >
> > > Happy to assign it to you. It can't be merged for Flink 1.19 anymore
> > > though: feature freeze has started for that one, as announced
> > > previously on the mailing list.
> > >
> > > Best regards,
> > >
> > > Martijn
> > >
> > > On Fri, Feb 16, 2024 at 4:32 PM David Radley 
> > > wrote:
> > > >
> > > > Hi,
> > > > I see https://issues.apache.org/jira/browse/FLINK-21672 has been
> open
> > > for a while. We at IBM are building Flink with the latest v11  Semeru
> JDK (
> > > https://developer.ibm.com/languages/java/semeru-runtimes/).
> > > > Flink fails to build with skipTests. It fails because
> > > sun.management.VMManagement class
> > > > Cannot be found at build time. I see some logic in the Flink code to
> > > tolerate the lack of com.sun packages, but not this sun package. We
> get:
> > > >
> > > >
> > > > ERROR] Failed to execute goal
> > > org.apache.maven.plugins:maven-compiler-plugin:3.8.0:compile
> > > (default-compile) on project flink-local-recovery-and-allocation-test:
> > > Compilation failure: Compilation failure:
> > > >
> > > > [ERROR]
> > >
> /Users/davidradley/flinkapicurio/flink-end-to-end-tests/flink-local-recovery-and-allocation-test/src/main/java/org/apache/flink/streaming/tests/StickyAllocationAndLocalRecoveryTestJob.java:[418,23]
> > > cannot find symbol
> > > >
> > > > [ERROR]   symbol:   class VMManagement
> > > >
> > > > [ERROR]   location: package sun.management
> > > >
> > > > [ERROR]
> > >
> /Users/davidradley/flinkapicurio/flink-end-to-end-tests/flink-local-recovery-and-allocation-test/src/main/java/org/apache/flink/streaming/tests/StickyAllocationAndLocalRecoveryTestJob.java:[418,59]
> > > cannot find symbol
> > > >
> > > > [ERROR]   symbol:   class VMManagement
> > > >
> > > > [ERROR]   location: package sun.management
> > > >
> > > >
> > > > As per the link in the issue, sun. packages are not supported or
> part of
> > > the JDK after java 1.7.
> > > >
> > > > I would like to have the priority raised on this Jira and would like
> to
> > > change the code so it builds successfully by  removing the dependency
> on
> > > this old / unsupported sun package . I am happy to work on this, if
> you are
> > > willing to support this by assigning me the Jira and merging the fix;
> > > ideally we would like this to be in the next release - Flink 1.19,
> > > >  Kind regards, David.
> > > >
> > > > Unless otherwise stated above:
> > > >
> > > > IBM United Kingdom Limited
> > > > Registered in England and Wales with number 741598
> > > > Registered office: PO Box 41, North Harbour, Portsmouth, Hants. PO6
> 3AU
> > >
>


Re: FLINK-21672

2024-02-16 Thread Gyula Fóra
Depending on the scope of this change, it may not be considered a feature
right Martijn?
If it's a test improvement, can it still be part of the release?

Gyula

On Fri, Feb 16, 2024 at 4:45 PM Martijn Visser 
wrote:

> Hi David,
>
> Happy to assign it to you. It can't be merged for Flink 1.19 anymore
> though: feature freeze has started for that one, as announced
> previously on the mailing list.
>
> Best regards,
>
> Martijn
>
> On Fri, Feb 16, 2024 at 4:32 PM David Radley 
> wrote:
> >
> > Hi,
> > I see https://issues.apache.org/jira/browse/FLINK-21672 has been open
> > for a while. We at IBM are building Flink with the latest v11 Semeru JDK
> > (https://developer.ibm.com/languages/java/semeru-runtimes/).
> > Flink fails to build with skipTests. It fails because the
> > sun.management.VMManagement class
> > cannot be found at build time. I see some logic in the Flink code to
> > tolerate the lack of com.sun packages, but not this sun package. We get:
> >
> >
> > ERROR] Failed to execute goal
> org.apache.maven.plugins:maven-compiler-plugin:3.8.0:compile
> (default-compile) on project flink-local-recovery-and-allocation-test:
> Compilation failure: Compilation failure:
> >
> > [ERROR]
> /Users/davidradley/flinkapicurio/flink-end-to-end-tests/flink-local-recovery-and-allocation-test/src/main/java/org/apache/flink/streaming/tests/StickyAllocationAndLocalRecoveryTestJob.java:[418,23]
> cannot find symbol
> >
> > [ERROR]   symbol:   class VMManagement
> >
> > [ERROR]   location: package sun.management
> >
> > [ERROR]
> /Users/davidradley/flinkapicurio/flink-end-to-end-tests/flink-local-recovery-and-allocation-test/src/main/java/org/apache/flink/streaming/tests/StickyAllocationAndLocalRecoveryTestJob.java:[418,59]
> cannot find symbol
> >
> > [ERROR]   symbol:   class VMManagement
> >
> > [ERROR]   location: package sun.management
> >
> >
> > As per the link in the issue, sun.* packages are not supported or part of
> > the JDK after Java 1.7.
> >
> > I would like to have the priority raised on this Jira and would like to
> > change the code so it builds successfully by removing the dependency on
> > this old / unsupported sun package. I am happy to work on this, if you are
> > willing to support this by assigning me the Jira and merging the fix;
> > ideally we would like this to be in the next release - Flink 1.19.
> >  Kind regards, David.
> >
> > Unless otherwise stated above:
> >
> > IBM United Kingdom Limited
> > Registered in England and Wales with number 741598
> > Registered office: PO Box 41, North Harbour, Portsmouth, Hants. PO6 3AU
>
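For context on the fix being discussed: the `sun.management.VMManagement` reflection in the test job is used to obtain the JVM's process ID. Below is a hedged sketch of getting the same information through supported APIs only — Java 9+ `ProcessHandle`, plus the legacy `RuntimeMXBean` name-parsing trick (the `"pid@host"` format is a long-standing HotSpot convention, not a spec guarantee). This is an illustration, not necessarily the change that was merged.

```java
import java.lang.management.ManagementFactory;

public class PidWithoutSunManagement {
    public static void main(String[] args) {
        // Java 9+: standard, supported API for the current process ID.
        long pid = ProcessHandle.current().pid();

        // Pre-Java-9 style fallback: the RuntimeMXBean name is usually "pid@host"
        // on HotSpot-derived JVMs (a convention, not a specification guarantee).
        String jvmName = ManagementFactory.getRuntimeMXBean().getName();
        long legacyPid = Long.parseLong(jvmName.split("@")[0]);

        System.out.println("pid=" + pid + ", legacyPid=" + legacyPid);
    }
}
```

Either approach avoids reflective access to `sun.management`, which is exactly what fails on JDKs that no longer ship those internals.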


Re: [DISCUSS] Kubernetes Operator 1.8.0 release planning

2024-02-06 Thread Gyula Fóra
Given the proposed timeline was a bit short / rushed, I agree with Max that
we could wait 1-2 more weeks to wrap up the current outstanding bigger
features around memory tuning and the JDBC state store.

In the meantime it would be great to involve 1-2 new committers (or other
contributors) in the operator release process so that we have some fresh
eyes on the process.
Would anyone be interested in volunteering to help with the next release?

Cheers,
Gyula

On Tue, Feb 6, 2024 at 4:35 PM Maximilian Michels  wrote:

> Thanks for starting the discussion Gyula!
>
> It comes down to how important the outstanding changes are for the
> release. Both the memory tuning as well as the JDBC changes probably
> need 1-2 weeks realistically to complete the initial spec. For the
> memory tuning, I would prefer merging it in the current state as an
> experimental feature for the release which comes disabled out of the
> box. The reason is that it can already be useful to users who want to
> try it out; we have seen some interest in it. Then for the next
> release we will offer a richer feature set and might enable it by
> default.
>
> Cheers,
> Max
>
> On Tue, Feb 6, 2024 at 10:53 AM Rui Fan <1996fan...@gmail.com> wrote:
> >
> > Thanks Gyula for driving this release!
> >
> > Releasing 1.8.0 makes sense to me.
> >
> > As you said, I'm developing the JDBC event handler.
> > I'm going on vacation starting this Friday, and I have some other
> > work to finish before then. After evaluating my time today, I found
> > that I cannot complete the development, testing, and merging of the
> > JDBC event handler this week, so I tend to put the JDBC event
> > handler in the next version.
> >
> > Best,
> > Rui
> >
> > On Mon, Feb 5, 2024 at 11:42 PM Gyula Fóra  wrote:
> >
> > > Hi all!
> > >
> > > I would like to kick off the release planning for the operator 1.8.0
> > > release. The last operator release was November 22 last year. Since
> then we
> > > have added a number of fixes and improvements to both the operator and
> the
> > > autoscaler logic.
> > >
> > > There are a few outstanding PRs currently, including some larger
> features
> > > for the Autoscaler (JDBC event handler, Heap tuning), we have to make a
> > > decision regarding those as well whether to include in the release or
> not. @Maximilian
> > > Michels  , @Rui Fan <1996fan...@gmail.com> what's your
> > > take regarding those PRs? I generally like to be a bit more
> conservative
> > > with large new features to avoid introducing last minute instabilities.
> > >
> > > My proposal would be to aim for the end of this week as the freeze date
> > > (Feb 9) and then we can prepare RC1 on Monday.
> > >
> > > I am happy to volunteer as a release manager but I am of course open to
> > > working together with someone on this.
> > >
> > > What do you think?
> > >
> > > Cheers,
> > > Gyula
> > >
>


[DISCUSS] Kubernetes Operator 1.8.0 release planning

2024-02-05 Thread Gyula Fóra
Hi all!

I would like to kick off the release planning for the operator 1.8.0
release. The last operator release was November 22 last year. Since then we
have added a number of fixes and improvements to both the operator and the
autoscaler logic.

There are a few outstanding PRs currently, including some larger features
for the Autoscaler (JDBC event handler, Heap tuning); we have to decide
whether to include those in the release or not. @Maximilian
Michels, @Rui Fan <1996fan...@gmail.com> what's your take
regarding those PRs? I generally like to be a bit more conservative with
large new features to avoid introducing last minute instabilities.

My proposal would be to aim for the end of this week as the freeze date
(Feb 9) and then we can prepare RC1 on Monday.

I am happy to volunteer as a release manager but I am of course open to
working together with someone on this.

What do you think?

Cheers,
Gyula


Re: DataOutputSerializer serializing long UTF Strings

2024-01-23 Thread Gyula Fóra
Hi Peter!

I think this is a good additional serialization utility to Flink that may
benefit different data formats / connectors in the future.

+1

Cheers,
Gyula

On Mon, Jan 22, 2024 at 8:04 PM Steven Wu  wrote:

> I think this is a reasonable extension to `DataOutputSerializer`. Although
> 64 KB is not small, it is still possible to have long strings over that
> limit. There are already precedents of extending the
> `DataOutputSerializer` API, e.g.:
>
> public void setPosition(int position) {
>     Preconditions.checkArgument(
>         position >= 0 && position <= this.position, "Position out of bounds.");
>     this.position = position;
> }
>
> public void setPositionUnsafe(int position) {
>     this.position = position;
> }
>
>
> On Fri, Jan 19, 2024 at 2:51 AM Péter Váry 
> wrote:
>
> > Hi Team,
> >
> > During the root cause analysis of an Iceberg serialization issue [1], we
> > have found that *DataOutputSerializer.writeUTF* has a hard limit on the
> > length of the string (64k). This is inherited from the
> > *DataOutput.writeUTF*
> > method, where the JDK specifically defines this limit [2].
> >
> > For our use-case we need to enable the possibility to serialize longer
> > UTF strings, so we will need to define a *writeLongUTF* method with a
> > similar specification to *writeUTF*, but without the length limit.
> >
> > My question is:
> > - Is it something which would be useful for every Flink user? Shall we
> add
> > this method to *DataOutputSerializer*?
> > - Is it very specific for Iceberg, and we should keep it in Iceberg
> > connector code?
> >
> > Thanks,
> > Peter
> >
> > [1] - https://github.com/apache/iceberg/issues/9410
> > [2] -
> >
> >
> https://docs.oracle.com/javase/8/docs/api/java/io/DataOutput.html#writeUTF-java.lang.String-
> >
>
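To make the proposal concrete: `DataOutput.writeUTF` prefixes the payload with a 2-byte unsigned length, hence the 65,535-byte cap. A minimal sketch of the `writeLongUTF` idea — a 4-byte int length prefix followed by raw UTF-8 bytes (method names and framing are illustrative, not the final Flink API):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class LongUtfSketch {
    // writeUTF uses an unsigned 16-bit length prefix, capping strings at
    // 65,535 encoded bytes. This sketch uses a 32-bit prefix instead.
    static void writeLongUTF(DataOutputStream out, String s) throws IOException {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        out.writeInt(bytes.length);
        out.write(bytes);
    }

    static String readLongUTF(DataInputStream in) throws IOException {
        int len = in.readInt();
        byte[] bytes = new byte[len];
        in.readFully(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        String big = "x".repeat(100_000); // exceeds the 64 KB writeUTF limit
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        writeLongUTF(new DataOutputStream(buf), big);
        String back = readLongUTF(
                new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));
        System.out.println("round-trip ok: " + back.equals(big));
    }
}
```

Note this sketch writes standard UTF-8, whereas `DataOutput.writeUTF` uses modified UTF-8; the real implementation would need to pick one encoding and document it.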


Re: Re: Re: [VOTE] Accept Flink CDC into Apache Flink

2024-01-14 Thread Gyula Fóra
+1 (binding)

On Sun, Jan 14, 2024 at 9:01 AM Yun Tang  wrote:

> +1 (non-binding)
>
> On 2024/01/14 02:26:13 Yun Gao wrote:
> > +1 (binding)
> >
> > On Sat, Jan 13, 2024 at 10:08 AM Rodrigo Meneses 
> wrote:
> > >
> > > +1 (non binding)
> > >
> > > On Fri, Jan 12, 2024 at 5:45 PM Dong Lin  wrote:
> > >
> > > > +1 (binding)
> > > >
> > > > On Sat, Jan 13, 2024 at 6:04 AM Austin Bennett 
> wrote:
> > > >
> > > > > +1 (non-binding)
> > > > >
> > > > > On Fri, Jan 12, 2024 at 5:44 PM Becket Qin 
> wrote:
> > > > >
> > > > > > +1 (binding)
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Jiangjie (Becket) Qin
> > > > > >
> > > > > > On Fri, Jan 12, 2024 at 5:58 AM Zhijiang <
> wangzhijiang...@aliyun.com
> > > > > > .invalid>
> > > > > > wrote:
> > > > > >
> > > > > > > +1 (binding)
> > > > > > > Best,
> > > > > > > Zhijiang
> > > > > > >
> --
> > > > > > > From: Kurt Yang 
> > > > > > > Send Time: Friday, Jan 12, 2024, 15:31
> > > > > > > To: dev
> > > > > > > Subject: Re: Re: Re: [VOTE] Accept Flink CDC into Apache Flink
> > > > > > > +1 (binding)
> > > > > > > Best,
> > > > > > > Kurt
> > > > > > > On Fri, Jan 12, 2024 at 2:21 PM Hequn Cheng 
> > > > wrote:
> > > > > > > > +1 (binding)
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Hequn
> > > > > > > >
> > > > > > > > On Fri, Jan 12, 2024 at 2:19 PM godfrey he <
> godfre...@gmail.com>
> > > > > > wrote:
> > > > > > > >
> > > > > > > > > +1 (binding)
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Godfrey
> > > > > > > > >
> > > > > > > > > > Zhu Zhu wrote on Fri, Jan 12, 2024 at 14:10:
> > > > > > > > > >
> > > > > > > > > > +1 (binding)
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Zhu
> > > > > > > > > >
> > > > > > > > > > > Hangxiang Yu wrote on Thu, Jan 11, 2024 at 14:26:
> > > > > > > > > >
> > > > > > > > > > > +1 (non-binding)
> > > > > > > > > > >
> > > > > > > > > > > On Thu, Jan 11, 2024 at 11:19 AM Xuannan Su <
> > > > > > suxuanna...@gmail.com
> > > > > > > >
> > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > +1 (non-binding)
> > > > > > > > > > > >
> > > > > > > > > > > > Best,
> > > > > > > > > > > > Xuannan
> > > > > > > > > > > >
> > > > > > > > > > > > On Thu, Jan 11, 2024 at 10:28 AM Xuyang <
> > > > xyzhong...@163.com>
> > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > +1 (non-binding)--
> > > > > > > > > > > > >
> > > > > > > > > > > > > Best!
> > > > > > > > > > > > > Xuyang
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > On 2024-01-11 10:00:11, "Yang Wang" <wangyang0...@apache.org> wrote:
> > > > > > > > > > > > > >+1 (binding)
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >Best,
> > > > > > > > > > > > > >Yang
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >On Thu, Jan 11, 2024 at 9:53 AM liu ron <
> > > > > ron9@gmail.com
> > > > > > >
> > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >> +1 non-binding
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > >> Best
> > > > > > > > > > > > > >> Ron
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > >> Matthias Pohl wrote on Wed, Jan 10, 2024 at 23:05:
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > >> > +1 (binding)
> > > > > > > > > > > > > >> >
> > > > > > > > > > > > > >> > On Wed, Jan 10, 2024 at 3:35 PM ConradJam <
> > > > > > > > > jam.gz...@gmail.com>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > > >> >
> > > > > > > > > > > > > >> > > +1 non-binding
> > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > >> > > > Dawid Wysakowicz wrote on Wed, Jan 10, 2024 at 21:06:
> > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > >> > > > +1 (binding)
> > > > > > > > > > > > > >> > > > Best,
> > > > > > > > > > > > > >> > > > Dawid
> > > > > > > > > > > > > >> > > >
> > > > > > > > > > > > > >> > > > On Wed, 10 Jan 2024 at 11:54, Piotr
> Nowojski <
> > > > > > > > > > > > pnowoj...@apache.org>
> > > > > > > > > > > > > >> > > wrote:
> > > > > > > > > > > > > >> > > >
> > > > > > > > > > > > > >> > > > > +1 (binding)
> > > > > > > > > > > > > >> > > > >
> > > > > > > > > > > > > >> > > > > On Wed, Jan 10, 2024 at 11:25, Martijn Visser
> > > > > > > > > > > > > >> > > > > <martijnvis...@apache.org> wrote:
> > > > > > > > > > > > > >> > > > >
> > > > > > > > > > > > > >> > > > > > +1 (binding)
> > > > > > > > > > > > > >> > > > > >
> > > > > > > > > > > > > >> > > > > > On Wed, Jan 10, 2024 at 4:43 AM Xingbo
> > > > Huang <
> > > > > > > > > > > > hxbks...@gmail.com
> > > > > > > > > > > > > >> >
> > > > > > > > > > > > > >> > > > wrote:
> > > > > > > > > > > > > >> > > > > > >
> > > > > > > > > > > > > >> > > > > > > +1 (binding)
> > > > > > > > > > > > > >> > > > > 

Re: Flink pending record metric weired after autoscaler rescaling

2024-01-12 Thread Gyula Fóra
Could this be related to the issue reported here?
https://issues.apache.org/jira/browse/FLINK-34063

Gyula

On Wed, Jan 10, 2024 at 4:04 PM Yang LI  wrote:

> Just to give more context, my setup uses Apache Flink 1.18 with the
> adaptive scheduler enabled, issues happen randomly particularly
> post-restart behaviors.
>
> After each restart, the system logs indicate "Adding split(s) to reader:",
> signifying the reassignment of partitions across different TaskManagers. An
> anomaly arises with specific partitions, for example, partition-10. This
> partition does not appear in the logs immediately post-restart. It remains
> unlogged for several hours, during which no data consumption from
> partition-10 occurs. Subsequently, the logs display "Discovered new
> partitions:", and only then does the consumption of data from partition-10
> recommence.
>
> Could you provide any insights or hypotheses regarding the underlying cause
> of this delayed recognition and processing of certain partitions?
>
> Best regards,
> Yang
>
> On Mon, 8 Jan 2024 at 16:24, Yang LI  wrote:
>
> > Dear Flink Community,
> >
> > I've encountered an issue during the testing of my Flink autoscaler. It
> > appears that Flink is losing track of specific Kafka partitions, leading
> to
> > a persistent increase in lag on these partitions. The logs indicate a
> > 'kafka connector metricGroup name collision exception.' Notably, the
> > consumption on these Kafka partitions returns to normal after restarting
> > the Kafka broker. For context, I have enabled in-place rescaling support
> > with 'jobmanager.scheduler: Adaptive.'
> >
> > I suspect the problem may stem from:
> >
> > 1. The in-place rescaling support triggering restarts of some taskmanagers.
> > This process might not be restarting the metric groups registered by the
> > Kafka source connector correctly, leading to a name collision exception and
> > preventing Flink from accurately reporting metrics related to Kafka
> > consumption.
> > 2. A potential edge case in the metric for pending records, especially when
> > different partitions exhibit varying lags. This discrepancy might be
> > causing the pending record metric to malfunction.
> > I would appreciate your insights on these observations.
> >
> > Best regards,
> > Yang LI
> >
>


Re: [VOTE] FLIP-372: Allow TwoPhaseCommittingSink WithPreCommitTopology to alter the type of the Committable

2023-12-18 Thread Gyula Fóra
+1 (binding)

Gyula

On Mon, 18 Dec 2023 at 13:04, Márton Balassi 
wrote:

> +1 (binding)
>
> On Mon 18. Dec 2023 at 09:34, Péter Váry 
> wrote:
>
> > Hi everyone,
> >
> > Since there were no further comments on the discussion thread [1], I
> would
> > like to start the vote for FLIP-372 [2].
> >
> > The FLIP started as a small new feature, but in the discussion thread and
> > in a similar parallel thread [3] we opted for a somewhat bigger change in
> > the Sink V2 API.
> >
> > Please read the FLIP and cast your vote.
> >
> > The vote will remain open for at least 72 hours and only concluded if
> there
> > are no objections and enough (i.e. at least 3) binding votes.
> >
> > Thanks,
> > Peter
> >
> > [1] - https://lists.apache.org/thread/344pzbrqbbb4w0sfj67km25msp7hxlyd
> > [2] -
> >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-372%3A+Allow+TwoPhaseCommittingSink+WithPreCommitTopology+to+alter+the+type+of+the+Committable
> > [3] - https://lists.apache.org/thread/h6nkgth838dlh5s90sd95zd6hlsxwx57
> >
>


Re: [DISCUSS] Should Configuration support getting value based on String key?

2023-12-14 Thread Gyula Fóra
I see a strong value for user facing configs to use ConfigOption and this
should definitely be an enforced convention.

However, with the Flink project growing and many other components and even
users using the Configuration object, I really don’t think that we should
“force” this on the users/developers.

If we make fromMap / toMap free with basically no overhead, that is fine,
but otherwise I think it would hurt the user experience to remove the
simple getters / setters. Temporary ConfigOptions to access strings from
what is practically a string->string map are exactly the kind of unnecessary
boilerplate that every dev and user wants to avoid.

Gyula

There are many cases where the features of the ConfigOption are really not
needed.

On Thu, 14 Dec 2023 at 09:38, Xintong Song  wrote:

> Hi Gyula,
>
> First of all, even if we remove the `getXXX(String key, XXX defaultValue)`
> methods, there are still several ways to access the configuration with
> string-keys.
>
> - If one wants to access a specific option, as Rui mentioned,
>   `ConfigOptions.key("xxx").stringType().noDefaultValue()` can be used.
>   TBH, I can't think of a use case where a temporarily created ConfigOption
>   is preferred over a predefined one. Do you have any examples for that?
> - If one wants to access the whole configuration set, then `toMap` or
>   `iterator` might be helpful.
>
> It is true that these ways are less convenient than `getXXX(String key, XXX
> defaultValue)`, and that's exactly my purpose, to make the key-string less
> convenient so that developers choose ConfigOption over it whenever is
> possible.
>
> there will always be cases where a more flexible
> > dynamic handling is necessary without the added overhead of the toMap
> logic
> >
>
> I'm not sure about this. I agree there are cases where flexible and dynamic
> handling is needed, but maybe "without the added overhead of the toMap
> logic" is not that necessary?
>
> I'd think of this as "encouraging developers to use ConfigOption as much as
> possible" vs. "a bit less convenient in 5% of the cases". I guess there's
> no right and wrong, just different engineer opinions. While I'm personally
> stand with removing the string-key access methods, I'd also be fine with
> the other way if there are more people in favor of it.
>
> Best,
>
> Xintong
>
>
>
> On Thu, Dec 14, 2023 at 3:45 PM Gyula Fóra  wrote:
>
> > Hi Xintong,
> >
> > I don’t really see the actual practical benefit from removing the
> getstring
> > and setstring low level methods.
> >
> > I understand that ConfigOptions are nicer for 95% of the cases but from a
> > technical point of view there will always be cases where a more flexible
> > dynamic handling is necessary without the added overhead of the toMap
> > logic.
> >
> > I think it’s the most natural thing for any config abstraction to expose
> > basic get set methods with a simple key.
> >
> > What do you think?
> >
> > Cheers
> > Gyula
> >
> > On Thu, 14 Dec 2023 at 08:00, Xintong Song 
> wrote:
> >
> > > >
> > > > IIUC, you both prefer using ConfigOption instead of string keys for
> > > > all use cases, even internal ones. We can even forcefully delete
> > > > these @Deprecated methods in Flink-2.0 to guide users or
> > > > developers to use ConfigOption.
> > > >
> > >
> > > Yes, at least from my side.
> > >
> > >
> > > I noticed that Configuration is used in
> > > > DistributedCache#writeFileInfoToConfig and readFileInfoFromConfig
> > > > to store some cacheFile meta-information. Their keys are
> > > > temporary(key name with number) and it is not convenient
> > > > to predefine ConfigOption.
> > > >
> > >
> > > True, this one requires a bit more effort to migrate from string-key to
> > > ConfigOption, but still should be doable. Looking at how the two
> > mentioned
> > > methods are implemented and used, it seems what is really needed is
> > > serialization and deserialization of `DistributedCacheEntry`-s. And all
> > the
> > > entries are always written / read at once. So I think we can serialize
> > the
> > > whole set of entries into a JSON string (or something similar), and use
> > one
> > > ConfigOption with a deterministic key for it, rather than having one
> > > ConfigOption for each field in each entry. WDYT?
> > >
> > >
> > > If everyone agrees with this direction, we can start to refactor all
> > > > code that use

Re: [DISCUSS] Should Configuration support getting value based on String key?

2023-12-13 Thread Gyula Fóra
Hi Xintong,

I don’t really see the actual practical benefit from removing the getstring
and setstring low level methods.

I understand that ConfigOptions are nicer for 95% of the cases but from a
technical point of view there will always be cases where a more flexible
dynamic handling is necessary without the added overhead of the toMap logic.

I think it’s the most natural thing for any config abstraction to expose
basic get set methods with a simple key.

What do you think?

Cheers
Gyula
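For readers following the thread, the two access styles being compared look roughly like this. The classes below are a self-contained toy stand-in, not Flink's actual `Configuration`/`ConfigOption` API — they only illustrate raw string-key access versus typed-option access over what is practically a string->string map:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

public class ConfigStyles {
    // Toy stand-in for a ConfigOption: key + default value + value parser.
    static final class Option<T> {
        final String key;
        final T defaultValue;
        final Function<String, T> parser;
        Option(String key, T defaultValue, Function<String, T> parser) {
            this.key = key; this.defaultValue = defaultValue; this.parser = parser;
        }
    }

    // Toy stand-in for Configuration: backed by a string->string map.
    static final class Conf {
        private final Map<String, String> data = new HashMap<>();
        void setString(String key, String value) { data.put(key, value); }
        // Raw string-key access: the style whose removal is being debated.
        String getString(String key, String dflt) { return data.getOrDefault(key, dflt); }
        // Typed-option access: the style the discussion encourages.
        <T> T get(Option<T> opt) {
            return Optional.ofNullable(data.get(opt.key))
                    .map(opt.parser)
                    .orElse(opt.defaultValue);
        }
    }

    public static void main(String[] args) {
        Conf conf = new Conf();
        conf.setString("parallelism", "8");
        Option<Integer> parallelism = new Option<>("parallelism", 1, Integer::parseInt);
        System.out.println(conf.get(parallelism));               // typed access
        System.out.println(conf.getString("parallelism", "1"));  // raw access
    }
}
```

The typed path carries the value type and default with the option definition; the raw path is shorter to write but repeats the key and default at every call site — which is the trade-off debated above.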

On Thu, 14 Dec 2023 at 08:00, Xintong Song  wrote:

> >
> > IIUC, you both prefer using ConfigOption instead of string keys for
> > all use cases, even internal ones. We can even forcefully delete
> > these @Deprecated methods in Flink-2.0 to guide users or
> > developers to use ConfigOption.
> >
>
> Yes, at least from my side.
>
>
> I noticed that Configuration is used in
> > DistributedCache#writeFileInfoToConfig and readFileInfoFromConfig
> > to store some cacheFile meta-information. Their keys are
> > temporary (key names with numbers) and it is not convenient
> > to predefine ConfigOption.
> >
>
> True, this one requires a bit more effort to migrate from string-key to
> ConfigOption, but still should be doable. Looking at how the two mentioned
> methods are implemented and used, it seems what is really needed is
> serialization and deserialization of `DistributedCacheEntry`-s. And all the
> entries are always written / read at once. So I think we can serialize the
> whole set of entries into a JSON string (or something similar), and use one
> ConfigOption with a deterministic key for it, rather than having one
> ConfigOption for each field in each entry. WDYT?
>
>
> If everyone agrees with this direction, we can start to refactor all
> > code that uses getXxx(String key, String defaultValue) into
> > getXxx(ConfigOption configOption), and completely
> > delete all getXxx(String key, String defaultValue) in 2.0.
> > And I'm willing to pick it up~
> >
>
> Exactly. Actually, Xuannan and a few other colleagues are working on
> reviewing all the existing configuration options. We identified some common
> issues that can potentially be improved, and not using string-key is one of
> them. We are still summarizing the findings, and will bring them to the
> community for discussion once ready. Helping hands is definitely welcomed.
> :)
>
>
> Yeah, `toMap` can solve the problem, and I also mentioned it in the
> > initial mail.
> >
>
> True. Sorry I overlooked it.
>
>
> Best,
>
> Xintong
>
>
>
> On Thu, Dec 14, 2023 at 2:14 PM Rui Fan <1996fan...@gmail.com> wrote:
>
> > Thanks Xintong and Xuannan for the feedback!
> >
> > IIUC, you both prefer using ConfigOption instead of string keys for
> > all use cases, even internal ones. We can even forcefully delete
> > these @Deprecated methods in Flink-2.0 to guide users or
> > developers to use ConfigOption.
> >
> > Using ConfigOption is feasible in all scenarios because ConfigOption
> > can be easily created via
> > `ConfigOptions.key("xxx").stringType().noDefaultValue()` even if
> > there is no predefined ConfigOption.
> >
> > I noticed that Configuration is used in
> > DistributedCache#writeFileInfoToConfig and readFileInfoFromConfig
> > to store some cacheFile meta-information. Their keys are
> > temporary (key names with numbers) and it is not convenient
> > to predefine ConfigOption.
> >
> > If everyone agrees with this direction, we can start to refactor all
> > code that uses getXxx(String key, String defaultValue) into
> > getXxx(ConfigOption configOption), and completely
> > delete all getXxx(String key, String defaultValue) in 2.0.
> > And I'm willing to pick it up~
> >
> > To Xintong:
> >
> > > I think a `toMap` as suggested by Zhu and Xuannan should solve the
> > > problem. Alternatively, we may also consider providing an iterator for
> > the
> > > Configuration.
> >
> > Yeah, `toMap` can solve the problem, and I also mentioned it in the
> > initial mail.
> >
> > Also, providing an iterator is fine, but we don't have a strong
> > requirement for now. Simple is better, how about considering it
> > if we have other strong requirements in the future?
> >
> > Looking forwarding to your feedback, thanks~
> >
> > Best,
> > Rui
> >
> > On Thu, Dec 14, 2023 at 11:36 AM Xintong Song 
> > wrote:
> >
> >> I'm leaning towards not allowing string-key based configuration access
> in
> >> the long term.
> >>
> >> I see the Configuration being used in two different ways:
> >>
> >> 1. Writing / reading a specific option. In such cases, ConfigOption has
> >> many advantages compared to string-key, such as explicit value type,
> >> descriptions, default values, deprecated / fallback keys. I think we
> >> should
> >> encourage, and maybe even enforce, choosing ConfigOption over
> string-keys
> >> in such specific option access scenarios. That also applies to internal
> >> usages, for which the description may not be necessary because we won't
> >> generate documentation from it but we can still 

Re: [DISCUSS] Allow TwoPhaseCommittingSink WithPreCommitTopology to alter the type of the Committable

2023-12-13 Thread Gyula Fóra
Thank you Peter Vary for updating the FLIP based on the discussions.
I really like the improvements introduced by the mixin interfaces which now
aligns much better with the source and table connectors.

While this introduces some breaking changes to the existing connectors,
this is a technical debt that we need to resolve as soon as possible and
fully before 2.0.

+1 from my side.

I am cc'ing some folks participating in the other threads, sorry about that
:)

Cheers,
Gyula

On Wed, Dec 13, 2023 at 4:14 PM Péter Váry 
wrote:

> I have updated the FLIP-372 [1] based on the comments from multiple
> sources. Moved to the mixin approach as this seems to be the consensus
> based on this thread [2].
> Also created a draft implementation PR [3], so I can test the changes and
> default implementations (no new tests yet).
> Please provide your feedback, so I can address your questions, comments and
> then we can move forward to voting.
>
> Thanks,
> Peter
>
> [1] -
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-372%3A+Allow+TwoPhaseCommittingSink+WithPreCommitTopology+to+alter+the+type+of+the+Committable
> [2] - https://lists.apache.org/thread/h6nkgth838dlh5s90sd95zd6hlsxwx57
> [3] - https://github.com/apache/flink/pull/23912
>
> Péter Váry wrote (on Mon, Dec 11, 2023 at 14:28):
>
> > We identified another issue with the current Sink API in a thread [1]
> > related to FLIP-371 [2]. Currently it is not possible to evolve the
> > Sink.createWriter method with deprecation, because StatefulSink and
> > TwoPhaseCommittingSink has methods with the same name and parameters, but
> > narrowed down return type (StatefulSinkWriter, PrecommittingSinkWriter).
> >
> > To make the Sink API evolvable, we minimally have to remove these.
> >
> > The participants there also pointed out, that the Source API also uses
> > mixin interfaces (SupportsHandleExecutionAttemptSourceEvent,
> > SupportsIntermediateNoMoreSplits) in some cases. My observation is that
> it
> > has inheritance as well (ReaderOutput, ExternallyInducedSourceReader)
> >
> > I have created a draft API along these lines in a branch [3] where only
> > the last commit is relevant [4]. This implementation would follow the
> same
> > patterns as the current Source API.
> >
> > I see two different general approaches here, and I would like to hear
> your
> > preferences:
> > - Keep the changes minimal, stick to the current Sink API design. We
> > introduce the required new combination of interfaces
> > (TwoPhaseCommttingSinkWithPreCommitTopology,
> > WithPostCommitTopologyWithPreCommitTopology), and do not change the API
> > structure.
> >  - Pros:
> >   - Minimal change - smaller rewrite on the connector side
> >   - Type checks happen on compile time
> >  - Cons:
> >   - Harder to evolve
> >   - The number of interfaces increases with the possible
> > combinations
> >   - Inconsistent API patterns wrt Source API - harder for
> > developers to understand
> > - Migrate to a model similar to the Source API. We create mixin
> interfaces
> > for SupportsCommitter, SupportsWriterState, SupportsPreCommitTopology,
> > SupportsPostCommitTopology, SupportsPreWriteTopology.
> > - Pros:
> > - Better evolvability
> > - Consistent with the Source API
> > - Cons:
> > - The connectors need to change their inheritance patterns (after
> > the deprecation period) if they are using any of the more complicated
> Sinks.
> > - Type checks need to use `instanceof`, which could happen at DAG
> > generation time. Also, if the developer fails to correctly match the
> > generic types on the mixin interfaces, the error will only surface during
> > execution time - when the job tries to process the first record
> >
> > I personally prefer the Mixin approach for easier evolvability and better
> > consistency, but I would like to hear your thoughts, so I can flesh out
> the
> > chosen approach in FLIP-372
> >
> > Thanks,
> > Peter
> >
> > [1] - https://lists.apache.org/thread/h6nkgth838dlh5s90sd95zd6hlsxwx57
> > [2] -
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-371%3A+Provide+initialization+context+for+Committer+creation+in+TwoPhaseCommittingSink
> > [3] - https://github.com/pvary/flink/tree/mixin_demo
> > [4] -
> >
> https://github.com/pvary/flink/commit/acfc09f4c846f983f633bbf0c902aea49aa6b156
> >
> >
> > On Fri, Nov 24, 2023, 11:38 Gyula Fóra  wrote:
> >
> >> Hi Peter!
> >>
> >

Re: Flink Kubernetes Operator Observed Generation

2023-12-12 Thread Gyula Fóra
Hi Ryan!

If you think this is useful/required to integrate with some tooling, I
don't really see a problem.
We already have a concept of reconciled generation as part of the
lastReconciledSpec status field but that is semantically a bit different.

Cheers,
Gyula


Re: [VOTE] FLIP-401: REST API JSON response deserialization unknown field tolerance

2023-12-11 Thread Gyula Fóra
+1

Gyula

On Mon, Dec 11, 2023 at 1:26 PM Gabor Somogyi 
wrote:

> Hi All,
>
> I'd like to start a vote on FLIP-401: REST API JSON response
> deserialization unknown field tolerance [1] which has been discussed in
> this thread [2].
>
> The vote will be open for at least 72 hours unless there is an objection or
> not enough votes.
>
> BR,
> G
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-401%3A+REST+API+JSON+response+deserialization+unknown+field+tolerance
> [2] https://lists.apache.org/thread/s52w9cf60d6s10bpzv9qjczpl6m394rz
>


Re: [DISCUSS] FLIP-401: REST API JSON response deserialization unknown field tolerance

2023-12-11 Thread Gyula Fóra
Thanks Gabor!

+1 from my side, this sounds like a reasonable change that will
improve integration and backward compatibility.

Let's keep this open for at least another day before starting the vote.

Cheers,
Gyula

On Thu, Dec 7, 2023 at 10:48 AM Gabor Somogyi 
wrote:

> Hi All,
>
> I'd like to start a discussion of FLIP-401: REST API JSON response
> deserialization unknown field tolerance[1].
>
> *Problem statement:*
>
> At the moment Flink is not ignoring unknown fields when parsing REST
> responses. An example for such a class is *JobDetailsInfo* but this applies
> to all others. It would be good to add this support to increase
> compatibility.
>
> The real life use-case is when the Flink k8s operator wants to handle 2
> jobs with 2 different Flink versions where the newer version has added a
> new field to any REST response. Such case the operator has basically 2
> options:
> * Use the old Flink version -> Such case exception comes because new field
> comes but it's not expected
> * Use the new Flink version -> Such case exception comes because new field
> is not coming but expected
>
> To hack around this issue it requires quite some ugly code parts in the
> operator.
>
> *Proposed solution:*
>
> Ignore all unknown fields in case of REST response JSON deserialization.
> Important to know that strict serialization would stay the same as-is.
>
> Actual object mapper configuration which is intended to be changed can be
> found here[2].
>
> Please share your opinion on this topic.
>
> BR,
> G
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-401%3A+REST+API+JSON+response+deserialization+unknown+field+tolerance
> [2]
>
> https://github.com/apache/flink/blob/3060ccd49cc8d19634b431dbf0f09ac875d0d422/flink-runtime/src/main/java/org/apache/flink/runtime/rest/util/RestMapperUtils.java#L31-L38
>
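
The strict-versus-tolerant distinction at the heart of this FLIP is configured on Flink's Jackson object mapper (linked above), but the behavior is easy to sketch language-agnostically. A minimal Python illustration, with hypothetical field names standing in for a real REST response class:

```python
import json
from dataclasses import dataclass, fields

@dataclass
class JobDetailsInfo:  # trimmed stand-in, not Flink's real schema
    jid: str
    name: str

def strict_load(payload: str) -> JobDetailsInfo:
    # Strict mode: any unknown field raises, analogous to Jackson's
    # FAIL_ON_UNKNOWN_PROPERTIES=true in Flink's current REST mapper.
    data = json.loads(payload)
    unknown = set(data) - {f.name for f in fields(JobDetailsInfo)}
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    return JobDetailsInfo(**data)

def tolerant_load(payload: str) -> JobDetailsInfo:
    # Tolerant mode (the FLIP-401 proposal): silently drop fields that a
    # newer Flink version added to the response.
    data = json.loads(payload)
    known = {f.name for f in fields(JobDetailsInfo)}
    return JobDetailsInfo(**{k: v for k, v in data.items() if k in known})

# A response from a newer Flink that added a "maxParallelism" field:
new_response = '{"jid": "abc", "name": "wordcount", "maxParallelism": 128}'
```

With the strict behavior an older client fails on a response from a newer Flink exactly as described in the thread; the tolerant variant keeps working.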


Re: Kubernetes Operator Managed Ingress with TLS

2023-12-08 Thread Gyula Fóra
Hi Ryan!

There is not really a Flink Kubernetes Operator team :) Everyone is welcome
to contribute and bring new improvement and feature ideas!

Sounds like a valuable addition, I will assign the ticket to you (please
comment on it so I can see your jira handle) so you can follow up with your
PR.
We are happy to help you with reviews, feedback etc.

Cheers,
Gyula


On Fri, Dec 8, 2023 at 4:38 PM Ryan van Huuksloot
 wrote:

> Hi,
>
> Would the Flink Kubernetes Operator team be open to adding TLS to the
> managed ingress? We have migrated to the operator but can't use the managed
> ingress because there is no TLS support. At the same time I'd like to add a
> label pass through mechanism, similar to the annotations.
>
> I've noticed there is an open issue, FLINK-33454, but there has
> been no traction, so I wanted to bring it up.
> Some minor modifications to the issue. I think that `tls:
> List` would be better to support the kubernetes spec. And
> `labels: Map` to support label passthrough.
>
> We have an internal clone of the operator with a version of this being
> tested
> <
> https://github.com/apache/flink-kubernetes-operator/compare/main...ryanvanhuuksloot:flink-kubernetes-operator:rvh.add-ingress-tls-spec
> >,
> it could be a good foundation for the open-source version.
>
> Happy to move the discussion to Jira.
>
> Ryan van Huuksloot
> Sr. Production Engineer | Streaming Platform
>
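
To make the proposal concrete: the suggested `tls` and `labels` fields would mirror the standard Kubernetes Ingress spec. A hypothetical sketch (these fields are a proposal in FLINK-33454, not part of a released operator version):

```yaml
# Hypothetical FlinkDeployment fragment sketching the FLINK-33454 proposal;
# ingress.tls and ingress.labels are NOT released operator fields.
spec:
  ingress:
    template: "flink.example.com"          # assumed host template
    className: nginx
    annotations:
      nginx.ingress.kubernetes.io/rewrite-target: "/"
    labels:                                # proposed label pass-through
      team: streaming-platform
    tls:                                   # proposed; mirrors networking.k8s.io/v1 IngressTLS
      - hosts:
          - flink.example.com
        secretName: flink-ingress-tls
```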


Re: [DISCUSS] REST API response parsing throws exception on new fields

2023-12-07 Thread Gyula Fóra
You should start a new discussion thread according to the FLIP guidelines,
so that it's identifiable that this is a FLIP discussion.

Gyula

On Thu, Dec 7, 2023 at 10:22 AM Gabor Somogyi 
wrote:

> Gyula, thanks for sharing your opinion.
> I've created a small FLIP as asked:
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-401%3A+REST+API+JSON+response+deserialization+unknown+field+tolerance
>
> Let's wait for opinions until the beginning of next week.
>
> G
>
>
> On Wed, Dec 6, 2023 at 11:05 AM Gyula Fóra  wrote:
>
> > Thanks G, I think this is a reasonable proposal which will increase
> > compatibility between different Flink clients and versions such as SQL
> > Gateway, CLI, Flink operator etc.
> >
> > I don't really see any harm in ignoring unknown json fields globally, but
> > this probably warrants a FLIP and a proper vote.
> >
> > Cheers,
> > Gyula
> >
> > On Wed, Dec 6, 2023 at 10:53 AM Gabor Somogyi  >
> > wrote:
> >
> > > Hi All,
> > >
> > > Since the possible solution can affect all REST response
> > > deserialization, I would like to ask for opinions.
> > >
> > > *Problem statement:*
> > >
> > > At the moment Flink does not ignore unknown fields when parsing REST
> > > responses. An example of such a class is JobDetailsInfo, but this
> > > applies to all the others. It would be good to add this support to
> > > increase compatibility.
> > >
> > > The real-life use case is when the Flink k8s operator wants to handle 2
> > > jobs with 2 different Flink versions, where the newer version has added
> > > a new field to some REST response. In such a case the operator basically
> > > has 2 options:
> > > * Use the old Flink version -> an exception occurs because a new field
> > > arrives but is not expected
> > > * Use the new Flink version -> an exception occurs because an expected
> > > new field does not arrive
> > >
> > > Hacking around this issue requires quite some ugly code parts in the
> > > operator.
> > >
> > > The mentioned issue is tracked here:
> > > https://issues.apache.org/jira/browse/FLINK-33268
> > >
> > > *Proposed solution:*
> > >
> > > Ignore all unknown fields in case of REST response JSON
> deserialization.
> > > Important to know that strict serialization would stay the same as-is.
> > >
> > > Actual object mapper configuration can be found here:
> > >
> > >
> >
> https://github.com/apache/flink/blob/3060ccd49cc8d19634b431dbf0f09ac875d0d422/flink-runtime/src/main/java/org/apache/flink/runtime/rest/util/RestMapperUtils.java#L31-L38
> > >
> > > Please share your opinion on this topic.
> > >
> > > BR,
> > > G
> > >
> >
>


Re: [DISCUSSION] Consider Flink operator having a way to monitor the status of bounded streaming jobs after they finish or error?

2023-12-07 Thread Gyula Fóra
This config has nothing to do with the operator (it's a core flink feature)
and is not an issue after Flink 1.15.
Newer operator versions (1.7+) drop support for Flink 1.13 and 1.14 as it's
not feasible to maintain too many legacy codepaths.

The only solution for you is to update your Flink versions, you are missing
out on so many improvements.

Gyula

On Thu, Dec 7, 2023 at 9:32 AM richard.su  wrote:

> Hi Gyula, our Flink version is 1.14.
> Our Flink version is hard to upgrade since we have users depending on our
> platform.
> Sorry, I had not noticed this configuration; it's confusing because the
> Flink operator announces support for Flink 1.13 to 1.17/1.18.
>
> Is there another solution that would work in our situation?
>
> Thanks
> Richard Su
>
> > On Dec 7, 2023, at 16:22, Gyula Fóra wrote:
> >
> > Hi!
> >
> > What Flink version are you using?
> > The operator always sets: execution.shutdown-on-application-finish to
> false
> > so that finished / failed application clusters should not exit
> immediately
> > and we can observe them.
> >
> > This is however only available in Flink 1.15 and above.
> >
> > Cheers,
> > Gyula
> >
> > On Thu, Dec 7, 2023 at 9:15 AM richard.su 
> wrote:
> >
> >> Hi Community, I have found this issue, but I'm not sure it has any
> >> solution. I have tried Flink operator 1.6, where this issue still
> >> exists.
> >>
> >> If not, I think we could create a Jira issue to follow up.
> >>
> >> When we create a bounded streaming job that will eventually reach
> >> Finished status, once the job goes from Running to Finished, Flink shuts
> >> down the Kubernetes cluster: in the flink-kubernetes package, the
> >> KubernetesResourceManagerDriver's deregisterApplication method deletes
> >> the JM deployment directly within a second (in our environment).
> >> But with our operator config, when the JM deployment status is Ready and
> >> no savepoint is in progress, the observe interval is 15s, which means
> >> the operator will never observe the job status change.
> >> So if the job failed rather than finished, we cannot distinguish this.
> >> All we know is that the JM deployment is Missing and the job status is
> >> Reconciling.
> >> We want to integrate the Flink operator into our platform, but it cannot
> >> monitor the real job status, which is weird.
> >>
> >> Maybe it is also related to the cleanup logic of Flink native mode; from
> >> my side, the operator is hard-pressed to deal with such a situation
> >> because we cannot directly get the exit code of the container when the
> >> pod is missing and the JM deployment is missing.
> >>
> >> Thanks for your time reading this issue.
> >> Richard Su
> >>>
> >>> On Dec 6, 2023, at 13:34, richard.su wrote:
> >>>
> >>> For more information to produce this problem,
> >>>
> >>> version: flink operator 1.4
> >>> mode: native
> >>> job: wordcount
> >>> language: java
> >>> type: FlinkDeployment
> >>>
> >>>> On Dec 6, 2023, at 10:52, richard.su wrote:
> >>>>
> >>>> Hi Community, the default configuration of flink operator is:
> >>>>
> >>>> kubernetes.operator.reconcile.interval: 15s
> >>>> kubernetes.operator.observer.progress-check.interval: 5s
> >>>>
> >>>> when a bounded streaming job has already stopped or errored, the JM
> >>>> deployment stays missing; if I set the configuration:
> >>>>
> >>>> kubernetes.operator.jm-deployment-recover.enabled: false
> >>>>
> >>>> then the Flink operator can only observe the job status as Reconciling
> >>>> and the JM deployment status as Missing
> >>>>
> >>>> we cannot check whether the Flink job finished or errored, because
> >>>> within the observer.progress-check interval the Flink web UI is
> >>>> already down.
> >>>>
> >>>> so we hope someone in the community could show a way to monitor a
> >>>> bounded streaming job's status.
> >>>>
> >>>> Thanks.
> >>>>
> >>>> Richard Su
> >>>
> >>
> >>
>
>


Re: [DISCUSSION] Consider Flink operator having a way to monitor the status of bounded streaming jobs after they finish or error?

2023-12-07 Thread Gyula Fóra
Hi!

What Flink version are you using?
The operator always sets: execution.shutdown-on-application-finish to false
so that finished / failed application clusters should not exit immediately
and we can observe them.

This is however only available in Flink 1.15 and above.

Cheers,
Gyula

On Thu, Dec 7, 2023 at 9:15 AM richard.su  wrote:

> Hi Community, I have found this issue, but I'm not sure it has any
> solution. I have tried Flink operator 1.6, where this issue still exists.
>
> If not, I think we could create a Jira issue to follow up.
>
> When we create a bounded streaming job that will eventually reach Finished
> status, once the job goes from Running to Finished, Flink shuts down the
> Kubernetes cluster: in the flink-kubernetes package, the
> KubernetesResourceManagerDriver's deregisterApplication method deletes the
> JM deployment directly within a second (in our environment).
> But with our operator config, when the JM deployment status is Ready and
> no savepoint is in progress, the observe interval is 15s, which means the
> operator will never observe the job status change.
> So if the job failed rather than finished, we cannot distinguish this. All
> we know is that the JM deployment is Missing and the job status is
> Reconciling.
> We want to integrate the Flink operator into our platform, but it cannot
> monitor the real job status, which is weird.
>
> Maybe it is also related to the cleanup logic of Flink native mode; from
> my side, the operator is hard-pressed to deal with such a situation
> because we cannot directly get the exit code of the container when the pod
> is missing and the JM deployment is missing.
>
> Thanks for your time reading this issue.
> Richard Su
> >
> > On Dec 6, 2023, at 13:34, richard.su wrote:
> >
> > For more information to produce this problem,
> >
> > version: flink operator 1.4
> > mode: native
> > job: wordcount
> > language: java
> > type: FlinkDeployment
> >
> >> On Dec 6, 2023, at 10:52, richard.su wrote:
> >>
> >> Hi Community, the default configuration of flink operator is:
> >>
> >> kubernetes.operator.reconcile.interval: 15s
> >> kubernetes.operator.observer.progress-check.interval: 5s
> >>
> >> when a bounded streaming job has already stopped or errored, the JM
> >> deployment stays missing; if I set the configuration:
> >>
> >> kubernetes.operator.jm-deployment-recover.enabled: false
> >>
> >> then the Flink operator can only observe the job status as Reconciling
> >> and the JM deployment status as Missing
> >>
> >> we cannot check whether the Flink job finished or errored, because
> >> within the observer.progress-check interval the Flink web UI is already down.
> >>
> >> so we hope someone in the community could show a way to monitor a
> >> bounded streaming job's status.
> >>
> >> Thanks.
> >>
> >> Richard Su
> >
>
>
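
For reference, the setting Gyula mentions is a plain Flink option (available in 1.15+), so it can also be set explicitly when running application clusters without the operator; a flink-conf.yaml fragment:

```yaml
# Keep the application cluster alive after the job reaches a terminal state
# (FINISHED or FAILED), so the final job status can still be observed via
# the REST API. The operator sets this automatically on Flink 1.15+.
execution.shutdown-on-application-finish: false
```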


Re: Flink Operator - Supporting Recovery from Snapshot

2023-12-06 Thread Gyula Fóra
Hi All!

Based on some continuous feedback and experience, we feel that it may be a
good time to introduce this functionality in a way that doesn't
accidentally affect existing users in an unexpected way.

Please see: https://issues.apache.org/jira/browse/FLINK-33763 for details
and review.

Cheers,
Gyula

On Fri, Feb 10, 2023 at 7:27 PM Kevin Lam 
wrote:

> Hey Yaroslav!
>
> Awesome, good to know that approach works well for you. I think our plan as
> of now is to do the same--delete the current FlinkDeployment when deploying
> from a specific snapshot. It'll be a separate workflow from normal
> deployments to take advantage of the operator otherwise.
>
> Thanks!
>
> On Fri, Feb 10, 2023 at 12:23 PM Yaroslav Tkachenko
>  wrote:
>
> > Hi Kevin!
> >
> > In my case, I automated this workflow by first deleting the current Flink
> > deployment and then creating a new one. So, if the initialSavepointPath
> is
> > different it'll use it for recovery.
> >
> > This approach is indeed irreversible, but so far it's been working well.
> >
> > On Fri, Feb 10, 2023 at 8:17 AM Kevin Lam  >
> > wrote:
> >
> > > Thanks for the response Gyula! Those caveats make sense, and I see,
> > there's
> > > a bit of a complexity to consider if the feature is implemented. I do
> > think
> > > it would be useful, so would also love to hear what others think!
> > >
> > > On Wed, Feb 8, 2023 at 3:47 AM Gyula Fóra 
> wrote:
> > >
> > > > Hi Kevin!
> > > >
> > > > Thanks for starting this discussion.
> > > >
> > > > On a high level what you are proposing is quite simple: if the
> initial
> > > > savepoint path changes we use that for the upgrade.
> > > >
> > > > I see a few caveats here that may be important:
> > > >
> > > >  1. To use a new savepoint/checkpoint path for recovery we have to
> stop
> > > the
> > > > job and delete all HA metadata. This means that this operation may
> not
> > be
> > > > "reversible" in some cases because we lose the checkpoint info with
> the
> > > HA
> > > > metadata (unless we force a savepoint on shutdown).
> > > >  2. This will break the current upgrade/checkpoint ownership model in
> > > which
> > > > the operator controls the checkpoints and ensures that you always get
> > the
> > > > latest (or an error). It will also make the reconciliation logic more
> > > > complex
> > > >  3. This could be a breaking change for current users (if for some
> > reason
> > > > they rely on the current behaviour, which is weird but still true)
> > > >  4. The name initialSavepointPath becomes a bit misleading
> > > >
> > > > I agree that it would be nice to make this easier for the user, but
> the
> > > > question is whether what we gain by this is worth the extra
> complexity.
> > > > I think under normal circumstances the user does not really want to
> > > > suddenly redeploy the job starting from a new state. If that happens
> I
> > > > think it makes sense to create a new deployment resource and it's
> not a
> > > > very big overhead.
> > > >
> > > > Currently when "manual" recovery is needed are cases when the
> operator
> > > > loses track of the latest checkpoint, mostly due to "incorrect" error
> > > > handling on the Flink side that also deletes the HA metadata. I think
> > we
> > > > should strive to improve and eliminate most of these cases (as we
> have
> > > > already done for many of these problems).
> > > >
> > > > Would be great to hear what others think about this topic!
> > > >
> > > > Cheers,
> > > > Gyula
> > > >
> > > > On Tue, Feb 7, 2023 at 10:43 PM Kevin Lam
> >  > > >
> > > > wrote:
> > > >
> > > > > Hello,
> > > > >
> > > > > I was reading the Flink Kubernetes Operator documentation and
> noticed
> > > > that
> > > > > if you want to redeploy a Flink job from a specific snapshot, you
> > must
> > > > > follow these manual recovery steps. Are there plans to streamline
> > this
> > > > > process? Deploying from a specific snapshot is a relatively common
> > > > > operation and it'd be nice to not need to delete the
> FlinkDeployment
> > > > >
> > > > > I wonder if the Flink Operator could use the initialSavepointPath
> > > similar
> > > > > to the restartNonce and savepointTriggerNonce parameters, where if
> > > > > initialSavepointPath changes, the deployed job is restored from the
> > > > > specified savepoint. Any thoughts?
> > > > >
> > > > > Thanks!
> > > > >
> > > >
> > >
> >
>
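
For readers following along: the field discussed in this thread lives under `spec.job` of the FlinkDeployment. The manual recovery flow described above (delete the resource, then re-create it pointing at a chosen snapshot) looks roughly like this sketch, with a hypothetical jar URI and savepoint path:

```yaml
# Sketch: FlinkDeployment fragment for re-deploying from a specific savepoint.
# As discussed above, initialSavepointPath is only honored for the initial
# deployment, hence the delete-and-recreate workflow.
spec:
  job:
    jarURI: local:///opt/flink/usrlib/my-job.jar                      # hypothetical
    upgradeMode: savepoint
    initialSavepointPath: s3://my-bucket/savepoints/savepoint-abc123  # hypothetical
    state: running
```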


Re: [DISCUSS] REST API response parsing throws exception on new fields

2023-12-06 Thread Gyula Fóra
Thanks G, I think this is a reasonable proposal which will increase
compatibility between different Flink clients and versions such as SQL
Gateway, CLI, Flink operator etc.

I don't really see any harm in ignoring unknown json fields globally, but
this probably warrants a FLIP and a proper vote.

Cheers,
Gyula

On Wed, Dec 6, 2023 at 10:53 AM Gabor Somogyi 
wrote:

> Hi All,
>
> Since the possible solution can affect all REST response
> deserialization, I would like to ask for opinions.
>
> *Problem statement:*
>
> At the moment Flink does not ignore unknown fields when parsing REST
> responses. An example of such a class is JobDetailsInfo, but this applies
> to all the others. It would be good to add this support to increase
> compatibility.
>
> The real-life use case is when the Flink k8s operator wants to handle 2
> jobs with 2 different Flink versions, where the newer version has added a
> new field to some REST response. In such a case the operator basically has
> 2 options:
> * Use the old Flink version -> an exception occurs because a new field
> arrives but is not expected
> * Use the new Flink version -> an exception occurs because an expected new
> field does not arrive
>
> Hacking around this issue requires quite some ugly code in the
> operator.
>
> The mentioned issue is tracked here:
> https://issues.apache.org/jira/browse/FLINK-33268
>
> *Proposed solution:*
>
> Ignore all unknown fields in case of REST response JSON deserialization.
> Important to know that strict serialization would stay the same as-is.
>
> Actual object mapper configuration can be found here:
>
> https://github.com/apache/flink/blob/3060ccd49cc8d19634b431dbf0f09ac875d0d422/flink-runtime/src/main/java/org/apache/flink/runtime/rest/util/RestMapperUtils.java#L31-L38
>
> Please share your opinion on this topic.
>
> BR,
> G
>


Re: Discussion: [FLINK-24150] Support to configure cpu resource request and limit in pod template

2023-12-05 Thread Gyula Fóra
I understand your problem but I think you are trying to find a solution in
the wrong place.
Have you tried setting taskmanager.memory.jvm-overhead.fraction ? That
would reserve more memory from the total process memory for non-JVM use.

Gyula

On Tue, Dec 5, 2023 at 1:50 PM richard.su  wrote:

> Sorry, "To be clear, we need a container has memory larger than request,
> and confirm this pod has Guarantee Qos." needs to be "To be clear, we
> need a container has memory larger than process.size, and confirm this pod
> has Guarantee Qos."
>
> Thanks.
>
> Richard Su
>
>
> > On Dec 5, 2023, at 20:47, richard.su wrote:
> >
> > Hi Gyula, yes, this is a special case in our scenario; sorry that it's
> > hard to understand. We want to reserve some memory beyond the JobManager
> > or TaskManager process. To be clear, we need a container with memory
> > larger than the request, and to confirm that this pod has Guaranteed QoS.
> >
> > This is because we encounter the glibc problem inside the container with
> > Flink jobs using RocksDB; reserved memory helps to ease this problem.
> >
> > So I hope the container's resource request can be decoupled from the
> > Flink configuration.
> >
> > With Flink's current implementation, this cannot be done.
> >
> > Thanks.
> >
> > Richard Su
> >
> >> On Dec 5, 2023, at 20:28, Gyula Fóra wrote:
> >>
> >> Richard, I still don't understand why the current setup doesn't work for
> >> you. According to
> >>
> https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/deployment/memory/mem_setup/
> >> :
> >>
> >> The process memory config (which is what we configure) translates
> directly
> >> into the container request size. With the new proposal you can set the
> >> limit independently.
> >>
> >> What you write doesn't make sense to me:
> >> "user wants to define a flinkdeployment with jobmanager has 1G memory
> >> resources in container field but config jobmanager.memory.process.size
> as
> >> 850m"
> >>
> >> If you want to have a 1G container you set the memory request
> >> (process.size) in the spec simply  to 1G. Then you have 1G, there are
> other
> >> configs on how this 1G will be split inside the container for various
> >> purposes but these are all covered in detail by the flink memory
> configs.
> >>
> >> Cheers
> >> Gyula
> >>
> >> On Tue, Dec 5, 2023 at 1:06 PM richard.su 
> wrote:
> >>
> >>> I think the new configuration could be:
> >>>
> >>> "kubernetes.taskmanager.memory.amount" and
> >>> "kubernetes.jobmanager.memory.amount"
> >>>
> >>> once we can calculate the limit-factor from the difference between
> >>> requests and limits.
> >>>
> >>> in native mode, we would no longer take process.size as the default
> >>> memory, but use this configuration for the decoupling logic.
> >>>
> >>> Thanks
> >>>
> >>> Richard Su
> >>>
> >>>> On Dec 5, 2023, at 19:22, richard.su wrote:
> >>>>
> >>>> Hi Gyula, in my opinion this will still use the FlinkDeployment's
> >>>> resource field to set jobmanager.memory.process.size, and I have
> >>>> mentioned an uncovered case:
> >>>>
> >>>> When a user wants to define a FlinkDeployment whose JobManager has 1G
> >>>> of memory in the container resources field but configures
> >>>> jobmanager.memory.process.size as 850m, this solution only makes the
> >>>> config more intuitive and easier; it does not decouple the container
> >>>> resources from the Flink configuration.
> >>>>
> >>>> So from my side, I think it needs a new configuration to support this
> >>>> proposal, and it needs more discussion.
> >>>>
> >>>> Thanks
> >>>> Chaoran Su
> >>>>
> >>>>
> >>>>> On Dec 5, 2023, at 18:28, Gyula Fóra wrote:
> >>>>>
> >>>>> This is the proposal according to FLINK-33548:
> >>>>>
> >>>>> spec:
> >>>>> taskManager:
> >>>>> resources:
> >>>>>   requests:
> >>>>> memory: "64Mi"
> >>>>> cpu: "250m"
> >>>>>   limits:
> >>>>> memory: "128Mi"
> >>>>
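
The option Gyula suggests above reserves a slice of the total process memory for non-JVM use (RocksDB native allocations, glibc arenas, and the like); a flink-conf.yaml sketch with illustrative values, not recommendations:

```yaml
taskmanager.memory.process.size: 4g
# JVM overhead is derived as process.size * fraction, clamped to [min, max];
# raising the fraction leaves more headroom for native memory inside the
# container without changing the container request itself.
taskmanager.memory.jvm-overhead.fraction: 0.2   # default is 0.1
taskmanager.memory.jvm-overhead.min: 512mb
taskmanager.memory.jvm-overhead.max: 1gb
```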

Re: Discussion: [FLINK-24150] Support to configure cpu resource request and limit in pod template

2023-12-05 Thread Gyula Fóra
Richard, I still don't understand why the current setup doesn't work for
you. According to
https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/deployment/memory/mem_setup/
:

The process memory config (which is what we configure) translates directly
into the container request size. With the new proposal you can set the
limit independently.

What you write doesn't make sense to me:
"user wants to define a flinkdeployment with jobmanager has 1G memory
resources in container field but config jobmanager.memory.process.size as
850m"

If you want to have a 1G container you set the memory request
(process.size) in the spec simply  to 1G. Then you have 1G, there are other
configs on how this 1G will be split inside the container for various
purposes but these are all covered in detail by the flink memory configs.

Cheers
Gyula

On Tue, Dec 5, 2023 at 1:06 PM richard.su  wrote:

> I think the new configuration could be:
>
> "kubernetes.taskmanager.memory.amount" and
> "kubernetes.jobmanager.memory.amount"
>
> once we can calculate the limit-factor from the difference between
> requests and limits.
>
> in native mode, we would no longer take process.size as the default
> memory, but use this configuration for the decoupling logic.
>
> Thanks
>
> Richard Su
>
> > On Dec 5, 2023, at 19:22, richard.su wrote:
> >
> > Hi Gyula, in my opinion this will still use the FlinkDeployment's
> > resource field to set jobmanager.memory.process.size, and I have
> > mentioned an uncovered case:
> >
> > When a user wants to define a FlinkDeployment whose JobManager has 1G of
> > memory in the container resources field but configures
> > jobmanager.memory.process.size as 850m, this solution only makes the
> > config more intuitive and easier; it does not decouple the container
> > resources from the Flink configuration.
> >
> > So from my side, I think it needs a new configuration to support this
> > proposal, and it needs more discussion.
> >
> > Thanks
> > Chaoran Su
> >
> >
> >> On Dec 5, 2023, at 18:28, Gyula Fóra wrote:
> >>
> >> This is the proposal according to FLINK-33548:
> >>
> >> spec:
> >> taskManager:
> >>   resources:
> >> requests:
> >>   memory: "64Mi"
> >>   cpu: "250m"
> >> limits:
> >>   memory: "128Mi"
> >>   cpu: "500m"
> >>
> >> I honestly think this is much more intuitive and easier than using the
> >> podTemplate, which is very complex immediately.
> >> Please tell me what use-case/setup is not covered by this improved spec.
> >>
> >> Unless there is a big limitation here I am still -1 for modifying the
> >> podTemplate logic and +1 for continuing with FLINK-33548
> >>
> >> Gyula
> >>
> >>
> >>
> >> On Tue, Dec 5, 2023 at 11:16 AM Surendra Singh Lilhore <
> >> surendralilh...@gmail.com> wrote:
> >>
> >>> Hi Gyula,
> >>>
> >>> FLINK-33548 proposes adding a new resource field to match with
> Kubernetes
> >>> pod resource configuration. Here's my suggestion: instead of adding a
> new
> >>> resource field, let's use a pod template for more advanced resource
> setup.
> >>> Adding a new resource field might confuse users. This change can also
> help
> >>> with issues when users use Flink Kubernetes commands directly, without
> the
> >>> operator.
> >>>
> >>> Thanks
> >>> Surendra
> >>>
> >>>
> >>> On Tue, Dec 5, 2023 at 3:10 PM richard.su 
> wrote:
> >>>
> >>>> Sorry Gyula, let me explain more about point 2: if I avoid the
> >>>> override, I will still get a JobManager pod with resources consistent
> >>>> with “jobmanager.memory.process.size”, but a FlinkDeployment with a
> >>>> resource larger than that.
> >>>>
> >>>> Thanks for your time.
> >>>> Richard Su
> >>>>
> >>>>> On Dec 5, 2023, at 17:13, richard.su wrote:
> >>>>>
> >>>>> Thank you for your time, Gyula, I have more question about
> Flink-33548,
> >>>> we can have more discussion about this and make progress:
> >>>>>
> >>>>> 1. I agree with you about declaring resources in FlinkDeployment
> >>>> resource sections. But Flink Operator will override the
> >>>> “jobmanager.memory.process.size”  and
> "taskmanager.memory.process.

Re: Discussion: [FLINK-24150] Support to configure cpu resource request and limit in pod template

2023-12-05 Thread Gyula Fóra
This is the proposal according to FLINK-33548:

spec:
  taskManager:
resources:
  requests:
memory: "64Mi"
cpu: "250m"
  limits:
memory: "128Mi"
cpu: "500m"

I honestly think this is much more intuitive and easier than using the
podTemplate, which is very complex immediately.
Please tell me what use-case/setup is not covered by this improved spec.

Unless there is a big limitation here I am still -1 for modifying the
podTemplate logic and +1 for continuing with FLINK-33548

Gyula



On Tue, Dec 5, 2023 at 11:16 AM Surendra Singh Lilhore <
surendralilh...@gmail.com> wrote:

> Hi Gyula,
>
> FLINK-33548 proposes adding a new resource field to match with Kubernetes
> pod resource configuration. Here's my suggestion: instead of adding a new
> resource field, let's use a pod template for more advanced resource setup.
> Adding a new resource field might confuse users. This change can also help
> with issues when users use Flink Kubernetes commands directly, without the
> operator.
>
> Thanks
> Surendra
>
>
> On Tue, Dec 5, 2023 at 3:10 PM richard.su  wrote:
>
> > Sorry Gyula, let me explain more about point 2: if I avoid the override,
> > I will still get a JobManager pod with resources consistent with
> > “jobmanager.memory.process.size”, but a FlinkDeployment with a resource
> > larger than that.
> >
> > Thanks for your time.
> > Richard Su
> >
> > > On Dec 5, 2023, at 17:13, richard.su wrote:
> > >
> > > Thank you for your time, Gyula. I have more questions about FLINK-33548;
> > > we can have more discussion about this and make progress:
> > >
> > > 1. I agree with you about declaring resources in the FlinkDeployment
> > > resource sections. But the Flink operator overrides
> > > “jobmanager.memory.process.size” and "taskmanager.memory.process.size"
> > > whether or not I have set these options in the Flink configuration. If
> > > the user has configured all memory attributes, the override leads to an
> > > error because the overall computation is wrong.
> > >
> > > The override code is in FlinkConfigManager's buildFrom method, which
> > > applies to the JobManagerSpec and TaskManagerSpec.
> > >
> > > 2. If I modify the override code, I will still encounter the issue from
> > > FLINK-24150, because I only modified the operator code, not the
> > > flink-kubernetes package, so I end up with pod resources of (cpu: 1,
> > > memory: 1g) but container resources of (cpu: 1, memory: 850m), because
> > > I already set jobmanager.memory.process.size to 850m.
> > >
> > > 3. Because of these two points, we need to give the podTemplate higher
> > > priority. Otherwise we can refactor the operator code and introduce a
> > > new configuration to support native mode.
> > >
> > > I think it would be better to introduce a new configuration that
> > > FlinkConfigManager can override using the resources from the
> > > JobManagerSpec and TaskManagerSpec.
> > >
> > > Deeper in the flink-kubernetes package, this new configuration would be
> > > used as the final container resources.
> > >
> > > Thanks for your time.
> > > Richard Su
> > >
> > >> On Dec 5, 2023, at 16:45, Gyula Fóra wrote:
> > >>
> > >> As you can see in the jira ticket there hasn't been any progress,
> nobody
> > >> started to work on this yet.
> > >>
> > >> I personally don't think it's confusing to declare resources in the
> > >> FlinkDeployment resource sections. It's well documented and worked
> very
> > >> well so far for most users.
> > >> This is pretty common practice for kubernetes.
> > >>
> > >> Cheers,
> > >> Gyula
> > >>
> > >> On Tue, Dec 5, 2023 at 9:35 AM richard.su 
> > wrote:
> > >>
> > >>> Hi Gyula, has there been any progress on FLINK-33548? I would like
> > >>> to join the discussion but I haven't seen any discussion at the URL.
> > >>>
> > >>> I also create FlinkDeployments via the Flink operator, which indeed
> > >>> overrides the process size with TaskManagerSpec.resources or
> > >>> JobManagerSpec.resources, which is really confusing; I modified the
> > >>> operator code to avoid the override.
> > >>>
> > >>> Looking for your response.
> > >>>
> > >>> Than
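
The limit-factor arithmetic that comes up repeatedly in this thread is simple to state precisely. Flink's native Kubernetes options (e.g. `kubernetes.taskmanager.cpu.limit-factor`) derive the container limit as request × factor; going the other way, the factor implied by a FLINK-33548-style requests/limits spec can be computed with a small sketch (Python purely for illustration):

```python
def limit_factor(request: float, limit: float) -> float:
    """Derive the limit-factor implied by a request/limit pair.

    Flink's native K8s integration goes the other way around: it takes the
    request (derived from process.size / the cpu amount) and a configured
    *.limit-factor >= 1.0, then sets limit = request * factor.
    """
    if limit < request:
        raise ValueError("limit must be >= request")
    return limit / request

# FLINK-33548-style spec from this thread: requests 64Mi/250m, limits 128Mi/500m
mem_factor = limit_factor(64, 128)    # 128Mi / 64Mi
cpu_factor = limit_factor(0.25, 0.5)  # 500m / 250m
```

A factor of 1.0 (requests equal to limits for every container) is what yields Guaranteed QoS, which is why decoupling requests from limits necessarily moves a pod to Burstable.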

Re: Discussion: [FLINK-24150] Support to configure cpu resource request and limit in pod template

2023-12-05 Thread Gyula Fóra
As you can see in the jira ticket there hasn't been any progress, nobody
started to work on this yet.

I personally don't think it's confusing to declare resources in the
FlinkDeployment resource sections. It's well documented and worked very
well so far for most users.
This is pretty common practice for kubernetes.

Cheers,
Gyula

On Tue, Dec 5, 2023 at 9:35 AM richard.su  wrote:

> Hi Gyula, has there been any progress on FLINK-33548? I would like to join
> the discussion, but I haven't seen any discussion at that URL.
>
> I also create FlinkDeployments with the Flink operator, which indeed
> overrides the process size based on TaskmanagerSpec.resources or
> JobmanagerSpec.resources. This was really confusing, so I modified the
> Flink operator code to avoid the override.
>
> Looking for your response.
>
> Thank you.
> Richard Su
>
>
> > On Dec 5, 2023, at 16:22, Gyula Fóra wrote:
> >
> > Hi!
> >
> > Please see the discussion in
> > https://lists.apache.org/thread/6p5tk6obmk1qxf169so498z4vk8cg969
> > and the ticket: https://issues.apache.org/jira/browse/FLINK-33548
> >
> > We should follow the approach outlined there. If you are interested you
> are
> > welcome to pick up the operator ticket.
> >
> > Unfortunately your PR can be a large unexpected change to existing users
> so
> > we should not add it.
> >
> > Cheers,
> > Gyula
> >
> > On Tue, Dec 5, 2023 at 9:05 AM 苏超腾  wrote:
> >
> >> Hello everyone,
> >>
> >> I've encountered an issue while using Flink Kubernetes native mode.
> >> Despite setting resource limits in the pod template, it appears that
> >> these limits and requests are not considered during JobManager (JM) and
> >> TaskManager (TM) pod deployment.
> >>
> >> I found an issue already opened in Jira, FLINK-24150, which raised
> >> almost the same questions that I encountered.
> >>
> >> I agree that if the user has provided pod templates, we should give them
> >> higher priority than what Flink calculates from the configuration.
> >>
> >> But this needs some discussion in our community, because it relates to
> >> some scenarios:
> >> If I want to create a pod with Guaranteed QoS and want the memory of the
> >> Flink main container to be larger than the Flink process size, I cannot
> >> directly modify the podTemplate (although we can use the limit factor,
> >> this will change the QoS from Guaranteed to Burstable).
> >> If I want to create a pod with Burstable QoS, I don't want to use the
> >> limit factor and instead want to directly configure the request to be
> >> 50% of the limit, which currently cannot be done.
> >> To cover these scenarios, I have submitted a pull request:
> >> https://github.com/apache/flink/pull/23872
> >>
> >> The code is very simple and just needs someone to review it; the PR can
> >> be cherry-picked to older versions, which will be helpful.
> >>
> >>
> >> I would appreciate any feedback on this.
> >>
> >> Thank you for your time and contributions to the Flink project.
> >>
> >> Thank you,
> >> chaoran.su
>
>
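
The Guaranteed/Burstable trade-off described in the scenarios above follows from how Kubernetes assigns QoS classes from container requests and limits. The sketch below is a simplified, single-resource model of that classification — not Flink or kubelet code; the real rules consider every container and both CPU and memory together:

```java
public class QosSketch {
    /**
     * Simplified single-resource QoS classification. A pod is Guaranteed only
     * when requests equal limits; a request without a matching limit (or a
     * limit above the request) makes it Burstable; neither set is BestEffort.
     */
    static String qosClass(Double request, Double limit) {
        if (request == null && limit == null) {
            return "BestEffort";
        }
        if (request != null && request.equals(limit)) {
            return "Guaranteed";
        }
        return "Burstable";
    }

    public static void main(String[] args) {
        // With a limit factor of 1.0, request == limit, so the pod stays
        // Guaranteed; any factor > 1 makes limit > request and demotes it.
        System.out.println(qosClass(2.0, 2.0)); // Guaranteed
        System.out.println(qosClass(2.0, 3.0)); // Burstable
    }
}
```

This is why a limit factor greater than 1 — the only way to raise the container limit above the Flink process size today — necessarily changes the pod's QoS class.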


Re: Discussion: [FLINK-24150] Support to configure cpu resource request and limit in pod template

2023-12-05 Thread Gyula Fóra
Hi!

Please see the discussion in
https://lists.apache.org/thread/6p5tk6obmk1qxf169so498z4vk8cg969
and the ticket: https://issues.apache.org/jira/browse/FLINK-33548

We should follow the approach outlined there. If you are interested you are
welcome to pick up the operator ticket.

Unfortunately your PR can be a large unexpected change to existing users so
we should not add it.

Cheers,
Gyula

On Tue, Dec 5, 2023 at 9:05 AM 苏超腾  wrote:

> Hello everyone,
>
> I've encountered an issue while using Flink Kubernetes native mode. Despite
> setting resource limits in the pod template, it appears that these limits
> and requests are not considered during JobManager (JM) and TaskManager (TM)
> pod deployment.
>
> I found an issue already opened in Jira, FLINK-24150, which raised
> almost the same questions that I encountered.
>
> I agree that if the user has provided pod templates, we should give them
> higher priority than what Flink calculates from the configuration.
>
> But this needs some discussion in our community, because it relates to
> some scenarios:
> If I want to create a pod with Guaranteed QoS and want the memory of the
> Flink main container to be larger than the Flink process size, I cannot
> directly modify the podTemplate (although we can use the limit factor,
> this will change the QoS from Guaranteed to Burstable).
> If I want to create a pod with Burstable QoS, I don't want to use the
> limit factor and instead want to directly configure the request to be
> 50% of the limit, which currently cannot be done.
> To cover these scenarios, I have submitted a pull request:
> https://github.com/apache/flink/pull/23872
>
> The code is very simple and just needs someone to review it; the PR can
> be cherry-picked to older versions, which will be helpful.
>
>
> I would appreciate any feedback on this.
>
> Thank you for your time and contributions to the Flink project.
>
> Thank you,
> chaoran.su
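
The precedence the thread argues for — pod-template values winning over configuration-derived values — can be sketched as a simple map merge. The helper below is illustrative only, not the actual operator or flink-kubernetes code path:

```java
import java.util.HashMap;
import java.util.Map;

public class ResourcePrecedenceSketch {
    /**
     * Merge container resources: start from the values Flink derives from its
     * configuration, then let pod-template entries override them where present.
     */
    static Map<String, String> effectiveResources(Map<String, String> fromConfig,
                                                  Map<String, String> fromPodTemplate) {
        Map<String, String> effective = new HashMap<>(fromConfig);
        effective.putAll(fromPodTemplate); // pod template takes priority
        return effective;
    }

    public static void main(String[] args) {
        Map<String, String> config = Map.of("memory", "2Gi", "cpu", "1");
        Map<String, String> podTemplate = Map.of("memory", "3Gi");
        // memory comes from the pod template; cpu falls back to the config
        System.out.println(effectiveResources(config, podTemplate));
    }
}
```

Under this rule, a user who sets nothing in the pod template keeps today's behavior, while explicit pod-template resources are no longer silently overridden.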


Re: [DISCUSS] Resolve diamond inheritance of Sink.createWriter

2023-12-04 Thread Gyula Fóra
Hi All!

Based on the discussion above, I feel that the most reasonable approach
from both the developers' and users' perspective at this point is what Becket
lists as Option 1:

Revert the naming change to the backward compatible version and accept that
the names are not perfect (treat it as legacy).

On a different note, I agree that the current sink v2 interface is very
difficult to evolve and structuring the interfaces the way they are now is
not a good design in the long run.
For new functionality or changes we can make easily, we should switch to
the decorative/mixin interface approach used successfully in the source and
table interfaces. Let's try to do this as much as possible within the v2
and compatibility boundaries and we should only introduce a v3 if we really
must.

So from my side, +1 to reverting the naming to keep backward compatibility.

Cheers,
Gyula


On Fri, Dec 1, 2023 at 10:43 AM Péter Váry 
wrote:

> Thanks Becket for your reply!
>
> *On Option 1:*
> - I personally consider API inconsistencies more important, since they will
> remain with us "forever", but this is up to the community. I can implement
> whichever solution we decide upon.
>
> *Option 2:*
> - I don't think this specific issue merits a rewrite, but if we decide to
> change our approach, then it's a different story.
>
> *Evolvability:*
> This discussion reminds me of a similar discussion on FLIP-372 [1], where
> we are trying to decide if we should use mixin interfaces, or use interface
> inheritance.
> With the mixin approach, we have a more flexible interface, but we can't
> check the generic types of the interfaces/classes at compile time, or even
> when we create the DAG. The issue first surfaces when we call the method and
> it fails.
> The issue here is similar:
> - *StatefulSink* needs a writer with a method to `*snapshotState*`
> - *TwoPhaseCommittingSink* needs a writer with `*prepareCommit*`
> - If there is a Sink which is stateful and needs to commit, then it needs
> both of these methods.
>
> If we use the mixin solution here, we lose the possibility to check the
> types at compile time. We could do the type check at runtime using `
> *instanceof*`, so we are better off than with the FLIP-372 example above,
> but we still lose an important capability. I personally prefer the mixin
> approach, but that would mean we rewrite the Sink API again - likely a
> SinkV3. Are we ready to move down that path?
>
> Thanks,
> Peter
>
> [1] - https://lists.apache.org/thread/344pzbrqbbb4w0sfj67km25msp7hxlyd
>
> On Thu, Nov 30, 2023, 14:53 Becket Qin  wrote:
>
> > Hi folks,
> >
> > Sorry for replying late on the thread.
> >
> > For this particular FLIP, I see two solutions:
> >
> > Option 1:
> > 1. On top of the current status, rename
> > *org.apache.flink.api.connector.sink2.InitContext *to
> > *CommonInitContext (*should
> > probably be package private*)*.
> > 2. Change the name *WriterInitContext* back to *InitContext, *and revert
> > the deprecation. We can change the parameter name to writerContext if we
> > want to.
> > Admittedly, this does not have full symmetric naming of the InitContexts
> -
> > we will have CommonInitContext / InitContext / CommitterInitContext
> instead
> > of InitContext / WriterInitContext / CommitterInitContext. However, the
> > naming seems clear without much confusion. Personally, I can live with
> > that, treating the class InitContext as a non-ideal legacy class name
> > without much material harm.
> >
> > Option 2:
> > Theoretically speaking, if we really want to reach the perfect state
> while
> > being backwards compatible, we can create a brand new set of Sink
> > interfaces and deprecate the old ones. But I feel this is an overkill
> here.
> >
> > The solution to this particular issue aside, the evolvability of the
> > current interface hierarchy seems a more fundamental issue and worries me
> > more. I haven't completely thought it through, but there are two
> noticeable
> > differences between the interface design principles between Source and
> > Sink.
> > 1. Source uses decorative interfaces. For example, we have a
> > SupportsFilterPushdown interface, instead of a subclass of
> > FilterableSource. This seems to provide better flexibility.
> > 2. Source tends to have a more coarse-grained interface. For example,
> > SourceReader always has the methods of snapshotState(),
> > notifyCheckpointComplete(). Even if they may not always be required, we
> do
> > not separate them into different interfaces.
> > My hunch is that if we follow a similar approach as Source, the
> evolvability
> > might be better. If we want to do this, we'd better do it before 2.0.
> > What do you think?
> >
> > Process wise,
> > - I agree that if there is a change to the passed FLIP during
> > implementation, it should be brought back to the mailing list.
> > - There might be value for the connector nightly build to depend on the
> > latest snapshot of the same Flink major version. It helps catching
> > unexpected 
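
The diamond this thread is about — a StatefulSink whose writer must expose `snapshotState` and a TwoPhaseCommittingSink whose writer must expose `prepareCommit`, combined in one sink — can be sketched with simplified stand-in interfaces. Names and signatures below are illustrative only and do not match the real `org.apache.flink.api.connector.sink2` API:

```java
import java.util.List;

// Simplified stand-ins for the Sink V2 writer hierarchy.
interface SinkWriter<T> { void write(T element); }
interface StatefulWriter<T> extends SinkWriter<T> { List<String> snapshotState(); }
interface CommittingWriter<T, C> extends SinkWriter<T> { List<C> prepareCommit(); }

interface Sink<T> { SinkWriter<T> createWriter(); }
interface StatefulSink<T> extends Sink<T> {
    @Override StatefulWriter<T> createWriter(); // covariant return
}
interface TwoPhaseCommittingSink<T, C> extends Sink<T> {
    @Override CommittingWriter<T, C> createWriter(); // covariant return
}

// The diamond: a sink that is both stateful and committing inherits
// createWriter from both parents, so its single writer must satisfy BOTH
// covariant return types, forcing a combined writer interface.
interface StatefulCommittingWriter<T, C>
        extends StatefulWriter<T>, CommittingWriter<T, C> {}

class MySink implements StatefulSink<String>, TwoPhaseCommittingSink<String, Long> {
    @Override
    public StatefulCommittingWriter<String, Long> createWriter() {
        return new StatefulCommittingWriter<String, Long>() {
            public void write(String element) { /* buffer the element */ }
            public List<String> snapshotState() { return List.of("state"); }
            public List<Long> prepareCommit() { return List.of(1L); }
        };
    }
}

public class SinkDiamondSketch {
    public static void main(String[] args) {
        StatefulCommittingWriter<String, Long> w = new MySink().createWriter();
        w.write("a");
        System.out.println(w.snapshotState()); // [state]
        System.out.println(w.prepareCommit()); // [1]
    }
}
```

Every new capability added this way multiplies the combined interfaces, which is the evolvability concern raised above and the motivation for the decorative/mixin alternative.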

Re: Discussion: [FLINK-33609] Take into account the resource limit specified in the pod template

2023-11-24 Thread Gyula Fóra
Hi

In that case we could also close your ticket and the one you linked and
refer to this one as the selected approach :)

Gyula

On Fri, 24 Nov 2023 at 20:19, Surendra Singh Lilhore <
surendralilh...@gmail.com> wrote:

> Hi Gyula,
>
>
> Yes, I agree with this approach.
>
>
> Thanks
> Surendra
>
>
> On Sat, Nov 25, 2023, 12:28 AM Gyula Fóra  wrote:
>
> > I think a better and more Kubernetes canonical approach would be what is
> > outlined in this ticket:
> >
> > https://issues.apache.org/jira/browse/FLINK-33548
> >
> > The upside is that it will work for all operator supported Flink versions
> > and is much simpler to use than the pod template.
> >
> > What do you think?
> >
> > Gyula
> >
> > On Fri, 24 Nov 2023 at 19:50, Surendra Singh Lilhore <
> > surendralilh...@apache.org> wrote:
> >
> > > Hi Gyula,
> > >
> > > Thank you for raising these valid concerns and sharing your
> perspective.
> > I
> > > agree that this new change will impact the existing pipelines.
> > >
> > > I brought up this issue because in Kubernetes environments, resource
> > > configuration is typically managed through Kubernetes objects like
> > > container templates or Custom Resources (CR). To align with this
> > > established practice in Kubernetes, I was planning to utilize pod
> > templates
> > > in the FlinkDeployment
> > > <
> > >
> >
> https://github.com/apache/flink-kubernetes-operator/blob/main/helm/flink-kubernetes-operator/crds/flinkdeployments.flink.apache.org-v1.yml#L61
> > > >
> > > CR instead of exclusively relying on Flink-specific configuration.
> > >
> > >
> > > A similar type of issue has been raised: FLINK-24150
> > > <https://issues.apache.org/jira/browse/FLINK-24150>
> > >
> > > Thanks
> > > Surendra
> > >
> > > On Fri, Nov 24, 2023 at 4:16 PM Gyula Fóra 
> wrote:
> > >
> > > > Hi Surendra!
> > > >
> > > > The resource configuration in Flink is pretty well established and
> > > > covers setting both memory requests and limits (through the limit
> > > factor.)
> > > >
> > > > Could you please elaborate why you think this change is a good
> > addition?
> > > >
> > > > I see a few downsides:
> > > >  - It complicates memory configuration by adding new options without
> > > > actually enabling anything new
> > > >  - Existing pipelines with podTemplates may suddenly start running
> with
> > > > different memory settings in prod after this change
> > > >
> > > > So at this point I am slightly against making this change, but I
> would
> > > like
> > > > to hear the thoughts of the community on this matter.
> > > >
> > > > Thanks!
> > > > Gyula
> > > >
> > > > On Thu, Nov 23, 2023 at 2:00 PM Surendra Singh Lilhore <
> > > > surendralilh...@apache.org> wrote:
> > > >
> > > > > Hello everyone,
> > > > >
> > > > > I've encountered an issue while using the flink open source
> > > > > kubernetes operator for Flink deployment. Despite setting resource
> > > limits
> > > > > in the pod template, it appears that these limits are not
> considered
> > > > during
> > > > > TaskManager (TM) pod deployment. Upon code investigation, it seems
> > the
> > > > > limits are being overridden by the default limit factor in
> > > > > KubernetesUtils#getResourceRequirements()
> > > > > <
> > > > >
> > > >
> > >
> >
> https://github.com/apache/flink/blob/master/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/utils/KubernetesUtils.java#L372
> > > > > >
> > > > > .
> > > > >
> > > > > The current behavior of Flink only considers the limit from the
> > default
> > > > > factor, neglecting pod template resource limits. I propose Flink
> > should
> > > > > incorporate both the limit factor and pod template resource limits,
> > > > taking
> > > > > the maximum value.
> > > > >
> > > > > I've raised the issue and submitted a pull request:  FLINK-33609
> > > > > <https://github.com/apache/flink/pull/23768>
> > > > >
> > > > > During the review process, a valid concern was raised regarding the
> > > > > proposed changes. The suggestion is to initiate a quick discussion,
> > as
> > > > this
> > > > > modification will significantly alter the resource handling logic.
> > It's
> > > > > emphasized that maintaining consistency in the logic for both
> > resource
> > > > > requests and limits is crucial, rather than applying changes to
> only
> > > one
> > > > of
> > > > > them.
> > > > >
> > > > > I would appreciate any feedback on this.
> > > > >
> > > > > Thank you for your time and contributions to the Flink project.
> > > > >
> > > > > Thank you,
> > > > > Surendra
> > > > >
> > > >
> > >
> >
>


Re: Discussion: [FLINK-33609] Take into account the resource limit specified in the pod template

2023-11-24 Thread Gyula Fóra
I think a better and more Kubernetes canonical approach would be what is
outlined in this ticket:

https://issues.apache.org/jira/browse/FLINK-33548

The upside is that it will work for all operator supported Flink versions
and is much simpler to use than the pod template.

What do you think?

Gyula

On Fri, 24 Nov 2023 at 19:50, Surendra Singh Lilhore <
surendralilh...@apache.org> wrote:

> Hi Gyula,
>
> Thank you for raising these valid concerns and sharing your perspective. I
> agree that this new change will impact the existing pipelines.
>
> I brought up this issue because in Kubernetes environments, resource
> configuration is typically managed through Kubernetes objects like
> container templates or Custom Resources (CR). To align with this
> established practice in Kubernetes, I was planning to utilize pod templates
> in the FlinkDeployment
> <
> https://github.com/apache/flink-kubernetes-operator/blob/main/helm/flink-kubernetes-operator/crds/flinkdeployments.flink.apache.org-v1.yml#L61
> >
> CR instead of exclusively relying on Flink-specific configuration.
>
>
> A similar type of issue has been raised: FLINK-24150
> <https://issues.apache.org/jira/browse/FLINK-24150>
>
> Thanks
> Surendra
>
> On Fri, Nov 24, 2023 at 4:16 PM Gyula Fóra  wrote:
>
> > Hi Surendra!
> >
> > The resource configuration in Flink is pretty well established and
> > covers setting both memory requests and limits (through the limit
> factor.)
> >
> > Could you please elaborate why you think this change is a good addition?
> >
> > I see a few downsides:
> >  - It complicates memory configuration by adding new options without
> > actually enabling anything new
> >  - Existing pipelines with podTemplates may suddenly start running with
> > different memory settings in prod after this change
> >
> > So at this point I am slightly against making this change, but I would
> like
> > to hear the thoughts of the community on this matter.
> >
> > Thanks!
> > Gyula
> >
> > On Thu, Nov 23, 2023 at 2:00 PM Surendra Singh Lilhore <
> > surendralilh...@apache.org> wrote:
> >
> > > Hello everyone,
> > >
> > > I've encountered an issue while using the flink open source
> > > kubernetes operator for Flink deployment. Despite setting resource
> limits
> > > in the pod template, it appears that these limits are not considered
> > during
> > > TaskManager (TM) pod deployment. Upon code investigation, it seems the
> > > limits are being overridden by the default limit factor in
> > > KubernetesUtils#getResourceRequirements()
> > > <
> > >
> >
> https://github.com/apache/flink/blob/master/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/utils/KubernetesUtils.java#L372
> > > >
> > > .
> > >
> > > The current behavior of Flink only considers the limit from the default
> > > factor, neglecting pod template resource limits. I propose Flink should
> > > incorporate both the limit factor and pod template resource limits,
> > taking
> > > the maximum value.
> > >
> > > I've raised the issue and submitted a pull request:  FLINK-33609
> > > <https://github.com/apache/flink/pull/23768>
> > >
> > > During the review process, a valid concern was raised regarding the
> > > proposed changes. The suggestion is to initiate a quick discussion, as
> > this
> > > modification will significantly alter the resource handling logic. It's
> > > emphasized that maintaining consistency in the logic for both resource
> > > requests and limits is crucial, rather than applying changes to only
> one
> > of
> > > them.
> > >
> > > I would appreciate any feedback on this.
> > >
> > > Thank you for your time and contributions to the Flink project.
> > >
> > > Thank you,
> > > Surendra
> > >
> >
>


Re: Discussion: [FLINK-33609] Take into account the resource limit specified in the pod template

2023-11-24 Thread Gyula Fóra
Hi Surendra!

The resource configuration in Flink is pretty well established and
covers setting both memory requests and limits (through the limit factor.)

Could you please elaborate why you think this change is a good addition?

I see a few downsides:
 - It complicates memory configuration by adding new options without
actually enabling anything new
 - Existing pipelines with podTemplates may suddenly start running with
different memory settings in prod after this change

So at this point I am slightly against making this change, but I would like
to hear the thoughts of the community on this matter.

Thanks!
Gyula

On Thu, Nov 23, 2023 at 2:00 PM Surendra Singh Lilhore <
surendralilh...@apache.org> wrote:

> Hello everyone,
>
> I've encountered an issue while using the flink open source
> kubernetes operator for Flink deployment. Despite setting resource limits
> in the pod template, it appears that these limits are not considered during
> TaskManager (TM) pod deployment. Upon code investigation, it seems the
> limits are being overridden by the default limit factor in
> KubernetesUtils#getResourceRequirements()
> <
> https://github.com/apache/flink/blob/master/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/utils/KubernetesUtils.java#L372
> >
> .
>
> The current behavior of Flink only considers the limit from the default
> factor, neglecting pod template resource limits. I propose Flink should
> incorporate both the limit factor and pod template resource limits, taking
> the maximum value.
>
> I've raised the issue and submitted a pull request:  FLINK-33609
> <https://github.com/apache/flink/pull/23768>
>
> During the review process, a valid concern was raised regarding the
> proposed changes. The suggestion is to initiate a quick discussion, as this
> modification will significantly alter the resource handling logic. It's
> emphasized that maintaining consistency in the logic for both resource
> requests and limits is crucial, rather than applying changes to only one of
> them.
>
> I would appreciate any feedback on this.
>
> Thank you for your time and contributions to the Flink project.
>
> Thank you,
> Surendra
>
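
The limit-factor behavior under discussion boils down to a small computation. Below is a hedged sketch of the current logic and the FLINK-33609 proposal; the names are illustrative and the real code lives in flink-kubernetes' `KubernetesUtils#getResourceRequirements()`:

```java
public class ResourceLimitSketch {
    /** Current behavior: the limit is derived purely from request * limitFactor. */
    static double currentLimit(double request, double limitFactor) {
        return request * limitFactor;
    }

    /** FLINK-33609 proposal: also honor a pod-template limit, taking the maximum. */
    static double proposedLimit(double request, double limitFactor, Double podTemplateLimit) {
        double factorLimit = request * limitFactor;
        return podTemplateLimit == null
                ? factorLimit
                : Math.max(factorLimit, podTemplateLimit);
    }

    public static void main(String[] args) {
        // A 2 GiB request with the default factor 1.0 yields a 2 GiB limit
        // today, silently ignoring a 3 GiB limit set in the pod template.
        System.out.println(currentLimit(2.0, 1.0));       // 2.0
        System.out.println(proposedLimit(2.0, 1.0, 3.0)); // 3.0
    }
}
```

The concern raised in the review applies here: whichever rule is chosen must be applied consistently to both requests and limits, not to one of them only.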


Re: [DISCUSS] Allow TwoPhaseCommittingSink WithPreCommitTopology to alter the type of the Committable

2023-11-24 Thread Gyula Fóra
Hi Peter!

Thank you for the analysis of the options.

I don't really have a strong opinion, but in general I am in favor of the
mixin-style interfaces.
We follow the same approach for table sources / sinks as well.

Some other benefits I see in addition to what you mentioned:
 - Easier to introduce new experimental / public-evolving interfaces in the
future
 - Easier to declare parts of the api stable going forward as it's not all
or nothing

The lack of proper compile-time validation is definitely a downside,
but I believe this should mostly just make initial development a little harder.

Cheers,
Gyula

On Thu, Nov 23, 2023 at 1:25 PM Péter Váry 
wrote:

> We had a longer discussion with Gordon yesterday.
> The main conclusion was that moving to a completely new interface is not
> justified, and we will try to improve the current one instead.
>
> Another ask from Gordon was to check when the user will be notified if the
> parameter types are incorrect using the mixin approach.
> Imagine the type definition below:
>
> private static class
> TestTwoPhaseCommittingSinkWithPreCommitTopologyWrongMixin
> implements
> TwoPhaseCommittingSinkWithPreCommitTopology<Integer, Long, String>,
> WithPreCommitTopology<Boolean, Void> {
>
> The parametrization of the above interfaces contradicts each other:
>
>- TwoPhaseCommittingSinkWithPreCommitTopology
>   - Input - Integer
>   - WriterResult - Long
>   - Committable - String
>- WithPreCommitTopology
>   - WriterResult - Boolean
>   - Committable - Void
>
>
> Sadly, I was not able to find a solution where we could notify the user at
> job startup time. The first error the user will get is when the first
> record is processed/committed. I talked with Gyula Fóra, and we discussed
> the possibility of using the TypeExtractor to get the types. We concluded that
> it could work in some cases, but would not be a generic solution. See the
> "NOTES FOR USERS OF THIS CLASS" [1]
>
> This missing feature would justify abandoning the mixin solution, and
> sticking to creating individual interfaces, like:
>
>- *TwoPhaseCommittingSink* - When no pre-commit topology is needed -
>kept because it is enough for most of the use-cases.
>- *TwoPhaseCommittingSinkWithPreCommitTopology* - When pre-commit
>topology is needed with transformation in the pre commit stage - the new
>generic interface (could be internal)
>- *WithPreWriteTopology* - kept as it is
>- *WithPreCommitTopology* - extends
>TwoPhaseCommittingSinkWithPreCommitTopology with the transformation
> method
>(classes from the streaming package are needed, so it cannot be merged with
>TwoPhaseCommittingSinkWithPreCommitTopology)
>- *WithPostCommitTopology* - kept as it is (extends only
>TwoPhaseCommittingSink, so no pre-commit topology is allowed)
>- *WithPostCommitTopologyWithPreCommitTopology* - extends
>WithPreCommitTopology with the same method as WithPostCommitTopology
>
> I don't really like the `WithPostCommitTopologyWithPreCommitTopology`
> complex interface, and if we start adding new features then the number of
> the interfaces could grow exponentially, but I agree that the type checking
> is important. I don't have a strong opinion, but I am inclined to vote for
> moving in the direction of the individual interfaces.
>
> What do you prefer?
>
>1. Go with the mixin approach
>   1. Better extendability
>   2. Fewer interfaces (only with 1 now, but later this could be more)
>   3. Easier to understand (IMHO)
>2. Stick with the combined interfaces approach (some mixin, like
>WithPreWriteTopology, some combined like
>WithPostCommitTopologyWithPreCommitTopology)
>   1. Better error messages
>   2. Less disruptive change (still breaking for implementations of
>   WithPreCommitTopology)
>3. Do you have a better idea?
>
>
> Thanks,
> Peter
>
> CC: Jiabao Sun - as he might be interested in this discussion
>
> [1] -
>
> https://nightlies.apache.org/flink/flink-docs-master/api/java/org/apache/flink/api/java/typeutils/TypeExtractor.html
>
>
> Péter Váry wrote (on Wed, Oct 25, 2023, 16:02):
>
> > Hi Gordon,
> >
> > Thanks for the review, here are my thoughts:
> >
> > > In terms of the abstraction layering, I was wondering if you've also
> > considered this approach which I've quickly sketched in my local fork:
> >
> https://github.com/tzulitai/flink/commit/e84e3ac57ce023c35037a8470fefdfcad877bcae
> >
> > I think we have a few design issues here:
> > - How to handle the old interface where the transformation is not needed
> > in the pre-commit phase? - As you have proposed, default method
> > implementation is a nice solution here, as we do not really have to
> change
> > everything in the transformation process.
> > - How to handle the WithPostCommitTopology interface? - Currently the
> > parent interface for the sink with a post commit topology is strictly a
> > single interface (TwoPhaseCommittingSink) and we want to add this to both

[ANNOUNCE] Apache Flink Kubernetes Operator 1.7.0 released

2023-11-22 Thread Gyula Fóra
The Apache Flink community is very happy to announce the release of Apache
Flink Kubernetes Operator 1.7.0.

The Flink Kubernetes Operator allows users to manage their Apache Flink
applications and their lifecycle through native k8s tooling like kubectl.

Release highlights:
 - Standalone autoscaler module
 - Improved autoscaler metric tracking
 - Savepoint triggering improvements
 - Java 17 & 21 support

Release blogpost:
https://flink.apache.org/2023/11/22/apache-flink-kubernetes-operator-1.7.0-release-announcement/

The release is available for download at:
https://flink.apache.org/downloads.html

Maven artifacts for Flink Kubernetes Operator can be found at:
https://search.maven.org/artifact/org.apache.flink/flink-kubernetes-operator

Official Docker image for Flink Kubernetes Operator can be found at:
https://hub.docker.com/r/apache/flink-kubernetes-operator

The full release notes are available in Jira:
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12353462

We would like to thank all contributors of the Apache Flink community who
made this release possible!

Regards,
Gyula Fora


[RESULT] [VOTE] Apache Flink Kubernetes Operator Release 1.7.0, release candidate #1

2023-11-21 Thread Gyula Fóra
I'm happy to announce that we have unanimously approved this release.

There are 5 approving votes, 3 of which are binding:

* Marton Balassi (binding)
* Gyula Fora (binding)
* Maximilian Michels (binding)
* Rui Fan (non-binding)
* Mate Czagany (non-binding)

There are no disapproving votes.

Thanks everyone!
Gyula


Re: [VOTE] Apache Flink Kubernetes Operator Release 1.7.0, release candidate #1

2023-11-21 Thread Gyula Fóra
Closing this vote thread, results will be announced in a separate email.

Gyula

On Mon, Nov 20, 2023 at 7:22 PM Maximilian Michels  wrote:

> +1 (binding)
>
> 1. Downloaded the archives, checksums, and signatures
> 2. Verified the signatures and checksums
> 3. Extract and inspect the source code for binaries
> 4. Compiled and tested the source code via mvn verify
> 5. Verified license files / headers
> 6. Deployed helm chart to test cluster
> 7. Build and ran dynamic autoscaling example image
> 8. Tested autoscaling without rescaling API
>
> Hit a non-fatal error collecting metrics in the stabilization phase
> (this is a new feature), not a release blocker though [1].
>
> -Max
>
> [1] Caused by: org.apache.flink.runtime.rest.util.RestClientException:
> [org.apache.flink.runtime.rest.handler.RestHandlerException: Cannot
> connect to ResourceManager right now. Please try to refresh.
> at
> org.apache.flink.runtime.rest.handler.resourcemanager.AbstractResourceManagerHandler.lambda$getResourceManagerGateway$0(AbstractResourceManagerHandler.java:91)
>  at
> java.base/java.util.Optional.orElseThrow(Unknown Source)
> at
> org.apache.flink.runtime.rest.handler.resourcemanager.AbstractResourceManagerHandler.getResourceManagerGateway(AbstractResourceManagerHandler.java:89)
> ...
>
> On Mon, Nov 20, 2023 at 5:48 PM Márton Balassi 
> wrote:
> >
> > +1 (binding)
> >
> > - Verified Helm repo works as expected, points to correct image tag,
> build,
> > version
> > - Verified basic examples + checked operator logs everything looks as
> > expected
> > - Verified hashes, signatures and source release contains no binaries
> > - Ran built-in tests, built jars + docker image from source successfully
> > - Upgraded the operator and the CRD from 1.6.1 to 1.7.0
> >
> > Best,
> > Marton
> >
> > On Mon, Nov 20, 2023 at 2:03 PM Gyula Fóra  wrote:
> >
> > > +1 (binding)
> > >
> > > Verified:
> > >  - Release files, maven repo contents, checksums, signature
> > >  - Verified and installed from Helm chart
> > >  - Ran basic stateful example and verified
> > >- Upgrade flow
> > >- No errors in logs
> > >- Autoscaler (turn on/off, verify configmap cleared correctly)
> > >- In-place scaling with 1.18 and adaptive scheduler
> > >  - Built from source with Java 11 & 17
> > >  - Checked release notes
> > >
> > > Cheers,
> > > Gyula
> > >
> > > On Fri, Nov 17, 2023 at 1:59 PM Rui Fan <1996fan...@gmail.com> wrote:
> > >
> > > > +1(non-binding)
> > > >
> > > > - Downloaded artifacts from dist
> > > > - Verified SHA512 checksums
> > > > - Verified GPG signatures
> > > > - Build the source with java-11 and java-17
> > > > - Verified the license header
> > > > - Verified that chart and appVersion matches the target release
> > > > - RC repo works as Helm repo (helm repo add
> flink-operator-repo-1.7.0-rc1
> > > >
> > > >
> > >
> https://dist.apache.org/repos/dist/dev/flink/flink-kubernetes-operator-1.7.0-rc1/
> > > > )
> > > > - Verified Helm chart can be installed  (helm install
> > > > flink-kubernetes-operator
> > > > flink-operator-repo-1.7.0-rc1/flink-kubernetes-operator --set
> > > > webhook.create=false)
> > > > - Submitted the autoscaling demo, the autoscaler works well with
> rescale
> > > > api (kubectl apply -f autoscaling.yaml)
> > > > - Download Autoscaler standalone: wget
> > > >
> > > >
> > >
> https://repository.apache.org/content/repositories/orgapacheflink-1672/org/apache/flink/flink-autoscaler-standalone/1.7.0/flink-autoscaler-standalone-1.7.0.jar
> > > > - Ran Autoscaler standalone locally, it works well with rescale api
> > > >
> > > > Best,
> > > > Rui
> > > >
> > > > On Fri, Nov 17, 2023 at 2:45 AM Mate Czagany 
> wrote:
> > > >
> > > > > +1 (non-binding)
> > > > >
> > > > > - Checked signatures, checksums
> > > > > - No binaries found in the source release
> > > > > - Verified all source files contain the license header
> > > > > - All pom files point to the correct version
> > > > > - Verified Helm chart version and appVersion
> > > > > - Verified Docker image tag
> > > > > - Ran flink-autoscaler-standalone JAR downloaded from the maven
>

Re: [VOTE] FLIP-393: Make QueryOperations SQL serializable

2023-11-21 Thread Gyula Fóra
+1 (binding)

Gyula

On Tue, 21 Nov 2023 at 13:11, xiangyu feng  wrote:

> +1 (non-binding)
>
> Thanks for driving this.
>
> Best,
> Xiangyu Feng
>
>
> Ferenc Csaky wrote on Tue, Nov 21, 2023, at 20:07:
>
> > +1 (non-binding)
> >
> > Looking forward to this!
> >
> > Best,
> > Ferenc
> >
> >
> >
> >
> > On Tuesday, November 21st, 2023 at 12:21, Martijn Visser <
> > martijnvis...@apache.org> wrote:
> >
> >
> > >
> > >
> > > +1 (binding)
> > >
> > > Thanks for driving this.
> > >
> > > Best regards,
> > >
> > > Martijn
> > >
> > > On Tue, Nov 21, 2023 at 12:18 PM Benchao Li libenc...@apache.org
> wrote:
> > >
> > > > +1 (binding)
> > > >
> > > > Dawid Wysakowicz wysakowicz.da...@gmail.com wrote on Tue, Nov 21, 2023 at 18:56:
> > > >
> > > > > Hi everyone,
> > > > >
> > > > > Thank you to everyone for the feedback on FLIP-393: Make
> > QueryOperations
> > > > > SQL serializable[1]
> > > > > which has been discussed in this thread [2].
> > > > >
> > > > > I would like to start a vote for it. The vote will be open for at
> > least 72
> > > > > hours unless there is an objection or not enough votes.
> > > > >
> > > > > [1] https://cwiki.apache.org/confluence/x/vQ2ZE
> > > > > [2]
> https://lists.apache.org/thread/ztyk68brsbmwwo66o1nvk3f6fqqhdxgk
> > > >
> > > > --
> > > >
> > > > Best,
> > > > Benchao Li
> >
>


Re: [VOTE] Apache Flink Kubernetes Operator Release 1.7.0, release candidate #1

2023-11-20 Thread Gyula Fóra
+1 (binding)

Verified:
 - Release files, maven repo contents, checksums, signature
 - Verified and installed from Helm chart
 - Ran basic stateful example and verified
   - Upgrade flow
   - No errors in logs
   - Autoscaler (turn on/off, verify configmap cleared correctly)
   - In-place scaling with 1.18 and adaptive scheduler
 - Built from source with Java 11 & 17
 - Checked release notes

Cheers,
Gyula

On Fri, Nov 17, 2023 at 1:59 PM Rui Fan <1996fan...@gmail.com> wrote:

> +1(non-binding)
>
> - Downloaded artifacts from dist
> - Verified SHA512 checksums
> - Verified GPG signatures
> - Build the source with java-11 and java-17
> - Verified the license header
> - Verified that chart and appVersion matches the target release
> - RC repo works as Helm repo (helm repo add flink-operator-repo-1.7.0-rc1
>
> https://dist.apache.org/repos/dist/dev/flink/flink-kubernetes-operator-1.7.0-rc1/
> )
> - Verified Helm chart can be installed  (helm install
> flink-kubernetes-operator
> flink-operator-repo-1.7.0-rc1/flink-kubernetes-operator --set
> webhook.create=false)
> - Submitted the autoscaling demo, the autoscaler works well with rescale
> api (kubectl apply -f autoscaling.yaml)
> - Download Autoscaler standalone: wget
>
> https://repository.apache.org/content/repositories/orgapacheflink-1672/org/apache/flink/flink-autoscaler-standalone/1.7.0/flink-autoscaler-standalone-1.7.0.jar
> - Ran Autoscaler standalone locally, it works well with rescale api
>
> Best,
> Rui
>
> On Fri, Nov 17, 2023 at 2:45 AM Mate Czagany  wrote:
>
> > +1 (non-binding)
> >
> > - Checked signatures, checksums
> > - No binaries found in the source release
> > - Verified all source files contain the license header
> > - All pom files point to the correct version
> > - Verified Helm chart version and appVersion
> > - Verified Docker image tag
> > - Ran flink-autoscaler-standalone JAR downloaded from the maven
> repository
> > - Tested autoscaler upscales correctly on load with Flink 1.18 rescaling
> > API
> >
> > Thanks,
> > Mate
> >
> > Gyula Fóra  wrote (on Wed, Nov 15, 2023, at 16:37):
> >
> > > Hi Everyone,
> > >
> > > Please review and vote on the release candidate #1 for the version
> 1.7.0
> > of
> > > Apache Flink Kubernetes Operator,
> > > as follows:
> > > [ ] +1, Approve the release
> > > [ ] -1, Do not approve the release (please provide specific comments)
> > >
> > > **Release Overview**
> > >
> > > As an overview, the release consists of the following:
> > > a) Kubernetes Operator canonical source distribution (including the
> > > Dockerfile), to be deployed to the release repository at
> dist.apache.org
> > > b) Kubernetes Operator Helm Chart to be deployed to the release
> > repository
> > > at dist.apache.org
> > > c) Maven artifacts to be deployed to the Maven Central Repository
> > > d) Docker image to be pushed to dockerhub
> > >
> > > **Staging Areas to Review**
> > >
> > > The staging areas containing the above mentioned artifacts are as
> > follows,
> > > for your review:
> > > * All artifacts for a,b) can be found in the corresponding dev
> repository
> > > at dist.apache.org [1]
> > > * All artifacts for c) can be found at the Apache Nexus Repository [2]
> > > * The docker image for d) is staged on github [3]
> > >
> > > All artifacts are signed with the key 21F06303B87DAFF1 [4]
> > >
> > > Other links for your review:
> > > * JIRA release notes [5]
> > > * source code tag "release-1.7.0-rc1" [6]
> > > * PR to update the website Downloads page to
> > > include Kubernetes Operator links [7]
> > >
> > > **Vote Duration**
> > >
> > > The voting time will run for at least 72 hours.
> > > It is adopted by majority approval, with at least 3 PMC affirmative
> > votes.
> > >
> > > **Note on Verification**
> > >
> > > You can follow the basic verification guide here[8].
> > > Note that you don't need to verify everything yourself, but please make
> > > note of what you have tested together with your +- vote.
> > >
> > > Cheers!
> > > Gyula Fora
> > >
> > > [1]
> > >
> > >
> >
> https://dist.apache.org/repos/dist/dev/flink/flink-kubernetes-operator-1.7.0-rc1/
> > > [2]
> > >
> https://repository.apache.org/content/repositories/orgapacheflink-1672/
> > > [3] ghcr.io/apache/flink-kubernetes-operator:ccb10b8
> > > [4] https://dist.apache.org/repos/dist/release/flink/KEYS
> > > [5]
> > >
> > >
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12353462
> > > [6]
> > >
> >
> https://github.com/apache/flink-kubernetes-operator/tree/release-1.7.0-rc1
> > > [7] https://github.com/apache/flink-web/pull/699
> > > [8]
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/Verifying+a+Flink+Kubernetes+Operator+Release
> > >
> >
>


Re: [VOTE] FLIP-378: Support Avro timestamp with local timezone

2023-11-19 Thread Gyula Fóra
+1

Also, while technically enough, I prefer to exclude weekends from the 72
hours. So this vote should ideally have run for a little longer.

But no harm done :)

Gyula

On Mon, 20 Nov 2023 at 03:46, Peter Huang 
wrote:

> Thanks for pointing this out. I will follow the process in the future.
>
>
> Best Regards
> Peter Huang
>
> On Sun, Nov 19, 2023 at 5:55 PM Leonard Xu  wrote:
>
> > Thanks Peter for driving this.
> >
> > Please pay attention that although the VOTE thread has enough binding
> > votes apart from Gyula and Martijn’s votes, I’d like to point out that
> the
> > +1 in discussion thread should not be listed here, it’s not a formal
> vote.
> >
> > Best,
> > Leonard
> >
> > > On Nov 20, 2023 at 7:55 AM, Peter Huang  wrote:
> > >
> > > Hi everyone,
> > >
> > > The voting time for [VOTE]  FLIP-378: Support Avro timestamp with local
> > > timezone [1] has
> > > passed. I'm closing the vote now.
> > >
> > > There were 7 votes, of which 6 were binding:
> > >
> > > - Gyula Fora (binding) +1 in the discussion thread
> > > - Jark Wu (binding)
> > > - Jing Ge (binding)
> > > - Leonard Xu (binding)
> > > - Martijn Visser (binding)  +1 in the discussion thread
> > > - Matyas Orhidi (binding)
> > > - MingLiang Liu (non-binding)
> > >
> > >
> > > Thus, FLIP-378 has been accepted.
> > >
> > > Thanks everyone for joining the discussion and giving feedback!
> > >
> > > [1] https://lists.apache.org/thread/7hls4813xmq01wbmo90jtfb5chr3mpr2
> > >
> > > Best Regards
> > > Peter Huang
> >
> >
>


SQL / Table input format for Kafka that can access full record

2023-11-16 Thread Gyula Fóra
Hey All!

I have been browsing through the codebase but don't really have a good
answer.

Is there any way to access the entire Kafka consumer record (key, value,
headers) when deserializing into a dynamic table? I see that we can
specify, key format , value format and metadata columns independently but
it seems like none of the format abstractions allow us to actually access
all these at the same time to deserialize.

One use case would be integrating with schema stores that store schema
information in the record header instead of the value.

Does anyone know any examples of something like this? Or how would you go
about adding this functionality?

Thanks!
Gyula
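
For context, the pieces that can be configured independently today look like this in Kafka table DDL: a key format, a value format, and a headers metadata column. This is a sketch with made-up table and topic names; note that the `headers` metadata column only surfaces raw bytes alongside the already-deserialized value, which is exactly why it does not cover the schema-store use case described above.

```sql
-- Hypothetical table; the connector options used here exist in the
-- Flink Kafka SQL connector, but names/topics are illustrative only.
CREATE TABLE orders (
  order_key STRING,
  order_payload STRING,
  -- record headers exposed as read-only raw bytes
  kafka_headers MAP<STRING, BYTES> METADATA FROM 'headers' VIRTUAL
) WITH (
  'connector' = 'kafka',
  'topic' = 'orders',
  'properties.bootstrap.servers' = 'localhost:9092',
  'key.format' = 'raw',
  'key.fields' = 'order_key',
  'value.format' = 'json'
);
```

The key format, value format, and metadata columns are each resolved by separate deserialization paths, so no single format implementation ever sees the whole ConsumerRecord.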


[VOTE] Apache Flink Kubernetes Operator Release 1.7.0, release candidate #1

2023-11-15 Thread Gyula Fóra
Hi Everyone,

Please review and vote on the release candidate #1 for the version 1.7.0 of
Apache Flink Kubernetes Operator,
as follows:
[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific comments)

**Release Overview**

As an overview, the release consists of the following:
a) Kubernetes Operator canonical source distribution (including the
Dockerfile), to be deployed to the release repository at dist.apache.org
b) Kubernetes Operator Helm Chart to be deployed to the release repository
at dist.apache.org
c) Maven artifacts to be deployed to the Maven Central Repository
d) Docker image to be pushed to dockerhub

**Staging Areas to Review**

The staging areas containing the above mentioned artifacts are as follows,
for your review:
* All artifacts for a,b) can be found in the corresponding dev repository
at dist.apache.org [1]
* All artifacts for c) can be found at the Apache Nexus Repository [2]
* The docker image for d) is staged on github [3]

All artifacts are signed with the key 21F06303B87DAFF1 [4]

Other links for your review:
* JIRA release notes [5]
* source code tag "release-1.7.0-rc1" [6]
* PR to update the website Downloads page to
include Kubernetes Operator links [7]

**Vote Duration**

The voting time will run for at least 72 hours.
It is adopted by majority approval, with at least 3 PMC affirmative votes.

**Note on Verification**

You can follow the basic verification guide here[8].
Note that you don't need to verify everything yourself, but please make
note of what you have tested together with your +- vote.

Cheers!
Gyula Fora

[1]
https://dist.apache.org/repos/dist/dev/flink/flink-kubernetes-operator-1.7.0-rc1/
[2] https://repository.apache.org/content/repositories/orgapacheflink-1672/
[3] ghcr.io/apache/flink-kubernetes-operator:ccb10b8
[4] https://dist.apache.org/repos/dist/release/flink/KEYS
[5]
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12353462
[6]
https://github.com/apache/flink-kubernetes-operator/tree/release-1.7.0-rc1
[7] https://github.com/apache/flink-web/pull/699
[8]
https://cwiki.apache.org/confluence/display/FLINK/Verifying+a+Flink+Kubernetes+Operator+Release


Re: [DISCUSS] FLIP-378: Support Avro timestamp with local timezone

2023-11-04 Thread Gyula Fóra
+1

Gyula

On Thu, Nov 2, 2023 at 6:18 AM Martijn Visser 
wrote:

> +1
>
> On Thu, Nov 2, 2023 at 12:44 PM Leonard Xu  wrote:
> >
> >
> > > Thanks @Leonard Xu . Two minor versions are
> definitely needed for flip the configs.
> >
> > Sorry, Peter. I thought the next minor versions were 1.19 and 2.0, but
> > actually it should be 1.19, 1.20, and 2.0 per the current community
> > version plan IIUC, so removing the config in 2.0 should be okay if the
> > 1.20 version exists.
> >
> > Best,
> > Leonard
> >
> >
> > >
> > > On Mon, Oct 30, 2023 at 8:55 PM Leonard Xu <xbjt...@gmail.com> wrote:
> > > Thanks @Peter for driving this FLIP
> > >
> > > +1 from my side, the timestamp semantics mapping looks good to me.
> > >
> > > > >  In the end, the legacy behavior will be dropped in
> > > > > Flink 2.0
> > > > I don’t think we can introduce this option in 1.19 and drop it in 2.0;
> > > > the API removal requires at least two minor versions.
> > >
> > >
> > > Best,
> > > Leonard
> > >
> > > > On Oct 31, 2023 at 11:18 AM, Peter Huang <huangzhenqiu0...@gmail.com> wrote:
> > > >
> > > > Hi Devs,
> > > >
> > > > Currently, Flink Avro Format doesn't support the Avro time
> (milli/micros)
> > > > with local timezone type.
> > > > Although the Avro timestamp (millis/micros) type is supported and is
> mapped
> > > > to flink timestamp without timezone,
> > > > it is not compliant to semantics defined in Consistent timestamp
> types in
> > > > Hadoop SQL engines
> > > > <https://docs.google.com/document/d/1gNRww9mZJcHvUDCXklzjFEQGpefsuR_akCDfWsdE35Q/edit#heading=h.n699ftkvhjlo>
> > > > .
> > > >
> > > > I propose to support Avro timestamps with the compliance to the
> mapping
> > > > semantics [1] by using a configuration flag.
> > > > To keep back compatibility, current mapping is kept as default
> behavior.
> > > > Users can explicitly turn on the new mapping
> > > > by setting it to false. In the end, the legacy behavior will be
> dropped in
> > > > Flink 2.0
> > > >
> > > > Looking forward to your feedback.
> > > >
> > > >
> > > > [1]
> > > >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-378%3A+Support+Avro+timestamp+with+local+timezone
> > > >
> > > >
> > > > Best Regards
> > > >
> > > > Peter Huang
> > >
> >
>


[DISCUSS] Kubernetes Operator 1.7.0 release planning

2023-10-30 Thread Gyula Fóra
Hi all!

I would like to kick off the release planning for the operator 1.7.0
release. We have added quite a lot of new functionality over the last few
weeks and I think the operator is in a good state to kick this off.

Based on the original release schedule we had Nov 1 as the proposed feature
freeze date and Nov 7 as the date for the release cut / rc1.

I think this is reasonable as I am not aware of any big features / bug
fixes being worked on right now. Given the size of the changes related to
the autoscaler module refactor we should try to focus the remaining time on
testing.

I am happy to volunteer as a release manager but I am of course open to
working together with someone on this.

What do you think?

Cheers,
Gyula


Re: [ANNOUNCE] Apache Flink Kubernetes Operator 1.6.1 released

2023-10-30 Thread Gyula Fóra
Thank you Rui for taking care of this!


On Mon, Oct 30, 2023 at 11:55 AM Rui Fan <1996fan...@gmail.com> wrote:

> The Apache Flink community is very happy to announce the release of Apache
> Flink Kubernetes Operator 1.6.1.
>
> Please check out the release blog post for an overview of the release:
>
> https://flink.apache.org/2023/10/27/apache-flink-kubernetes-operator-1.6.1-release-announcement/
>
> The release is available for download at:
> https://flink.apache.org/downloads.html
>
> Maven artifacts for Flink Kubernetes Operator can be found at:
>
> https://search.maven.org/artifact/org.apache.flink/flink-kubernetes-operator
>
> Official Docker image for Flink Kubernetes Operator applications can be
> found at:
> https://hub.docker.com/r/apache/flink-kubernetes-operator
>
> The full release notes are available in Jira:
>
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12353784
>
> We would like to thank all contributors of the Apache Flink community who
> made this release possible!
>
> Regards,
> Rui Fan
>


Re: Operator 1.6 to Olm

2023-10-25 Thread Gyula Fóra
Thank you David!

I currently only see the 1.5.0 version as the latest, but I will check back
again later.

Cheers,
Gyula


On Wed, Oct 25, 2023 at 11:17 AM David Radley 
wrote:

> Hi,
> Fyi with some expert direction from James Busche, I have published the 1.6
> OLM and operatorhub.io versions of the Flink operator.  When 1.6.1 is out
> I will do the same again,
>  Kind regards, David.
>
>
>
> From: Gyula Fóra 
> Date: Tuesday, 10 October 2023 at 13:27
> To: dev@flink.apache.org 
> Subject: [EXTERNAL] Re: Operator 1.6 to Olm
> That would be great David, thank you!
>
> Gyula
>
> On Tue, 10 Oct 2023 at 14:13, David Radley 
> wrote:
>
> > Hi,
> > I notice that the latest version in olm of the operator is 1.5. I plan to
> > run the scripts to publish the 1.6 Flink operator to olm,
> >  Kind regards, David.
> >
> > Unless otherwise stated above:
> >
> > IBM United Kingdom Limited
> > Registered in England and Wales with number 741598
> > Registered office: PO Box 41, North Harbour, Portsmouth, Hants. PO6 3AU
> >
>
> Unless otherwise stated above:
>
> IBM United Kingdom Limited
> Registered in England and Wales with number 741598
> Registered office: PO Box 41, North Harbour, Portsmouth, Hants. PO6 3AU
>


Re: [DISCUSS] FLIP-360: Merging ExecutionGraphInfoStore and JobResultStore into a single component

2023-10-23 Thread Gyula Fóra
I am a bit confused by the split in the CompletedJobStore / JobDetailsStore.
Seems like JobDetailsStore is simply a view on top of CompletedJobStore:
 - Maybe we should not call it a store? Is it storing anything?
 - Why couldn't the cleanup triggering be the responsibility of the
CompletedJobStore, wouldn't it make it simpler to have the storage/cleanup
related logic in a single place?
 - Ideally the JobDetailsStore / JobDetailsProvider could be a very thin
interface exposed by the CompletedJobStore

Gyula
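
The layering being questioned could be sketched roughly like this: a single store that owns the heavy data and the cleanup, exposing the job details as a thin read-only view rather than as a second "store". All names and signatures below are hypothetical illustrations of the idea, not actual Flink interfaces:

```java
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Thin read-only view over the small-footprint, in-memory job details.
interface JobDetailsView {
    Set<String> completedJobIds();
}

// Single store owning the heavy data and the cleanup logic; the view is
// just an interface it exposes, not a separate component.
interface CompletedJobStore extends JobDetailsView {
    void put(String jobId, byte[] executionGraphInfo); // heavy data, may spill to disk
    Optional<byte[]> get(String jobId);
    void cleanup(String jobId);                        // cleanup lives in one place
}

class InMemoryCompletedJobStore implements CompletedJobStore {
    private final Map<String, byte[]> jobs = new ConcurrentHashMap<>();

    @Override public void put(String jobId, byte[] info) { jobs.put(jobId, info); }
    @Override public Optional<byte[]> get(String jobId) { return Optional.ofNullable(jobs.get(jobId)); }
    @Override public void cleanup(String jobId) { jobs.remove(jobId); }
    @Override public Set<String> completedJobIds() { return Set.copyOf(jobs.keySet()); }
}

public class CompletedJobStoreSketch {
    public static void main(String[] args) {
        CompletedJobStore store = new InMemoryCompletedJobStore();
        store.put("job-1", new byte[] {1, 2, 3});
        JobDetailsView view = store; // the "details store" is a view on the store
        System.out.println(view.completedJobIds());
        store.cleanup("job-1");
        System.out.println(view.completedJobIds());
    }
}
```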

On Sat, Sep 30, 2023 at 2:18 AM Matthias Pohl
 wrote:

> Thanks for sharing your thoughts, Shammon FY. I kind of see your point.
>
> Initially, I didn't put much thought into splitting up the interface into
> two because the dispatcher would have been the only component dealing with
> the two interfaces. Having two interfaces wouldn't have added much value
> (in terms of testability and readability, I thought).
>
> But I iterated over the idea once more and came up with a new proposal that
> involves the two components CompletedJobStore and JobDetailsStore. It's not
> 100% what you suggested (because the retrieval of the ExecutionGraphInfo
> still lives in the CompletedJobStore) but the separation makes sense in my
> opinion:
> - The CompletedJobStore deals with the big data that might require
> accessing the disk.
> - The JobDetailsStore handles the small-footprint data that lives in memory
> all the time. Additionally, it takes care of actually deleting the metadata
> of the completed job in both stores if a TTL is configured.
>
> See FLIP-360 [1] with the newly added class and sequence diagrams and
> additional content. I only updated the Interfaces & Methods section (see
> diff [2]).
>
> I'm looking forward to feedback.
>
> Best,
> Matthias
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-360%3A+merging+the+executiongraphinfostore+and+the+jobresultstore+into+a+single+component+completedjobstore
> [2]
>
> https://cwiki.apache.org/confluence/pages/diffpagesbyversion.action?pageId=263428420=14=13
>
> On Mon, Sep 18, 2023 at 1:20 AM Shammon FY  wrote:
>
> > Hi Matthias,
> >
> > Thanks for initiating this discussion, and +1 for overall from my side.
> > It's really strange to have two different places to store completed jobs,
> > this also brings about the complexity of Flink. I agree with using a
> > unified instance to store the completed job information.
> >
> > In terms of ability, `ExecutionGraphInfoStore` and `JobResultStore` are
> > different: one is mainly used for information display, and the other is
> for
> > failover. So after unifying storage, can we use different interfaces to
> > meet the different requirements for jobs? Adding all these methods for
> > different components into one interface such as `CompletedJobStore` may
> be
> > a little strange. What do you think?
> >
> > Best,
> > Shammon FY
> >
> >
> >
> > On Fri, Sep 8, 2023 at 8:08 PM Gyula Fóra  wrote:
> >
> > > Hi Matthias!
> > >
> > > Thank you for the detailed proposal, overall I am in favor of making
> this
> > > unification to simplify the logic and make the integration for external
> > > components more straightforward.
> > > I will try to read through the proposal more carefully next week and
> > > provide some detailed feedback.
> > >
> > > +1
> > >
> > > Thanks
> > > Gyula
> > >
> > > On Fri, Sep 8, 2023 at 8:36 AM Matthias Pohl  > > .invalid>
> > > wrote:
> > >
> > > > Just a bit more elaboration on the question that we need to answer
> > here:
> > > Do
> > > > we want to expose the internal ArchivedExecutionGraph data structure
> > > > through JSON?
> > > >
> > > > - The JSON approach allows the user to have (almost) full access to
> the
> > > > information (that would be otherwise derived from the REST API).
> > > Therefore,
> > > > there's no need to spin up a cluster to access this information.
> > > > Any information that shall be exposed through the REST API needs to
> be
> > > > well-defined in this JSON structure, though. Large parts of the
> > > > ArchivedExecutionGraph data structure (essentially anything that
> shall
> > be
> > > > used to populate the REST API) become public domain, though, which
> puts
> > > > more constraints on this data structure and makes it harder to change
> > it
> > > in
> > > > the future.
> > > >
> > > > - The binary data approach

Re: [VOTE] Apache Flink Kubernetes Operator Release 1.6.1, release candidate #1

2023-10-23 Thread Gyula Fóra
+1 (binding)

- Verified checksums, signatures, source release content
- Helm repo works correctly and points to the correct image / version
- Installed operator, ran stateful example

Gyula

On Sat, Oct 21, 2023 at 1:43 PM Rui Fan <1996fan...@gmail.com> wrote:

> +1(non-binding)
>
> - Downloaded artifacts from dist
> - Verified SHA512 checksums
> - Verified GPG signatures
> - Build the source with java-11 and verified the licenses together
> - Verified that all POM files point to the same version.
> - Verified that chart and appVersion matches the target release
> - Verified that helm chart / values.yaml points to the RC docker image
> - Verified that RC repo works as Helm repo (helm repo add
> flink-operator-repo-1.6.1-rc1
>
> https://dist.apache.org/repos/dist/dev/flink/flink-kubernetes-operator-1.6.1-rc1/
> )
> - Verified Helm chart can be installed  (helm install
> flink-kubernetes-operator
> flink-operator-repo-1.6.1-rc1/flink-kubernetes-operator --set
> webhook.create=false)
> - Submitted the autoscaling demo, the autoscaler works well (kubectl apply
> -f autoscaling.yaml)
> - Triggered a manual savepoint (update the yaml: savepointTriggerNonce:
> 101)
>
> Best,
> Rui
>
> On Sat, Oct 21, 2023 at 7:33 PM Rui Fan <1996fan...@gmail.com> wrote:
>
> > Hi Everyone,
> >
> > Please review and vote on the release candidate #1 for the version 1.6.1
> of
> > Apache Flink Kubernetes Operator,
> > as follows:
> > [ ] +1, Approve the release
> > [ ] -1, Do not approve the release (please provide specific comments)
> >
> > **Release Overview**
> >
> > As an overview, the release consists of the following:
> > a) Kubernetes Operator canonical source distribution (including the
> > Dockerfile), to be deployed to the release repository at dist.apache.org
> > b) Kubernetes Operator Helm Chart to be deployed to the release
> repository
> > at dist.apache.org
> > c) Maven artifacts to be deployed to the Maven Central Repository
> > d) Docker image to be pushed to dockerhub
> >
> > **Staging Areas to Review**
> >
> > The staging areas containing the above mentioned artifacts are as
> follows,
> > for your review:
> > * All artifacts for a,b) can be found in the corresponding dev repository
> > at dist.apache.org [1]
> > * All artifacts for c) can be found at the Apache Nexus Repository [2]
> > * The docker image for d) is staged on github [3]
> >
> > All artifacts are signed with the
> > key B2D64016B940A7E0B9B72E0D7D0528B28037D8BC [4]
> >
> > Other links for your review:
> > * source code tag "release-1.6.1-rc1" [5]
> > * PR to update the website Downloads page to
> > include Kubernetes Operator links [6]
> > * PR to update the doc version of flink-kubernetes-operator[7]
> >
> > **Vote Duration**
> >
> > The voting time will run for at least 72 hours.
> > It is adopted by majority approval, with at least 3 PMC affirmative
> votes.
> >
> > **Note on Verification**
> >
> > You can follow the basic verification guide here[8].
> > Note that you don't need to verify everything yourself, but please make
> > note of what you have tested together with your +- vote.
> >
> > [1]
> >
> https://dist.apache.org/repos/dist/dev/flink/flink-kubernetes-operator-1.6.1-rc1/
> > [2]
> > https://repository.apache.org/content/repositories/orgapacheflink-1663/
> > [3]
> >
> https://github.com/apache/flink-kubernetes-operator/pkgs/container/flink-kubernetes-operator/139454270?tag=51eeae1
> > [4] https://dist.apache.org/repos/dist/release/flink/KEYS
> > [5]
> >
> https://github.com/apache/flink-kubernetes-operator/tree/release-1.6.1-rc1
> > [6] https://github.com/apache/flink-web/pull/690
> > [7] https://github.com/apache/flink-kubernetes-operator/pull/687
> > [8]
> >
> https://cwiki.apache.org/confluence/display/FLINK/Verifying+a+Flink+Kubernetes+Operator+Release
> >
> > Best,
> > Rui
> >
>


Re: [DISCUSS] Creating Kubernetes Operator release 1.6.1

2023-10-20 Thread Gyula Fóra
Thank you Rui, sounds good!

Gyula

On Fri, Oct 20, 2023 at 11:14 AM Rui Fan <1996fan...@gmail.com> wrote:

> Hi devs,
>
> I checked all commits that the main branch has but release-1.6 branch
> doesn't,
> all of them are improvements instead of bug fixes. So I will prepare the
> 1.6.1-rc1 based on the latest release-1.6 branch.
>
> Please let me know if any critical commit is missed at release-1.6 branch,
> thanks~
>
> And thanks Gyula's help!
>
> Best,
> Rui
>
> On Thu, Oct 19, 2023 at 2:25 PM Gyula Fóra  wrote:
>
>> Thanks Rui,
>> I would appreciate your help. Let's sync on the tasks offline.
>>
>> As for the bugfix commits. We can take a quick look and backport some
>> other
>> fixes if needed but we should focus on critical fixes / regressions to
>> make
>> the changes minimal given that the 1.7.0 is not so far anyway :)
>>
>> Cheers,
>> Gyula
>>
>>
>> On Thu, Oct 19, 2023 at 8:20 AM Rui Fan <1996fan...@gmail.com> wrote:
>>
>> > Hi Gyula,
>> >
>> > Thank you for driving this discussion!
>> >
>> > This release seems good to me, I have a question:
>> > I see some bugfix commits have been merged into
>> > the 1.6-release branch. Does it already contain all
>> > recent bugfix commits?
>> >
>> > Also, you said in the `Kubernetes Operator 1.6.0 release planning`[1]:
>> >
>> > > I am volunteering as the release manager but if someone else wants to
>> do
>> > it, I would also be happy to simply give assistance :)
>> >
>> > It's a minor version. For those who have never released before,
>> > a minor version may be a good entry point. Would you mind
>> > if I volunteer as the release manager for 1.6.1?
>> >
>> > [1] https://lists.apache.org/thread/5ynjv18nfoj6rvyhlz1g5y8qtxx6v1gd
>> >
>> > Best,
>> > Rui
>> >
>> > On Thu, Oct 19, 2023 at 1:06 PM Gyula Fóra 
>> wrote:
>> >
>> > > Hi All!
>> > >
>> > > I would like to propose to release the 1.6.1 patch version for the
>> > > Kubernetes operator. The release branch currently contains 2-3
>> critical
>> > > fixes for issues that many users have hit over time.
>> > >
>> > > Making this release now would allow us more time to wrap up and
>> finalize
>> > > the 1.7.0 release changes (some of which are quite big regarding the
>> > > autoscaler)
>> > >
>> > > If there are no objections I will prepare the release candidate. The
>> > > changeset is minimal but very important.
>> > >
>> > > Cheers,
>> > > Gyula
>> > >
>> >
>>
>


Re: [DISCUSS] Creating Kubernetes Operator release 1.6.1

2023-10-19 Thread Gyula Fóra
Thanks Rui,
I would appreciate your help. Let's sync on the tasks offline.

As for the bugfix commits. We can take a quick look and backport some other
fixes if needed but we should focus on critical fixes / regressions to make
the changes minimal given that the 1.7.0 is not so far anyway :)

Cheers,
Gyula


On Thu, Oct 19, 2023 at 8:20 AM Rui Fan <1996fan...@gmail.com> wrote:

> Hi Gyula,
>
> Thank you for driving this discussion!
>
> This release seems good to me, I have a question:
> I see some bugfix commits have been merged into
> the 1.6-release branch. Does it already contain all
> recent bugfix commits?
>
> Also, you said in the `Kubernetes Operator 1.6.0 release planning`[1]:
>
> > I am volunteering as the release manager but if someone else wants to do
> it, I would also be happy to simply give assistance :)
>
> It's a minor version. For those who have never released before,
> a minor version may be a good entry point. Would you mind
> if I volunteer as the release manager for 1.6.1?
>
> [1] https://lists.apache.org/thread/5ynjv18nfoj6rvyhlz1g5y8qtxx6v1gd
>
> Best,
> Rui
>
> On Thu, Oct 19, 2023 at 1:06 PM Gyula Fóra  wrote:
>
> > Hi All!
> >
> > I would like to propose to release the 1.6.1 patch version for the
> > Kubernetes operator. The release branch currently contains 2-3 critical
> > fixes for issues that many users have hit over time.
> >
> > Making this release now would allow us more time to wrap up and finalize
> > the 1.7.0 release changes (some of which are quite big regarding the
> > autoscaler)
> >
> > If there are no objections I will prepare the release candidate. The
> > changeset is minimal but very important.
> >
> > Cheers,
> > Gyula
> >
>


[DISCUSS] Creating Kubernetes Operator release 1.6.1

2023-10-18 Thread Gyula Fóra
Hi All!

I would like to propose to release the 1.6.1 patch version for the
Kubernetes operator. The release branch currently contains 2-3 critical
fixes for issues that many users have hit over time.

Making this release now would allow us more time to wrap up and finalize
the 1.7.0 release changes (some of which are quite big regarding the
autoscaler)

If there are no objections I will prepare the release candidate. The
changeset is minimal but very important.

Cheers,
Gyula


Re: [VOTE] FLIP-371: Provide initialization context for Committer creation in TwoPhaseCommittingSink

2023-10-11 Thread Gyula Fóra
Thanks , Peter.

+1

Gyula

On Wed, 11 Oct 2023 at 14:47, Péter Váry 
wrote:

> Hi all,
>
> Thank you to everyone for the feedback on FLIP-371[1].
> Based on the discussion thread [2], I think we are ready to take a vote to
> contribute this to Flink.
> I'd like to start a vote for it.
> The vote will be open for at least 72 hours (excluding weekends, unless
> there is an objection or an insufficient number of votes).
>
> Thanks,
> Peter
>
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-371%3A+Provide+initialization+context+for+Committer+creation+in+TwoPhaseCommittingSink
> [2] https://lists.apache.org/thread/v3mrspdlrqrzvbwm0lcgr0j4v03dx97c
>


Re: Operator 1.6 to Olm

2023-10-10 Thread Gyula Fóra
That would be great David, thank you!

Gyula

On Tue, 10 Oct 2023 at 14:13, David Radley  wrote:

> Hi,
> I notice that the latest version in olm of the operator is 1.5. I plan to
> run the scripts to publish the 1.6 Flink operator to olm,
>  Kind regards, David.
>
> Unless otherwise stated above:
>
> IBM United Kingdom Limited
> Registered in England and Wales with number 741598
> Registered office: PO Box 41, North Harbour, Portsmouth, Hants. PO6 3AU
>


Re: [DISCUSS] Java Record support

2023-10-05 Thread Gyula Fóra
Hi,
I have opened a draft PR [1] that shows the minimal required changes and a
suggested unit test setup for Java version specific tests.
There is still some work to be done (run all benchmarks, add more tests for
compatibility/migration)

If you have time please review / comment on the approach here or on Github.

Cheers,
Gyula

[1] https://github.com/apache/flink/pull/23490
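
The core mechanics discussed in this thread — detecting a record class and instantiating it through its canonical constructor in component order, instead of a no-arg constructor plus setters — can be sketched with plain JDK reflection (Java 16+). This is a standalone illustration, not code from the PR:

```java
import java.lang.reflect.Constructor;
import java.lang.reflect.RecordComponent;

public class RecordInstantiationSketch {

    // Sample record standing in for a user type.
    record Point(int x, int y) {}

    // Build the record via its canonical constructor. The values must be
    // supplied in declared component order -- the part a record-aware
    // PojoSerializer has to get right.
    static <T> T instantiateRecord(Class<T> recordClass, Object... values) throws Exception {
        RecordComponent[] components = recordClass.getRecordComponents();
        Class<?>[] paramTypes = new Class<?>[components.length];
        for (int i = 0; i < components.length; i++) {
            paramTypes[i] = components[i].getType();
        }
        Constructor<T> canonical = recordClass.getDeclaredConstructor(paramTypes);
        canonical.setAccessible(true);
        return canonical.newInstance(values);
    }

    public static void main(String[] args) throws Exception {
        // Detection during type extraction reduces to Class#isRecord().
        System.out.println(Point.class.isRecord());
        Point p = instantiateRecord(Point.class, 1, 2);
        System.out.println(p.x() + "," + p.y());
    }
}
```

Since record components are serialized like the equivalent POJO fields, reading them back through the canonical constructor keeps the wire format identical to a non-record POJO, which is what gives the backward compatibility mentioned above.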

On Wed, Oct 4, 2023 at 7:09 PM Peter Huang 
wrote:

> +1 for the convenience of users.
>
> On Wed, Oct 4, 2023 at 8:05 AM Matthias Pohl  .invalid>
> wrote:
>
> > +1 Sounds like a good idea.
> >
> > On Wed, Oct 4, 2023 at 5:04 PM Gyula Fóra  wrote:
> >
> > > I will share my initial implementation soon, it seems to be pretty
> > > straightforward.
> > >
> > > Biggest challenge so far is setting tests so we can still compile
> against
> > > older versions but have tests for records . But I have working proposal
> > for
> > > that as well.
> > >
> > > Gyula
> > >
> > > On Wed, 4 Oct 2023 at 16:45, Chesnay Schepler 
> > wrote:
> > >
> > > > Kryo isn't required for this; newer versions do support records but
> we
> > > > want something like a PojoSerializer for records to be performant.
> > > >
> > > > The core challenges are
> > > > a) detecting records during type extraction
> > > > b) ensuring parameters are passed to the constructor in the right
> > order.
> > > >
> > > >  From what I remember from my own experiments this shouldn't exactly
> > > > be /difficult/, but just a bit tedious to integrate into the Type
> > > > extraction stack.
> > > >
> > > > On 04/10/2023 16:14, Őrhidi Mátyás wrote:
> > > > > +1 This would be great
> > > > >
> > > > > On Wed, Oct 4, 2023 at 7:04 AM Gyula Fóra 
> > > wrote:
> > > > >
> > > > > Hi All!
> > > > >
> > > > > Flink 1.18 contains experimental Java 17 support but it misses
> > out
> > > > > on Java
> > > > > records which can be one of the nice benefits of actually using
> > > > > newer java
> > > > > versions.
> > > > >
> > > > > There is already a Jira to track this feature [1] but I am not
> > > > > aware of any
> > > > > previous efforts so far.
> > > > >
> > > > > Since records have pretty strong guarantees and many users
> would
> > > > > probably
> > > > > want to migrate from their POJOs, I think we should enhance the
> > > > > current
> > > > > Pojo TypeInfo/Serializer to accommodate for the records.
> > > > >
> > > > > I experimented with this locally and the changes are not huge
> as
> > > > > we only
> > > > > need to allow instantiating records through the constructor
> > instead
> > > > of
> > > > > setters. This would mean that the serialization format is
> > basically
> > > > > equivalent to the same non-record pojo, giving us backward
> > > > > compatibility
> > > > > and all the features of the Pojo serializer for basically free.
> > > > >
> > > > > We should make sure to not introduce any performance regression
> > in
> > > > the
> > > > > PojoSerializer but I am happy to open a preview PR if there is
> > > > > interest.
> > > > >
> > > > > There were mentions of upgrading Kryo to support this but I
> think
> > > > that
> > > > > would add unnecessary complexity.
> > > > >
> > > > > What do you all think?
> > > > >
> > > > > Cheers,
> > > > > Gyula
> > > > >
> > > > > [1] https://issues.apache.org/jira/browse/FLINK-32380
> > > > >
> > > >
> > >
> >
>


Re: [DISCUSS] Java Record support

2023-10-04 Thread Gyula Fóra
I will share my initial implementation soon; it seems to be pretty
straightforward.

The biggest challenge so far is setting up tests so that we can still
compile against older Java versions while also having tests for records.
But I have a working proposal for that as well.

Gyula

On Wed, 4 Oct 2023 at 16:45, Chesnay Schepler  wrote:

> Kryo isn't required for this; newer versions do support records but we
> want something like a PojoSerializer for records to be performant.
>
> The core challenges are
> a) detecting records during type extraction
> b) ensuring parameters are passed to the constructor in the right order.
>
>  From what I remember from my own experiments this shouldn't exactly
> be /difficult/, but just a bit tedious to integrate into the Type
> extraction stack.
>
> On 04/10/2023 16:14, Őrhidi Mátyás wrote:
> > +1 This would be great
> >
> > On Wed, Oct 4, 2023 at 7:04 AM Gyula Fóra  wrote:
> >
> > Hi All!
> >
> > Flink 1.18 contains experimental Java 17 support but it misses out
> > on Java
> > records which can be one of the nice benefits of actually using
> > newer java
> > versions.
> >
> > There is already a Jira to track this feature [1] but I am not
> > aware of any
> > previous efforts so far.
> >
> > Since records have pretty strong guarantees and many users would
> > probably
> > want to migrate from their POJOs, I think we should enhance the
> > current
> > Pojo TypeInfo/Serializer to accommodate for the records.
> >
> > I experimented with this locally and the changes are not huge as
> > we only
> > need to allow instantiating records through the constructor instead
> of
> > setters. This would mean that the serialization format is basically
> > equivalent to the same non-record pojo, giving us backward
> > compatibility
> > and all the features of the Pojo serializer for basically free.
> >
> > We should make sure to not introduce any performance regression in
> the
> > PojoSerializer but I am happy to open a preview PR if there is
> > interest.
> >
> > There were mentions of upgrading Kryo to support this but I think
> that
> > would add unnecessary complexity.
> >
> > What do you all think?
> >
> > Cheers,
> > Gyula
> >
> > [1] https://issues.apache.org/jira/browse/FLINK-32380
> >
>
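
The two core challenges Chesnay lists map directly onto the record
reflection API introduced in Java 16. The sketch below is only an
illustration of that mechanism, not Flink's actual type extraction code;
the class and method names (`RecordReflectionSketch`, `isRecordType`,
`instantiate`) are invented for this example:

```java
import java.lang.reflect.RecordComponent;
import java.util.Arrays;

public class RecordReflectionSketch {
    // A toy record standing in for a user type; any record would do.
    public record Point(int x, int y) {}

    // Challenge (a): records can be detected during type extraction
    // with Class#isRecord (available since Java 16).
    public static boolean isRecordType(Class<?> clazz) {
        return clazz.isRecord();
    }

    // Challenge (b): getRecordComponents() returns components in
    // declaration order, which is exactly the canonical constructor's
    // parameter order -- so deserialized field values can be passed
    // to the constructor in the right sequence.
    public static Object instantiate(Class<?> clazz, Object... fieldValues)
            throws ReflectiveOperationException {
        Class<?>[] paramTypes = Arrays.stream(clazz.getRecordComponents())
                .map(RecordComponent::getType)
                .toArray(Class<?>[]::new);
        return clazz.getDeclaredConstructor(paramTypes).newInstance(fieldValues);
    }

    public static void main(String[] args) throws ReflectiveOperationException {
        System.out.println(isRecordType(Point.class));      // true
        System.out.println(instantiate(Point.class, 1, 2)); // Point[x=1, y=2]
    }
}
```

The declaration-order guarantee is what makes the "right order" problem
tractable without any extra metadata.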


[DISCUSS] Java Record support

2023-10-04 Thread Gyula Fóra
Hi All!

Flink 1.18 contains experimental Java 17 support, but it misses out on Java
records, which are one of the nice benefits of actually using newer Java
versions.

There is already a Jira to track this feature [1], but I am not aware of
any previous efforts.

Since records have pretty strong guarantees and many users would probably
want to migrate from their POJOs, I think we should enhance the current
Pojo TypeInfo/Serializer to accommodate records.

I experimented with this locally and the changes are not huge, as we only
need to allow instantiating records through the constructor instead of
setters. This would mean that the serialization format is basically
equivalent to that of the same non-record POJO, giving us backward
compatibility and all the features of the Pojo serializer basically for
free.

We should make sure not to introduce any performance regression in the
PojoSerializer, but I am happy to open a preview PR if there is interest.

There were mentions of upgrading Kryo to support this, but I think that
would add unnecessary complexity.

What do you all think?

Cheers,
Gyula

[1] https://issues.apache.org/jira/browse/FLINK-32380
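
The backward-compatibility claim above can be made concrete with a small
sketch. This is hypothetical code, not the actual PojoSerializer: it only
shows the idea that the bytes on the wire are the same for a POJO and an
equivalent record, and that the read path alone differs (setters vs. the
canonical constructor). All names (`PojoVsRecordSketch`, `PointPojo`,
`PointRecord`) are invented for the example:

```java
import java.io.*;

public class PojoVsRecordSketch {
    // A classic POJO with a no-arg constructor and setters...
    public static class PointPojo {
        private int x;
        private int y;
        public void setX(int x) { this.x = x; }
        public void setY(int y) { this.y = y; }
        public int getX() { return x; }
        public int getY() { return y; }
    }

    // ...and the record a user might migrate it to.
    public record PointRecord(int x, int y) {}

    // The wire format is just the fields in order -- identical for both shapes.
    public static byte[] write(int x, int y) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bos)) {
            out.writeInt(x);
            out.writeInt(y);
        }
        return bos.toByteArray();
    }

    // Reading a POJO: instantiate first, then populate via setters.
    public static PointPojo readPojo(byte[] bytes) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
        PointPojo p = new PointPojo();
        p.setX(in.readInt());
        p.setY(in.readInt());
        return p;
    }

    // Reading a record: buffer the field values, then call the constructor.
    public static PointRecord readRecord(byte[] bytes) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
        return new PointRecord(in.readInt(), in.readInt());
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = write(3, 4);
        // The same bytes deserialize into either shape, which is what would
        // make a POJO -> record migration backward compatible in this scheme.
        System.out.println(readPojo(bytes).getX() == readRecord(bytes).x()); // true
    }
}
```

Because the write side never changes, existing state written for the POJO
remains readable after the type is migrated to a record.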


[RESULT][VOTE] FLIP-361: Improve GC Metrics

2023-09-19 Thread Gyula Fóra
Hi Everyone,

The proposal, FLIP-361: Improve GC Metrics, has been unanimously accepted
with 15 votes (8 binding).

+1 votes:

 - Gyula Fora (binding)
 - Rui Fan (binding)
 - Ahmed Hamdy
 - Chen Zhanghao
 - Conrad Jam
 - Samrat Deb
 - Xintong Song (binding)
 - Matt Wang
 - Dong Lin (binding)
 - Venkatakrishnan Sowrirajan
 - Maximilian Michels (binding)
 - Yangze Guo (binding)
 - Weihua Hu (binding)
 - Yuepeng Pan
 - Jing Ge (binding)

Thank you!
Gyula
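
For readers unfamiliar with where JVM GC numbers come from: metrics like
these are typically derived from the JMX `GarbageCollectorMXBean`s. This is
a general illustration of that standard API, an assumption about the usual
mechanism rather than something stated in this thread or taken from the
FLIP:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcMetricsSketch {
    public static void main(String[] args) {
        // Each bean corresponds to one collector (e.g. "G1 Young Generation").
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // getCollectionCount()/getCollectionTime() are cumulative since JVM
            // start (or -1 if unsupported); per-interval rates are obtained by
            // diffing successive readings.
            System.out.printf("%s: count=%d timeMs=%d%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}
```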


Re: [VOTE] FLIP-361: Improve GC Metrics

2023-09-19 Thread Gyula Fóra
Thank you all!
Closing this vote.

On Mon, Sep 18, 2023 at 6:16 PM Jing Ge  wrote:

> looks great, thanks!
> +1(binding)
>
> Best regards,
> Jing
>
> On Mon, Sep 18, 2023 at 8:11 AM Yuepeng Pan  wrote:
>
> > +1 (non-binding).
> >
> >
> > Best,
> > Yuepeng.
> >
> >
> >
> >
> >
> > On 2023-09-16 12:52:09, "Weihua Hu"  wrote:
> > >+1 (binding)
> > >
> > >Best,
> > >Weihua
> > >
> > >
> > >On Fri, Sep 15, 2023 at 4:28 PM Yangze Guo  wrote:
> > >
> > >> +1 (binding)
> > >>
> > >> Best,
> > >> Yangze Guo
> > >>
> > >> On Thu, Sep 14, 2023 at 5:47 PM Maximilian Michels 
> > wrote:
> > >> >
> > >> > +1 (binding)
> > >> >
> > >> > On Thu, Sep 14, 2023 at 4:26 AM Venkatakrishnan Sowrirajan
> > >> >  wrote:
> > >> > >
> > >> > > +1 (non-binding)
> > >> > >
> > >> > > On Wed, Sep 13, 2023, 6:55 PM Matt Wang  wrote:
> > >> > >
> > >> > > > +1 (non-binding)
> > >> > > >
> > >> > > >
> > >> > > > Thanks for driving this FLIP
> > >> > > >
> > >> > > >
> > >> > > >
> > >> > > >
> > >> > > > --
> > >> > > >
> > >> > > > Best,
> > >> > > > Matt Wang
> > >> > > >
> > >> > > >
> > >> > > >  Replied Message 
> > >> > > > | From | Xintong Song |
> > >> > > > | Date | 09/14/2023 09:54 |
> > >> > > > | To |  |
> > >> > > > | Subject | Re: [VOTE] FLIP-361: Improve GC Metrics |
> > >> > > > +1 (binding)
> > >> > > >
> > >> > > > Best,
> > >> > > >
> > >> > > > Xintong
> > >> > > >
> > >> > > >
> > >> > > >
> > >> > > > On Thu, Sep 14, 2023 at 2:40 AM Samrat Deb <
> decordea...@gmail.com
> > >
> > >> wrote:
> > >> > > >
> > >> > > > +1 ( non binding)
> > >> > > >
> > >> > > > These improved GC metrics will be a great addition.
> > >> > > >
> > >> > > > Bests,
> > >> > > > Samrat
> > >> > > >
> > >> > > > On Wed, 13 Sep 2023 at 7:58 PM, ConradJam 
> > >> wrote:
> > >> > > >
> > >> > > > +1 (non-binding)
> > >> > > > gc metrics help with autoscale tuning features
> > >> > > >
> > >> > > > On Wed, Sep 13, 2023 at 22:16, Chen Zhanghao  wrote:
> > >> > > >
> > >> > > > +1 (non-binding). Looking forward to it
> > >> > > >
> > >> > > > Best,
> > >> > > > Zhanghao Chen
> > >> > > > 
> > >> > > > From: Gyula Fóra 
> > >> > > > Sent: September 13, 2023 21:16
> > >> > > > To: dev 
> > >> > > > Subject: [VOTE] FLIP-361: Improve GC Metrics
> > >> > > >
> > >> > > > Hi All!
> > >> > > >
> > >> > > > Thanks for all the feedback on FLIP-361: Improve GC Metrics
> [1][2]
> > >> > > >
> > >> > > > I'd like to start a vote for it. The vote will be open for at
> > least
> > >> 72
> > >> > > > hours unless there is an objection or insufficient votes.
> > >> > > >
> > >> > > > Cheers,
> > >> > > > Gyula
> > >> > > >
> > >> > > > [1]
> > >> > > >
> > >> > > >
> > >> > > >
> > >> > > >
> > >> > > >
> > >>
> >
> https://urldefense.com/v3/__https://cwiki.apache.org/confluence/display/FLINK/FLIP-361*3A*Improve*GC*Metrics__;JSsrKw!!IKRxdwAv5BmarQ!dpHSOqsSHlPJ5gCvZ2yxSGjcR4xA2N-mpGZ1w2jPuKb78aujNpbzENmi1e7B26d6v4UQ8bQZ7IQaUcI$
> > >> > > > [2]
> > >> > > >
> > >>
> >
> https://urldefense.com/v3/__https://lists.apache.org/thread/qqqv54vyr4gbp63wm2d12q78m8h95xb2__;!!IKRxdwAv5BmarQ!dpHSOqsSHlPJ5gCvZ2yxSGjcR4xA2N-mpGZ1w2jPuKb78aujNpbzENmi1e7B26d6v4UQ8bQZFdEMnAg$
> > >> > > >
> > >> > > >
> > >> > > >
> > >> > > > --
> > >> > > > Best
> > >> > > >
> > >> > > > ConradJam
> > >> > > >
> > >> > > >
> > >> > > >
> > >>
> >
>


Re: [Discuss] CRD for flink sql gateway in the flink k8s operator

2023-09-19 Thread Gyula Fóra
Based on this, I think we should start with simple Helm charts/templates
for creating the `FlinkDeployment` together with a separate Deployment for
the SQL Gateway.
If the gateway itself doesn't integrate well with the operator-managed CRs
(session jobs), then I think it's better and simpler to have it separately.

These Helm charts should be part of the operator repo / examples with nice
docs. If we see that they are useful and popular, we can start thinking of
integrating them into the CRD.

What do you think?
Gyula

On Tue, Sep 19, 2023 at 6:09 AM Yangze Guo  wrote:

> Thanks for the reply, @Gyula.
>
> I would like to first provide more context on OLAP scenarios. In OLAP
> scenarios, users typically submit multiple short batch jobs that have
> execution times typically measured in seconds or even sub-seconds.
> Additionally, due to the lightweight nature of these jobs, they often
> do not require lifecycle management features and can disable high
> availability functionalities such as failover and checkpointing.
>
> Regarding the integration issue, I believe that supporting the
> generation of FlinkSessionJob through a gateway is a "nice to have"
> feature rather than a "must-have." Firstly, it may be overkill to
> create a CRD for such lightweight jobs, and it could potentially
> impact the end-to-end execution time of the OLAP job. Secondly, as
> mentioned earlier, these jobs do not have strong lifecycle management
> requirements, so having an operator manage them would be a bit wasted.
> Therefore, atm, we can allow users to directly submit jobs using JDBC
> or REST API. WDYT?
>
> Best,
> Yangze Guo
>
> On Mon, Sep 18, 2023 at 4:08 PM Gyula Fóra  wrote:
> >
> > As I wrote in my previous answer, this could be done as a helm chart or
> as
> > part of the operator easily. Both would work.
> > My main concern for adding this into the operator is that the SQL Gateway
> > itself is not properly integrated with the Operator Custom resources.
> >
> > Gyula
> >
> > On Mon, Sep 18, 2023 at 4:24 AM Shammon FY  wrote:
> >
> > > Thanks @Gyula, I would like to share our use of sql-gateway with the
> Flink
> > > session cluster and I hope that it could help you to have a clearer
> > > understanding of our needs :)
> > >
> > > As @Yangze mentioned, currently we use flink as an olap platform by the
> > > following steps
> > > 1. Setup a flink session cluster by flink k8s session with k8s or zk
> > > highavailable.
> > > 2.  Write a Helm chart for Sql-Gateway image and launch multiple
> gateway
> > > instances to submit jobs to the same flink session cluster.
> > >
> > > As we mentioned in docs[1], we hope that users can easily launch
> > > sql-gateway instances in k8s. Does it only need to add a Helm chart for
> > > sql-gateway, or should we need to add this feature to the flink
> > > operator? Can you help give the conclusion? Thank you very much @Gyula
> > >
> > > [1]
> > >
> > >
> https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/olap_quickstart/
> > >
> > > Best,
> > > Shammon FY
> > >
> > >
> > >
> > > On Sun, Sep 17, 2023 at 2:02 PM Gyula Fóra 
> wrote:
> > >
> > > > Hi!
> > > > It sounds pretty easy to deploy the gateway automatically with
> session
> > > > cluster deployments from the operator , but there is a major
> limitation
> > > > currently. The SQL gateway itself doesn't really support any operator
> > > > integration so jobs submitted through the SQL gateway would not be
> > > > manageable by the operator (they won't show up as session jobs).
> > > >
> > > > Without that, this is a very strange feature. We would make something
> > > much
> > > > easier for users that is not well supported by the operator in the
> first
> > > > place. The operator is designed to manage clusters and jobs
> > > > (FlinkDeployment / FlinkSessionJob). It would be good to understand
> if we
> > > > could make the SQL Gateway create a FlinkSessionJob / Deployment
> (that
> > > > would require application cluster support) and basically submit the
> job
> > > > through the operator.
> > > >
> > > > Cheers,
> > > > Gyula
> > > >
> > > > On Sun, Sep 17, 2023 at 1:26 AM Yangze Guo 
> wrote:
> > > >
> > > > > > There would be many different ways of doing this. One gateway per
> > > > > session clust

Re: [Discuss] CRD for flink sql gateway in the flink k8s operator

2023-09-18 Thread Gyula Fóra
As I wrote in my previous answer, this could easily be done as a Helm chart
or as part of the operator; both would work.
My main concern with adding this to the operator is that the SQL Gateway
itself is not properly integrated with the operator's custom resources.

Gyula

On Mon, Sep 18, 2023 at 4:24 AM Shammon FY  wrote:

> Thanks @Gyula, I would like to share our use of sql-gateway with the Flink
> session cluster and I hope that it could help you to have a clearer
> understanding of our needs :)
>
> As @Yangze mentioned, currently we use flink as an olap platform by the
> following steps
> 1. Setup a flink session cluster by flink k8s session with k8s or zk
> highavailable.
> 2.  Write a Helm chart for Sql-Gateway image and launch multiple gateway
> instances to submit jobs to the same flink session cluster.
>
> As we mentioned in docs[1], we hope that users can easily launch
> sql-gateway instances in k8s. Does it only need to add a Helm chart for
> sql-gateway, or should we need to add this feature to the flink
> operator? Can you help give the conclusion? Thank you very much @Gyula
>
> [1]
>
> https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/olap_quickstart/
>
> Best,
> Shammon FY
>
>
>
> On Sun, Sep 17, 2023 at 2:02 PM Gyula Fóra  wrote:
>
> > Hi!
> > It sounds pretty easy to deploy the gateway automatically with session
> > cluster deployments from the operator , but there is a major limitation
> > currently. The SQL gateway itself doesn't really support any operator
> > integration so jobs submitted through the SQL gateway would not be
> > manageable by the operator (they won't show up as session jobs).
> >
> > Without that, this is a very strange feature. We would make something
> much
> > easier for users that is not well supported by the operator in the first
> > place. The operator is designed to manage clusters and jobs
> > (FlinkDeployment / FlinkSessionJob). It would be good to understand if we
> > could make the SQL Gateway create a FlinkSessionJob / Deployment (that
> > would require application cluster support) and basically submit the job
> > through the operator.
> >
> > Cheers,
> > Gyula
> >
> > On Sun, Sep 17, 2023 at 1:26 AM Yangze Guo  wrote:
> >
> > > > There would be many different ways of doing this. One gateway per
> > > session cluster, one gateway shared across different clusters...
> > >
> > > Currently, sql gateway cannot be shared across multiple clusters.
> > >
> > > > understand the tradeoff and the simplest way of accomplishing this.
> > >
> > > I'm not familiar with the Flink operator codebase, it would be
> > > appreciated if you could elaborate more on the cost of adding this
> > > feature. I agree that deploying a gateway using the native Kubernetes
> > > Deployment can be a simple way and straightforward for users. However,
> > > integrating it into an operator can provide additional benefits and be
> > > more user-friendly, especially for users who are less familiar with
> > > Kubernetes. By using an operator, users can benefit from consistent
> > > version management with the session cluster and upgrade capabilities.
> > >
> > >
> > > Best,
> > > Yangze Guo
> > >
> > > On Fri, Sep 15, 2023 at 5:38 PM Gyula Fóra 
> wrote:
> > > >
> > > > There would be many different ways of doing this. One gateway per
> > session
> > > > cluster, one gateway shared across different clusters...
> > > > I would not rush to add anything anywhere until we understand the
> > > tradeoff
> > > > and the simplest way of accomplishing this.
> > > >
> > > > The operator already supports ingresses for session clusters so we
> > could
> > > > have a gateway sitting somewhere else simply using it.
> > > >
> > > > Gyula
> > > >
> > > > On Fri, Sep 15, 2023 at 10:18 AM Yangze Guo 
> > wrote:
> > > >
> > > > > Thanks for bringing this up, Dongwoo. Flink SQL Gateway is also a
> key
> > > > > component for OLAP scenarios.
> > > > >
> > > > > @Gyula
> > > > > How about add sql gateway as an optional component to Session
> Cluster
> > > > > Deployments. User can specify the resource / instance number and
> > ports
> > > > > of the sql gateway. I think that would help a lot for OLAP and
> batch
> > > > > user.
> > > > >
> > > 

Re: [Discuss] CRD for flink sql gateway in the flink k8s operator

2023-09-17 Thread Gyula Fóra
Hi!
It sounds pretty easy to deploy the gateway automatically with session
cluster deployments from the operator, but there is a major limitation
currently. The SQL gateway itself doesn't really support any operator
integration so jobs submitted through the SQL gateway would not be
manageable by the operator (they won't show up as session jobs).

Without that, this is a very strange feature. We would make something much
easier for users that is not well supported by the operator in the first
place. The operator is designed to manage clusters and jobs
(FlinkDeployment / FlinkSessionJob). It would be good to understand if we
could make the SQL Gateway create a FlinkSessionJob / Deployment (that
would require application cluster support) and basically submit the job
through the operator.

Cheers,
Gyula

On Sun, Sep 17, 2023 at 1:26 AM Yangze Guo  wrote:

> > There would be many different ways of doing this. One gateway per
> session cluster, one gateway shared across different clusters...
>
> Currently, sql gateway cannot be shared across multiple clusters.
>
> > understand the tradeoff and the simplest way of accomplishing this.
>
> I'm not familiar with the Flink operator codebase, it would be
> appreciated if you could elaborate more on the cost of adding this
> feature. I agree that deploying a gateway using the native Kubernetes
> Deployment can be a simple way and straightforward for users. However,
> integrating it into an operator can provide additional benefits and be
> more user-friendly, especially for users who are less familiar with
> Kubernetes. By using an operator, users can benefit from consistent
> version management with the session cluster and upgrade capabilities.
>
>
> Best,
> Yangze Guo
>
> On Fri, Sep 15, 2023 at 5:38 PM Gyula Fóra  wrote:
> >
> > There would be many different ways of doing this. One gateway per session
> > cluster, one gateway shared across different clusters...
> > I would not rush to add anything anywhere until we understand the
> tradeoff
> > and the simplest way of accomplishing this.
> >
> > The operator already supports ingresses for session clusters so we could
> > have a gateway sitting somewhere else simply using it.
> >
> > Gyula
> >
> > On Fri, Sep 15, 2023 at 10:18 AM Yangze Guo  wrote:
> >
> > > Thanks for bringing this up, Dongwoo. Flink SQL Gateway is also a key
> > > component for OLAP scenarios.
> > >
> > > @Gyula
> > > How about add sql gateway as an optional component to Session Cluster
> > > Deployments. User can specify the resource / instance number and ports
> > > of the sql gateway. I think that would help a lot for OLAP and batch
> > > user.
> > >
> > >
> > > Best,
> > > Yangze Guo
> > >
> > > On Fri, Sep 15, 2023 at 3:19 PM ConradJam  wrote:
> > > >
> > > > If we start from the crd direction, I think this mode is more like a
> > > > sidecar of the session cluster, which is submitted to the session
> cluster
> > > > by sending sql commands to the sql gateway. I don't know if my
> statement
> > > is
> > > > accurate.
> > > >
> > > > On Fri, Sep 15, 2023 at 13:27, Xiaolong Wang  wrote:
> > > >
> > > > > Hi, Dongwoo,
> > > > >
> > > > > Since Flink SQL gateway should run upon a Flink session cluster, I
> > > think
> > > > > it'd be easier to add more fields to the CRD of `FlinkSessionJob`.
> > > > >
> > > > > e.g.
> > > > >
> > > > > apiVersion: flink.apache.org/v1beta1
> > > > > kind: FlinkSessionJob
> > > > > metadata:
> > > > >   name: sql-gateway
> > > > > spec:
> > > > >   sqlGateway:
> > > > > endpoint: "hiveserver2"
> > > > > mode: "streaming"
> > > > > hiveConf:
> > > > >   configMap:
> > > > > name: hive-config
> > > > > items:
> > > > >   - key: hive-site.xml
> > > > > path: hive-site.xml
> > > > >
> > > > >
> > > > > On Fri, Sep 15, 2023 at 12:56 PM Dongwoo Kim <
> dongwoo7@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > *@Gyula*
> > > > > > Thanks for the consideration Gyula. My initial idea for the CR
> was
> > > > > roughly
> > > > > &

Re: [Discuss] CRD for flink sql gateway in the flink k8s operator

2023-09-15 Thread Gyula Fóra
ally a work-around and allows
> me to
> > > > > start a Flink session job manager with a SQL gateway running upon.
> > > > >
> > > > > I agree that it'd be more elegant that we create a new job type and
> > > > write a
> > > > > script, which is much easier for the user to use (since they do not
> > > need
> > > > to
> > > > > build a separate Flink image any more).
> > > > >
> > > > >
> > > > >
> > > > > On Fri, Sep 15, 2023 at 10:29 AM Shammon FY 
> wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > Currently `sql-gateway` can be started with the script
> > > `sql-gateway.sh`
> > > > > in
> > > > > > an existing node, it is more like a simple "standalone" node. I
> think
> > > > > it's
> > > > > > valuable if we can do more work to start it in k8s.
> > > > > >
> > > > > > For xiaolong:
> > > > > > Do you want to start a sql-gateway instance in the jobmanager
> pod? I
> > > > > think
> > > > > > maybe we need a script like `kubernetes-sql-gatewah.sh` to start
> > > > > > `sql-gateway` pods with a flink image, what do you think?
> > > > > >
> > > > > > Best,
> > > > > > Shammon FY
> > > > > >
> > > > > >
> > > > > > On Fri, Sep 15, 2023 at 10:02 AM Xiaolong Wang
> > > > > >  wrote:
> > > > > >
> > > > > > > Hi, I've experiment this feature on K8S recently, here is some
> of
> > > my
> > > > > > trial:
> > > > > > >
> > > > > > >
> > > > > > > 1. Create a new kubernetes-jobmanager.sh script with the
> following
> > > > > > content
> > > > > > >
> > > > > > > #!/usr/bin/env bash
> > > > > > > $FLINK_HOME/bin/sql-gateway.sh start
> > > > > > > $FLINK_HOME/bin/kubernetes-jobmanager1.sh kubernetes-session
> > > > > > >
> > > > > > > 2. Build your own Flink docker image something like this
> > > > > > > FROM flink:1.17.1-scala_2.12-java11
> > > > > > >
> > > > > > > RUN mv $FLINK_HOME/bin/kubernetes-jobmanager.sh
> $FLINK_HOME/bin/
> > > > > > > kubernetes-jobmanager1.sh
> > > > > > > COPY ./kubernetes-jobmanager.sh
> > > > > $FLINK_HOME/bin/kubernetes-jobmanager.sh
> > > > > > >
> > > > > > > RUN chmod +x $FLINK_HOME/bin/*.sh
> > > > > > > USER flink
> > > > > > >
> > > > > > > 3. Create a Flink session job with the operator using the above
> > > > image.
> > > > > > >
> > > > > > > On Thu, Sep 14, 2023 at 9:49 PM Gyula Fóra <
> gyula.f...@gmail.com>
> > > > > wrote:
> > > > > > >
> > > > > > > > Hi!
> > > > > > > >
> > > > > > > > I don't completely understand what would be a content of such
> > > CRD,
> > > > > > could
> > > > > > > > you give a minimal example how the Flink SQL Gateway CR yaml
> > > would
> > > > > look
> > > > > > > > like?
> > > > > > > >
> > > > > > > > Adding a CRD would mean you need to add some
> operator/controller
> > > > > logic
> > > > > > as
> > > > > > > > well. Why not simply use a Deployment / StatefulSet in
> > > Kubernetes?
> > > > > > > >
> > > > > > > > Or a Helm chart if you want to make it more user friendly?
> > > > > > > >
> > > > > > > > Cheers,
> > > > > > > > Gyula
> > > > > > > >
> > > > > > > > On Thu, Sep 14, 2023 at 12:57 PM Dongwoo Kim <
> > > > dongwoo7@gmail.com
> > > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi all,
> > > > > > > > >
> > > > > > > > > I've been working on setting up a flink SQL gateway in a
> k8s
> > > > > > > environment
> > > > > > > > > and it got me thinking — what if we had a CRD for this?
> > > > > > > > >
> > > > > > > > > So I have quick questions below.
> > > > > > > > > 1. Is there ongoing work to create a CRD for the Flink SQL
> > > > Gateway?
> > > > > > > > > 2. If not, would the community be open to considering a
> CRD for
> > > > > this?
> > > > > > > > >
> > > > > > > > > I've noticed a growing demand for simplified setup of the
> flink
> > > > sql
> > > > > > > > gateway
> > > > > > > > > in flink's slack channel.
> > > > > > > > > Implementing a CRD could make deployments easier and offer
> > > better
> > > > > > > > > integration with k8s.
> > > > > > > > >
> > > > > > > > > If this idea is accepted, I'm open to drafting a FLIP for
> > > further
> > > > > > > > > discussion
> > > > > > > > >
> > > > > > > > > Thanks for your time and looking forward to your thoughts!
> > > > > > > > >
> > > > > > > > > Best regards,
> > > > > > > > > Dongwoo
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> >
> > --
> > Best
> >
> > ConradJam
>


Re: [DISCUSS] Kubernetes Operator Flink Version Support Policy

2023-09-14 Thread Gyula Fóra
https://flink.apache.org/downloads/#update-policy-for-old-releases

On Thu, Sep 14, 2023 at 8:47 PM Jing Ge  wrote:

> +1 Thanks! I have an off-track question: where could we find the reference
> that the community only supports the last 2 minor releases? Thanks!
>
>
> Best Regards,
> Jing
>
> On Thu, Sep 14, 2023 at 3:13 PM Ahmed Hamdy  wrote:
>
> > Makes sense,
> > Thanks for the clarification.
> > Best Regards
> > Ahmed Hamdy
> >
> >
> > On Thu, 14 Sept 2023 at 14:07, Gyula Fóra  wrote:
> >
> > > Hi Ahmed!
> > >
> > > As I mentioned in the first email, the Flink Operator explicitly aims
> to
> > > make running Flink and Flink Platforms on Kubernetes easy. As most
> users
> > > are platform teams supporting Flink inside a company or running a
> service
> > > it's basically always required to support several Flink versions at the
> > > same time.
> > >
> > > Enterprise users are in many cases using Flink versions that are older
> > than
> > > the last 2 minor releases (currently supported by the community).
> However
> > > the operator itself is somewhat independent of Flink itself, and most
> > > operator features work across several Flink versions at the same time.
> > >
> > > Based on this it's relatively easy for us to support deploying to
> > previous
> > > Flink minor versions (within some reasonable limit). This means that as
> > > long as platform teams keep the operator up-to-date they get the latest
> > > stability / deployment improvements but can still provide compatibility
> > for
> > > their users for older Flink job versions.
> > >
> > > Cheers,
> > > Gyula
> > >
> > > On Thu, Sep 14, 2023 at 2:59 PM Ahmed Hamdy 
> > wrote:
> > >
> > > > Thanks Gyula,
> > > > +1 for the proposal in general.
> > > > May I ask why are we interested in supporting more than the ones
> > > supported
> > > > by the community?
> > > > for example I understand all versions prior to 1.16 are now out of
> > > support,
> > > > why should we tie our compatibility 4 versions behind?
> > > > Best Regards
> > > > Ahmed Hamdy
> > > >
> > > >
> > > > On Thu, 14 Sept 2023 at 12:18, ConradJam 
> wrote:
> > > >
> > > > > +1
> > > > >
> > > > > > On Thu, Sep 14, 2023 at 16:15, Yang Wang  wrote:
> > > > >
> > > > > > Since the users could always use the old Flink Kubernetes
> Operator
> > > > > version
> > > > > > along with old Flink versions, I am totally in favor of this
> > proposal
> > > > to
> > > > > > reduce maintenance burden.
> > > > > >
> > > > > > Best,
> > > > > > Yang
> > > > > >
> > > > > > > On Wed, Sep 6, 2023 at 18:15, Biao Geng  wrote:
> > > > > >
> > > > > > > +1 for the proposal.
> > > > > > >
> > > > > > > Best,
> > > > > > > Biao Geng
> > > > > > >
> > > > > > > > On Wed, Sep 6, 2023 at 16:10, Gyula Fóra  wrote:
> > > > > > >
> > > > > > > > @Zhanghao Chen:
> > > > > > > >
> > > > > > > > I am not completely sure at this point what this will mean
> for
> > > 2.0
> > > > > > simply
> > > > > > > > because I am also not sure what that will mean for the
> operator
> > > as
> > > > > well
> > > > > > > :)
> > > > > > > > I think this will depend on the compatibility guarantees we
> can
> > > > > provide
> > > > > > > > across Flink major versions in general. We have to look into
> > that
> > > > and
> > > > > > > > tackle the question there independently.
> > > > > > > >
> > > > > > > > Gyula
> > > > > > > >
> > > > > > > > On Tue, Sep 5, 2023 at 6:12 PM Maximilian Michels <
> > > m...@apache.org>
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > > +1 Sounds good! Four releases give a decent amount of time
> to
> > > > > migrate
> > > > > > > > > to the next Fli

Re: [Discuss] CRD for flink sql gateway in the flink k8s operator

2023-09-14 Thread Gyula Fóra
Hi!

I don't completely understand what the content of such a CRD would be.
Could you give a minimal example of how the Flink SQL Gateway CR YAML would
look?

Adding a CRD would mean you need to add some operator/controller logic as
well. Why not simply use a Deployment / StatefulSet in Kubernetes?

Or a Helm chart if you want to make it more user friendly?

Cheers,
Gyula

On Thu, Sep 14, 2023 at 12:57 PM Dongwoo Kim  wrote:

> Hi all,
>
> I've been working on setting up a flink SQL gateway in a k8s environment
> and it got me thinking — what if we had a CRD for this?
>
> So I have quick questions below.
> 1. Is there ongoing work to create a CRD for the Flink SQL Gateway?
> 2. If not, would the community be open to considering a CRD for this?
>
> I've noticed a growing demand for simplified setup of the flink sql gateway
> in flink's slack channel.
> Implementing a CRD could make deployments easier and offer better
> integration with k8s.
>
> If this idea is accepted, I'm open to drafting a FLIP for further
> discussion
>
> Thanks for your time and looking forward to your thoughts!
>
> Best regards,
> Dongwoo
>


Re: [DISCUSS] Kubernetes Operator Flink Version Support Policy

2023-09-14 Thread Gyula Fóra
Hi Ahmed!

As I mentioned in the first email, the Flink Operator explicitly aims to
make running Flink and Flink platforms on Kubernetes easy. As most users
are platform teams supporting Flink inside a company or running a service,
it's basically always required to support several Flink versions at the
same time.

Enterprise users are in many cases using Flink versions that are older than
the last 2 minor releases (currently supported by the community). However,
the operator itself is somewhat independent of Flink, and most operator
features work across several Flink versions at the same time.

Based on this, it's relatively easy for us to support deploying to previous
Flink minor versions (within some reasonable limit). This means that as
long as platform teams keep the operator up to date, they get the latest
stability/deployment improvements while still providing compatibility for
their users' older Flink job versions.

Cheers,
Gyula
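The rolling-window policy from the quoted proposal can be sketched as follows (an illustrative helper, not operator code; the function name and signature are assumptions):

```python
def supported_flink_versions(latest_minor: int, window: int = 4):
    """Rolling support window over 1.x minors: the last `window` stable
    minor versions, per the proposed policy."""
    return [f"1.{m}" for m in range(latest_minor - window + 1, latest_minor + 1)]

# With 1.17 as the latest stable release the window covers 1.14-1.17;
# once 1.18 is released, 1.14 support is dropped.
```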

On Thu, Sep 14, 2023 at 2:59 PM Ahmed Hamdy  wrote:

> Thanks Gyula,
> +1 for the proposal in general.
> May I ask why we are interested in supporting more versions than the ones
> supported by the community?
> For example, I understand all versions prior to 1.16 are now out of
> support; why should we tie our compatibility 4 versions behind?
> Best Regards
> Ahmed Hamdy
>
>
> On Thu, 14 Sept 2023 at 12:18, ConradJam  wrote:
>
> > +1
> >
> > > Yang Wang wrote on Thu, Sep 14, 2023 at 16:15:
> >
> > > Since the users could always use the old Flink Kubernetes Operator
> > version
> > > along with old Flink versions, I am totally in favor of this proposal
> to
> > > reduce maintenance burden.
> > >
> > > Best,
> > > Yang
> > >
> > > > Biao Geng wrote on Wed, Sep 6, 2023 at 18:15:
> > >
> > > > +1 for the proposal.
> > > >
> > > > Best,
> > > > Biao Geng
> > > >
> > > > > Gyula Fóra wrote on Wed, Sep 6, 2023 at 16:10:
> > > >
> > > > > @Zhanghao Chen:
> > > > >
> > > > > I am not completely sure at this point what this will mean for 2.0
> > > simply
> > > > > because I am also not sure what that will mean for the operator as
> > well
> > > > :)
> > > > > I think this will depend on the compatibility guarantees we can
> > provide
> > > > > across Flink major versions in general. We have to look into that
> and
> > > > > tackle the question there independently.
> > > > >
> > > > > Gyula
> > > > >
> > > > > On Tue, Sep 5, 2023 at 6:12 PM Maximilian Michels 
> > > > wrote:
> > > > >
> > > > > > +1 Sounds good! Four releases give a decent amount of time to
> > migrate
> > > > > > to the next Flink version.
> > > > > >
> > > > > > On Tue, Sep 5, 2023 at 5:33 PM Őrhidi Mátyás <
> > > matyas.orh...@gmail.com>
> > > > > > wrote:
> > > > > > >
> > > > > > > +1
> > > > > > >
> > > > > > > On Tue, Sep 5, 2023 at 8:03 AM Thomas Weise 
> > > wrote:
> > > > > > >
> > > > > > > > +1, thanks for the proposal
> > > > > > > >
> > > > > > > > On Tue, Sep 5, 2023 at 8:13 AM Gyula Fóra <
> > gyula.f...@gmail.com>
> > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi All!
> > > > > > > > >
> > > > > > > > > @Maximilian Michels  has raised the
> question
> > > of
> > > > > > Flink
> > > > > > > > > version support in the operator before the last release. I
> > > would
> > > > > > like to
> > > > > > > > > open this discussion publicly so we can finalize this
> before
> > > the
> > > > > next
> > > > > > > > > release.
> > > > > > > > >
> > > > > > > > > Background:
> > > > > > > > > Currently the Flink Operator supports all Flink versions
> > since
> > > > > Flink
> > > > > > > > 1.13.
> > > > > > > > > While this is great for the users, it introduces a lot of
> > > > backward
> > > > > > > > > compatibility related code in the operator logic and also
> > adds
> > > > > > > > considerable
> > > > > > > > > time to the CI. We should strike a reasonable balance here
> > that
> > > > > > allows us
> > > > > > > > > to move forward and eliminate some of this tech debt.
> > > > > > > > >
> > > > > > > > > In the current model it is also impossible to support all
> > > > features
> > > > > > for
> > > > > > > > all
> > > > > > > > > Flink versions which leads to some confusion over time.
> > > > > > > > >
> > > > > > > > > Proposal:
> > > > > > > > > Since it's a key feature of the kubernetes operator to
> > support
> > > > > > several
> > > > > > > > > versions at the same time, I propose to support the last 4
> > > stable
> > > > > > Flink
> > > > > > > > > minor versions. Currently this would mean to support Flink
> > > > > 1.14-1.17
> > > > > > (and
> > > > > > > > > drop 1.13 support). When Flink 1.18 is released we would
> drop
> > > > 1.14
> > > > > > > > support
> > > > > > > > > and so on. Given the Flink release cadence this means
> about 2
> > > > year
> > > > > > > > support
> > > > > > > > > for each Flink version.
> > > > > > > > >
> > > > > > > > > What do you think?
> > > > > > > > >
> > > > > > > > > Cheers,
> > > > > > > > > Gyula
> > > > > > > > >
> > > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>


Re: [DISCUSS] FLIP-361: Improve GC Metrics

2023-09-13 Thread Gyula Fóra
Hi Venkata,

Unfortunately, GC CPU usage, as a percentage or otherwise, is not something
the JVM GC beans expose, so I am afraid we cannot easily add that.

Regards
Gyula
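The rate metrics discussed in this thread are derived from the cumulative counters that the JVM beans do expose (`GarbageCollectorMXBean.getCollectionCount()` and `getCollectionTime()`). A minimal sketch of that delta arithmetic, with illustrative names rather than Flink's actual metric identifiers:

```python
def gc_rate_metrics(prev, curr, interval_sec):
    """Derive window-based rate metrics from two snapshots of the cumulative
    GC counters (collection count and total collection time in ms)."""
    d_count = curr["count"] - prev["count"]
    d_time_ms = curr["time_ms"] - prev["time_ms"]
    return {
        # ms spent in GC per second of wall-clock time within the window
        "msPerSec": d_time_ms / interval_sec,
        # delta(Time) / delta(Count): average pause length in the window
        "averageTimeMs": d_time_ms / d_count if d_count else 0.0,
    }

# Xintong's example: 54s of GC inside a one-minute window, from 2 collections
m = gc_rate_metrics({"count": 0, "time_ms": 0},
                    {"count": 2, "time_ms": 54_000}, interval_sec=60)
# m["msPerSec"] == 900.0, m["averageTimeMs"] == 27000.0
```

As the thread notes, `msPerSec` alone cannot distinguish one 54s pause from two ~27s pauses, which is what the additional average-time metric addresses.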

On Wed, Sep 13, 2023 at 7:59 PM Venkatakrishnan Sowrirajan 
wrote:

> Hi Gyula,
>
> Thanks for driving this FLIP.
>
> The proposal looks good to me. My only minor suggestion: can we also
> include the % of GC time relative to the overall CPU time? This is
> especially useful for TMs, where it helps in easily identifying GC-related
> issues. Thoughts?
>
> Regards
> Venkata krishnan
>
>
> On Wed, Sep 13, 2023 at 6:13 AM Gyula Fóra  wrote:
>
> > Thanks for all the feedback, I will start the vote on this.
> >
> > Gyula
> >
> > On Wed, Sep 6, 2023 at 11:11 AM Xintong Song 
> > wrote:
> >
> > > >
> > > > I added the average time metric to the FLIP document. I also included
> > it
> > > > for the aggregate (total) across all collectors. But maybe it doesn't
> > > make
> > > > too much sense as collection times usually differ greatly depending
> on
> > > the
> > > > collector.
> > > >
> > >
> > > LGTM
> > >
> > >
> > > Best,
> > >
> > > Xintong
> > >
> > >
> > >
> > > On Wed, Sep 6, 2023 at 4:31 PM Gyula Fóra 
> wrote:
> > >
> > > > I added the average time metric to the FLIP document. I also included
> > it
> > > > for the aggregate (total) across all collectors. But maybe it doesn't
> > > make
> > > > too much sense as collection times usually differ greatly depending
> on
> > > the
> > > > collector.
> > > >
> > > > Gyula
> > > >
> > > > On Wed, Sep 6, 2023 at 10:21 AM Xintong Song 
> > > > wrote:
> > > >
> > > > > Thank you :)
> > > > >
> > > > > Best,
> > > > >
> > > > > Xintong
> > > > >
> > > > >
> > > > >
> > > > > On Wed, Sep 6, 2023 at 4:17 PM Gyula Fóra 
> > > wrote:
> > > > >
> > > > > > Makes sense Xintong, I am happy to extend the proposal with the
> > > average
> > > > > gc
> > > > > > time metric +1
> > > > > >
> > > > > > Gyula
> > > > > >
> > > > > > On Wed, Sep 6, 2023 at 10:09 AM Xintong Song <
> > tonysong...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > >
> > > > > > > > Just so I understand correctly, do you suggest adding a
> metric
> > > for
> > > > > > > > delta(Time) / delta(Count) since the last reporting ?
> > > > > > > > .TimePerGc or .AverageTime would make
> > > sense.
> > > > > > > > AverageTime may be a bit nicer :)
> > > > > > > >
> > > > > > >
> > > > > > > Yes, that's what I mean.
> > > > > > >
> > > > > > > My only concern is how useful this will be in reality. If there
> > are
> > > > > only
> > > > > > > > (or several) long pauses then the msPerSec metrics will show
> it
> > > > > > already,
> > > > > > > > and if there is a single long pause that may not be shown at
> > all
> > > if
> > > > > > there
> > > > > > > > are several shorter pauses as well with this metric.
> > > > > > >
> > > > > > >
> > > > > > > Let's say we measure this for every minute and see a 900
> msPerSec
> > > > > (which
> > > > > > > means 54s within the minute are spent on GC). This may come
> from
> > a
> > > > > single
> > > > > > > GC that lasts for 54s, or 2 GCs each lasting for ~27s, or more
> > GCs
> > > > with
> > > > > > > less time each. As the default heartbeat timeout is 50s, the
> > former
> > > > > means
> > > > > > > there's likely a heartbeat timeout due to the GC pause, while
> the
> > > > > latter
> > > > > > > means otherwise.
> > > > > > >
> > > > > > >
> > > > > > > Mathematically, it is possible tha

[VOTE] FLIP-361: Improve GC Metrics

2023-09-13 Thread Gyula Fóra
Hi All!

Thanks for all the feedback on FLIP-361: Improve GC Metrics [1][2]

I'd like to start a vote for it. The vote will be open for at least 72
hours unless there is an objection or insufficient votes.

Cheers,
Gyula

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-361%3A+Improve+GC+Metrics
[2] https://lists.apache.org/thread/qqqv54vyr4gbp63wm2d12q78m8h95xb2


Re: [DISCUSS] FLIP-361: Improve GC Metrics

2023-09-13 Thread Gyula Fóra
Thanks for all the feedback, I will start the vote on this.

Gyula

On Wed, Sep 6, 2023 at 11:11 AM Xintong Song  wrote:

> >
> > I added the average time metric to the FLIP document. I also included it
> > for the aggregate (total) across all collectors. But maybe it doesn't
> make
> > too much sense as collection times usually differ greatly depending on
> the
> > collector.
> >
>
> LGTM
>
>
> Best,
>
> Xintong
>
>
>
> On Wed, Sep 6, 2023 at 4:31 PM Gyula Fóra  wrote:
>
> > I added the average time metric to the FLIP document. I also included it
> > for the aggregate (total) across all collectors. But maybe it doesn't
> make
> > too much sense as collection times usually differ greatly depending on
> the
> > collector.
> >
> > Gyula
> >
> > On Wed, Sep 6, 2023 at 10:21 AM Xintong Song 
> > wrote:
> >
> > > Thank you :)
> > >
> > > Best,
> > >
> > > Xintong
> > >
> > >
> > >
> > > On Wed, Sep 6, 2023 at 4:17 PM Gyula Fóra 
> wrote:
> > >
> > > > Makes sense Xintong, I am happy to extend the proposal with the
> average
> > > gc
> > > > time metric +1
> > > >
> > > > Gyula
> > > >
> > > > On Wed, Sep 6, 2023 at 10:09 AM Xintong Song 
> > > > wrote:
> > > >
> > > > > >
> > > > > > Just so I understand correctly, do you suggest adding a metric
> for
> > > > > > delta(Time) / delta(Count) since the last reporting ?
> > > > > > .TimePerGc or .AverageTime would make
> sense.
> > > > > > AverageTime may be a bit nicer :)
> > > > > >
> > > > >
> > > > > Yes, that's what I mean.
> > > > >
> > > > > My only concern is how useful this will be in reality. If there are
> > > only
> > > > > > (or several) long pauses then the msPerSec metrics will show it
> > > > already,
> > > > > > and if there is a single long pause that may not be shown at all
> if
> > > > there
> > > > > > are several shorter pauses as well with this metric.
> > > > >
> > > > >
> > > > > Let's say we measure this for every minute and see a 900 msPerSec
> > > (which
> > > > > means 54s within the minute are spent on GC). This may come from a
> > > single
> > > > > GC that lasts for 54s, or 2 GCs each lasting for ~27s, or more GCs
> > with
> > > > > less time each. As the default heartbeat timeout is 50s, the former
> > > means
> > > > > there's likely a heartbeat timeout due to the GC pause, while the
> > > latter
> > > > > means otherwise.
> > > > >
> > > > >
> > > > > Mathematically, it is possible that there's 1 long pause together
> > with
> > > > > several short pauses within the same measurement period, making the
> > > long
> > > > > pause not observable with AverageTime. However, from my experience,
> > > such
> > > > a
> > > > > pattern is not normal in reality. My observation is that GCs happen
> > at
> > > a
> > > > > similar time usually take a similar length of time. Admittedly,
> this
> > is
> > > > not
> > > > > a hard guarantee.
> > > > >
> > > > >
> > > > > Best,
> > > > >
> > > > > Xintong
> > > > >
> > > > >
> > > > >
> > > > > On Wed, Sep 6, 2023 at 3:59 PM Gyula Fóra 
> > > wrote:
> > > > >
> > > > > > Matt Wang,
> > > > > >
> > > > > > I think the currently exposed info is all that is available
> through
> > > > > > GarbageCollectorMXBean. This FLIP does not aim to introduce a new
> > > more
> > > > > > granular way of reporting the per collector metrics, that would
> > > > require a
> > > > > > new mechanism and may be a breaking change.
> > > > > >
> > > > > > We basically want to simply extend the current reporting here
> with
> > > the
> > > > > rate
> > > > > > metrics and the total metrics.
> > > > > >
> > > > > > Gyula
> > > > > >
>

Re: [VOTE] FLIP-334: Decoupling autoscaler and kubernetes and support the Standalone Autoscaler

2023-09-13 Thread Gyula Fóra
+1 (binding)

Gyula

On Wed, 13 Sep 2023 at 09:33, Matt Wang  wrote:

> Thank you for driving this FLIP,
>
> +1 (non-binding)
>
>
> --
>
> Best,
> Matt Wang
>
>
>  Replied Message 
> | From | ConradJam |
> | Date | 09/13/2023 15:28 |
> | To |  |
> | Subject | Re: [VOTE] FLIP-334: Decoupling autoscaler and kubernetes and
> support the Standalone Autoscaler |
> best idea
> +1 (non-binding)
>
> Ahmed Hamdy wrote on Wed, Sep 13, 2023 at 15:23:
>
> Hi Rui,
> I have gone through the thread.
> +1 (non-binding)
>
> Best Regards
> Ahmed Hamdy
>
>
> On Wed, 13 Sept 2023 at 03:53, Rui Fan <1996fan...@gmail.com> wrote:
>
> Hi all,
>
> Thanks for all the feedback about the FLIP-334:
> Decoupling autoscaler and kubernetes and
> support the Standalone Autoscaler[1].
> This FLIP was discussed in [2].
>
> I'd like to start a vote for it. The vote will be open for at least 72
> hours (until Sep 16th 11:00 UTC+8) unless there is an objection or
> insufficient votes.
>
> [1]
>
>
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-334+%3A+Decoupling+autoscaler+and+kubernetes+and+support+the+Standalone+Autoscaler
> [2] https://lists.apache.org/thread/kmm03gls1vw4x6vk1ypr9ny9q9522495
>
> Best,
> Rui
>
>
>
>
> --
> Best
>
> ConradJam
>

