[ANNOUNCE] Apache Flink Kubernetes Operator 1.8.0 released

2024-03-25 Thread Maximilian Michels
The Apache Flink community is very happy to announce the release of
the Apache Flink Kubernetes Operator version 1.8.0.

The Flink Kubernetes Operator allows users to manage their Apache
Flink applications on Kubernetes throughout their entire
lifecycle.

Release highlights:
- Flink Autotuning automatically adjusts TaskManager memory
- Flink Autoscaling metrics and decision accuracy improved
- Improved standalone Flink Autoscaling
- Savepoint trigger nonce for savepoint-based restarts
- Operator stability improvements for cluster shutdown
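For readers trying the first highlight: Flink Autotuning is enabled through the Flink configuration carried on the FlinkDeployment resource. The sketch below is illustrative only. The `job.autoscaler.memory.tuning.enabled` key is the one exercised during release verification later in this thread; the surrounding resource fields and the `job.autoscaler.enabled` key are assumptions to check against the 1.8.0 operator documentation.

```shell
# Illustrative sketch only: resource fields and the autoscaler enable key are
# assumptions to verify against the operator 1.8.0 docs; the memory tuning key
# is the one exercised during release verification in this thread.
cat > flinkdeployment-autotuning.yaml <<'EOF'
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: autotuning-example
spec:
  flinkConfiguration:
    job.autoscaler.enabled: "true"
    job.autoscaler.memory.tuning.enabled: "true"
EOF
# Apply with: kubectl apply -f flinkdeployment-autotuning.yaml
grep -c "job.autoscaler" flinkdeployment-autotuning.yaml   # prints 2
```

With tuning enabled, the operator adjusts TaskManager memory based on observed usage instead of the static values in the pod template.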

Blog post: 
https://flink.apache.org/2024/03/21/apache-flink-kubernetes-operator-1.8.0-release-announcement/

The release is available for download at:
https://flink.apache.org/downloads.html

Maven artifacts for Flink Kubernetes Operator can be found at:
https://search.maven.org/artifact/org.apache.flink/flink-kubernetes-operator

Official Docker image for Flink Kubernetes Operator can be found at:
https://hub.docker.com/r/apache/flink-kubernetes-operator

The full release notes are available in Jira:
https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12353866&projectId=12315522

We would like to thank the Apache Flink community and its contributors
who made this release possible!

Cheers,
Max


Re: [VOTE] Apache Flink Kubernetes Operator Release 1.8.0, release candidate #1

2024-03-21 Thread Maximilian Michels
The vote is now closed.

I'm happy to announce that we have unanimously approved this release.

There are 6 approving votes, 3 of which are binding:

* Gyula Fora (binding)
* Marton Balassi (binding)
* Maximilian Michels (binding)
* Rui Fan (non-binding)
* Alexander Fedulov (non-binding)
* Mate Czagany (non-binding)

There are no disapproving votes.

Thank you all for voting!

-Max


Re: [VOTE] Apache Flink Kubernetes Operator Release 1.8.0, release candidate #1

2024-03-21 Thread Maximilian Michels
+1 (binding)

1. Verified the archives, checksums, and signatures
2. Extracted and inspected the source code for binaries
3. Compiled and tested the source code via mvn verify
4. Verified license files / headers
5. Deployed helm chart to test cluster
6. Ran example job
7. Tested autoscaling without resource requirements API
8. Tested autotuning

-Max
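The archive/checksum checks in steps 1 and 2 above can be reproduced with standard tools. The snippet below is a self-contained demonstration that substitutes a scratch file for the real release tarball so the mechanics can be run anywhere; for a real verification, download the artifacts from the dist.apache.org RC directory linked in the vote email and run the same loop (plus `gpg --verify` for the signatures).

```shell
# Demonstration of the sha512 check from the verification steps, using a
# scratch file in place of the actual release tarball.
printf 'pretend release artifact' > flink-kubernetes-operator-1.8.0-demo.tgz
sha512sum flink-kubernetes-operator-1.8.0-demo.tgz > flink-kubernetes-operator-1.8.0-demo.tgz.sha512

# Same loop the release verifiers report using against the RC artifacts:
for i in *.tgz; do
  echo "$i"
  sha512sum --check "$i.sha512"   # prints "<file>: OK" on success
done
# Signatures are checked analogously (requires the Flink KEYS file imported):
#   for i in *.tgz; do gpg --verify "$i.asc" "$i"; done
```

The source build check (step 3) is then `mvn verify` from the extracted source distribution.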

On Thu, Mar 21, 2024 at 8:50 AM Márton Balassi  wrote:
>
> +1 (binding)
>
> As per Gyula's suggestion above verified with "
> ghcr.io/apache/flink-kubernetes-operator:91d67d9 ".
>
> - Verified Helm repo works as expected, points to correct image tag, build,
> version
> - Verified basic examples + checked operator logs everything looks as
> expected
> - Verified hashes, signatures and source release contains no binaries
> - Ran built-in tests, built jars + docker image from source successfully
> - Upgraded the operator and the CRD from 1.7.0 to 1.8.0
>
> Best,
> Marton
>
> On Wed, Mar 20, 2024 at 9:10 PM Mate Czagany  wrote:
>
> > Hi,
> >
> > +1 (non-binding)
> >
> > - Verified checksums
> > - Verified signatures
> > - Verified no binaries in source distribution
> > - Verified Apache License and NOTICE files
> > - Executed tests
> > - Built container image
> > - Verified chart version and appVersion matches
> > - Verified Helm chart can be installed with default values
> > - Verified that the RC repo works as a Helm repo
> >
> > Best Regards,
> > Mate
> >
> > Alexander Fedulov  wrote (on Tue, Mar 19, 2024 at 23:10):
> >
> > > Hi Max,
> > >
> > > +1
> > >
> > > - Verified SHA checksums
> > > - Verified GPG signatures
> > > - Verified that the source distributions do not contain binaries
> > > - Verified built-in tests (mvn clean verify)
> > > - Verified build with Java 11 (mvn clean install -DskipTests -T 1C)
> > > - Verified that Helm and operator files contain Apache licenses (rg -L
> > > --files-without-match "http://www.apache.org/licenses/LICENSE-2.0" .).
> > >  I am not sure we need to
> > > include ./examples/flink-beam-example/dependency-reduced-pom.xml
> > > and ./flink-autoscaler-standalone/dependency-reduced-pom.xml though
> > > - Verified that chart and appVersion matches the target release (91d67d9)
> > > - Verified that Helm chart can be installed from the local Helm folder
> > > without overriding any parameters
> > > - Verified that Helm chart can be installed from the RC repo without
> > > overriding any parameters (
> > >
> > >
> > https://dist.apache.org/repos/dist/dev/flink/flink-kubernetes-operator-1.8.0-rc1
> > > )
> > > - Verified docker container build
> > >
> > > Best,
> > > Alex
> > >
> > >
> > > On Mon, 18 Mar 2024 at 20:50, Maximilian Michels  wrote:
> > >
> > > > @Rui @Gyula Thanks for checking the release!
> > > >
> > > > >A minor correction is that [3] in the email should point to:
> > > > >ghcr.io/apache/flink-kubernetes-operator:91d67d9 . But the helm chart
> > > and
> > > > > everything is correct. It's a typo in the vote email.
> > > >
> > > > Good catch. Indeed, for the linked Docker image 8938658 points to
> > > > HEAD^ of the rc branch, 91d67d9 is the HEAD. There are no code changes
> > > > between those two commits, except for updating the version. So the
> > > > votes are not impacted, especially because votes are cast against
> > > > the source release which, as you pointed out, contains the correct
> > > > image ref.
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > On Mon, Mar 18, 2024 at 9:54 AM Gyula Fóra 
> > wrote:
> > > > >
> > > > > Hi Max!
> > > > >
> > > > > +1 (binding)
> > > > >
> > > > >  - Verified source release, helm chart + checksums / signatures
> > > > >  - Helm points to correct image
> > > > >  - Deployed operator, stateful example and executed upgrade +
> > savepoint
> > > > > redeploy
> > > > >  - Verified logs
> > > > >  - Flink web PR looks good +1
> > > > >
> > > > > A minor correction is that [3] in the email should point to:
> > > > > ghcr.io/apache/flink-kubernetes-operator:91d67d9 . But the helm chart and
> > > > > everything is correct. It's a typo in the vote email.

Re: [VOTE] Apache Flink Kubernetes Operator Release 1.8.0, release candidate #1

2024-03-18 Thread Maximilian Michels
@Rui @Gyula Thanks for checking the release!

>A minor correction is that [3] in the email should point to:
>ghcr.io/apache/flink-kubernetes-operator:91d67d9 . But the helm chart and
> everything is correct. It's a typo in the vote email.

Good catch. Indeed, for the linked Docker image 8938658 points to
HEAD^ of the rc branch, 91d67d9 is the HEAD. There are no code changes
between those two commits, except for updating the version. So the
votes are not impacted, especially because votes are cast against
the source release which, as you pointed out, contains the correct
image ref.










On Mon, Mar 18, 2024 at 9:54 AM Gyula Fóra  wrote:
>
> Hi Max!
>
> +1 (binding)
>
>  - Verified source release, helm chart + checksums / signatures
>  - Helm points to correct image
>  - Deployed operator, stateful example and executed upgrade + savepoint
> redeploy
>  - Verified logs
>  - Flink web PR looks good +1
>
> A minor correction is that [3] in the email should point to:
> ghcr.io/apache/flink-kubernetes-operator:91d67d9 . But the helm chart and
> everything is correct. It's a typo in the vote email.
>
> Thank you for preparing the release!
>
> Cheers,
> Gyula
>
> On Mon, Mar 18, 2024 at 8:26 AM Rui Fan <1996fan...@gmail.com> wrote:
>
> > Thanks Max for driving this release!
> >
> > +1(non-binding)
> >
> > - Downloaded artifacts from dist ( svn co
> >
> > https://dist.apache.org/repos/dist/dev/flink/flink-kubernetes-operator-1.8.0-rc1/
> > )
> > - Verified SHA512 checksums : ( for i in *.tgz; do echo $i; sha512sum
> > --check $i.sha512; done )
> > - Verified GPG signatures : ( for i in *.tgz; do echo $i; gpg --verify
> > $i.asc $i; done )
> > - Build the source with java-11 and java-17 ( mvn -T 20 clean install
> > -DskipTests )
> > - Verified the license header during build the source
> > - Verified that chart and appVersion matches the target release (less the
> > index.yaml and Chart.yaml )
> > - RC repo works as Helm repo( helm repo add flink-operator-repo-1.8.0-rc1
> >
> > https://dist.apache.org/repos/dist/dev/flink/flink-kubernetes-operator-1.8.0-rc1/
> > )
> > - Verified Helm chart can be installed  ( helm install
> > flink-kubernetes-operator
> > flink-operator-repo-1.8.0-rc1/flink-kubernetes-operator --set
> > webhook.create=false )
> > - Submitted the autoscaling demo, the autoscaler works well with *memory
> > tuning* (kubectl apply -f autoscaling.yaml)
> >- job.autoscaler.memory.tuning.enabled: "true"
> > - Download Autoscaler standalone: wget
> >
> > https://repository.apache.org/content/repositories/orgapacheflink-1710/org/apache/flink/flink-autoscaler-standalone/1.8.0/flink-autoscaler-standalone-1.8.0.jar
> > - Ran Autoscaler standalone locally, it works well with rescale api and
> > JDBC state store/event handler
> >
> > Best,
> > Rui
> >

[VOTE] Apache Flink Kubernetes Operator Release 1.8.0, release candidate #1

2024-03-14 Thread Maximilian Michels
Hi everyone,

Please review and vote on the release candidate #1 for the version
1.8.0 of the Apache Flink Kubernetes Operator, as follows:

[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific comments)

**Release Overview**

As an overview, the release consists of the following:
a) Kubernetes Operator canonical source distribution (including the
Dockerfile), to be deployed to the release repository at
dist.apache.org
b) Kubernetes Operator Helm Chart to be deployed to the release
repository at dist.apache.org
c) Maven artifacts to be deployed to the Maven Central Repository
d) Docker image to be pushed to Dockerhub

**Staging Areas to Review**

The staging areas containing the above mentioned artifacts are as
follows, for your review:
* All artifacts for (a), (b) can be found in the corresponding dev
repository at dist.apache.org [1]
* All artifacts for (c) can be found at the Apache Nexus Repository [2]
* The docker image for (d) is staged on github [3]

All artifacts are signed with the key
DA359CBFCEB13FC302A8793FB655E6F7693D5FDE [4]

Other links for your review:
* JIRA release notes [5]
* source code tag "release-1.8.0-rc1" [6]
* PR to update the website Downloads page to include Kubernetes
Operator links [7]

**Vote Duration**

The voting time will run for at least 72 hours. It is adopted by
majority approval, with at least 3 PMC affirmative votes.

**Note on Verification**

You can follow the basic verification guide here [8]. Note that you
don't need to verify everything yourself, but please make a note of what
you have tested together with your +1/-1 vote.

Thanks,
Max

[1] 
https://dist.apache.org/repos/dist/dev/flink/flink-kubernetes-operator-1.8.0-rc1/
[2] https://repository.apache.org/content/repositories/orgapacheflink-1710/
[3] ghcr.io/apache/flink-kubernetes-operator:8938658
[4] https://dist.apache.org/repos/dist/release/flink/KEYS
[5] 
https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12353866&projectId=12315522
[6] https://github.com/apache/flink-kubernetes-operator/tree/release-1.8.0-rc1
[7] https://github.com/apache/flink-web/pull/726
[8] 
https://cwiki.apache.org/confluence/display/FLINK/Verifying+a+Flink+Kubernetes+Operator+Release
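For reference, the Helm-based checks reported in this thread come down to pointing Helm at the RC dist directory [1] and installing the chart from it. The commands below mirror what the voters describe running; network access to dist.apache.org and a test Kubernetes cluster are assumed, and the `webhook.create=false` flag is optional (it was used by one voter to skip the admission webhook).

```shell
# Mirrors the Helm verification commands reported by voters in this thread.
helm repo add flink-operator-repo-1.8.0-rc1 \
  https://dist.apache.org/repos/dist/dev/flink/flink-kubernetes-operator-1.8.0-rc1/
helm install flink-kubernetes-operator \
  flink-operator-repo-1.8.0-rc1/flink-kubernetes-operator \
  --set webhook.create=false
```

After installation, voters checked the chart `version`/`appVersion` against the RC commit and inspected the operator logs.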


Re: [DISCUSS] Kubernetes Operator 1.8.0 release planning

2024-03-14 Thread Maximilian Michels
Hey everyone,

I don't see any immediate blockers, so I'm going to start the release process.

Thanks,
Max

On Tue, Feb 20, 2024 at 8:55 PM Maximilian Michels  wrote:
>
> Hey Rui, hey Ryan,
>
> Good points. Non-committers can't directly release but they can assist
> with the release. It would be great to get help from both of you in
> the release process.
>
> I'd be happy to be the release manager for the 1.8 release. As for the
> timing, I think we need to reach consensus in which form to include
> the new memory tuning. Also, considering that Gyula just merged a
> pretty big improvement / refactor of the metric collection code, we
> might want to give it another week. I would target the end of February
> to begin with the release process.
>
> Cheers,
> Max
>
> On Sun, Feb 18, 2024 at 4:48 AM Rui Fan <1996fan...@gmail.com> wrote:
> >
> > Thanks Max and Ryan for the volunteering.
> >
> > To Ryan:
> >
> > I'm not sure whether non-flink-committers have permission to release.
> > If I remember correctly, multiple steps of the release process[1] need
> > the apache account, such as: Apache GPG key and Apache Nexus.
> >
> > If the release process needs the committer permission, feel free to
> > assist this release, thanks~
> >
> > To all:
> >
> > Max is one of the very active contributors to the
> > flink-kubernetes-operator
> > project, and he hasn't managed a release before. So Max as the release
> > manager makes sense to me.
> >
> > I can assist this release if all of you don't mind. In particular,
> > Autoscaler Standalone 1.8.0 is much improved compared to 1.7.0,
> > and I can help write the related Release note. Besides, I can help
> > check and test this release.
> >
> > [1]
> > https://cwiki.apache.org/confluence/display/FLINK/Creating+a+Flink+Kubernetes+Operator+Release
> >
> > Best,
> > Rui
> >
> > On Wed, Feb 7, 2024 at 11:01 PM Ryan van Huuksloot <
> > ryan.vanhuuksl...@shopify.com> wrote:
> >
> > > I can volunteer to be a release manager. I haven't done it for
> > > Apache/Flink or the operator before so I may be a good candidate.
> > >
> > > Ryan van Huuksloot
> > > Sr. Production Engineer | Streaming Platform
> > > [image: Shopify]
> > > <https://www.shopify.com/?utm_medium=salessignatures&utm_source=hs_email>
> > >
> > >
> > > On Wed, Feb 7, 2024 at 6:06 AM Maximilian Michels  wrote:
> > >
> > >> It's very considerate that you want to volunteer to be the release
> > >> manager, but given that you have already managed one release, I would
> > >> ideally like somebody else to do it. Personally, I haven't managed an
> > >> operator release, although I've done it for Flink itself in the past.
> > >> Nevertheless, it would be nice to have somebody new to the process.
> > >>
> > >> Anyone reading this who wants to try being a release manager, please
> > >> don't be afraid to volunteer. Of course we'll be able to assist. That
> > >> would also be a good opportunity for us to update the docs regarding
> > >> the release process.
> > >>
> > >> Cheers,
> > >> Max
> > >>
> > >>
> > >> On Wed, Feb 7, 2024 at 10:08 AM Rui Fan <1996fan...@gmail.com> wrote:
> > >> >
> > >> > If the release is postponed 1-2 more weeks, I could volunteer
> > >> > as the one of the release managers.
> > >> >
> > >> > Best,
> > >> > Rui
> > >> >
> > >> > On Wed, Feb 7, 2024 at 4:54 AM Gyula Fóra  wrote:
> > >> >>
> > >> >> Given the proposed timeline was a bit short / rushed I agree with Max
> > >> that
> > >> >> we could wait 1-2 more weeks to wrap up the current outstanding bigger
> > >> >> features around memory tuning and the JDBC state store.
> > >> >>
> > >> >> In the meantime it would be great to involve 1-2 new committers (or
> > >> other
> > >> >> contributors) in the operator release process so that we have some
> > >> fresh
> > >> >> eyes on the process.
> > >> >> Would anyone be interested in volunteering to help with the next
> > >> release?
> > >> >>
> > >> >> Cheers,
> > >> >> Gyula

Re: Flink Kubernetes Operator Failing Over FlinkDeployments to a New Cluster

2024-03-13 Thread Maximilian Michels
Hi Kevin,

Theoretically, as long as you move over all k8s resources, failover
should work fine on the Flink and Flink Operator side. The tricky part
is the handover. You will need to backup all resources from the old
cluster, shutdown the old cluster, then re-create them on the new
cluster. The operator deployment and the Flink cluster should then
recover fine (assuming that high availability has been configured and
checkpointing is done to persistent storage available in the new
cluster). The operator state / Flink state is actually kept in
ConfigMaps which would be part of the resource dump.

This method has proven to work in case of Kubernetes cluster upgrades.
Moving to an entirely new cluster is a bit more involved but exporting
all resource definitions and re-importing them into the new cluster
should yield the same result as long as the checkpoint paths do not
change.
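The export/re-import handover described above can be sketched with plain kubectl. This is a hedged outline, not a tested procedure: namespaces, resource selections, and context names are placeholders, and anything cluster-external (HA storage, checkpoint buckets) must already be reachable from the new cluster.

```shell
# Hedged sketch of the handover; names, namespaces and contexts are placeholders.
# 1. Export the operator-managed resources (FlinkDeployments plus the HA /
#    operator ConfigMaps that hold the state mentioned above) from the old cluster.
kubectl --context "$OLD_CLUSTER" -n flink \
  get flinkdeployments,configmaps -o yaml > backup.yaml

# 2. Shut down the old cluster (or at least the operator and the jobs) so two
#    operators never act on the same state.

# 3. Re-create everything on the new cluster; checkpoint paths must not change.
kubectl --context "$NEW_CLUSTER" -n flink apply -f backup.yaml
```

Because the operator and Flink HA state live in ConfigMaps, the re-created deployments should resume from the last checkpoint without setting `initialSavepointPath` manually.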

Probably something worth trying :)

-Max



On Wed, Mar 6, 2024 at 9:09 PM Kevin Lam  wrote:
>
> Another thought could be modifying the operator to have a behaviour where
> upon first deploy, it optionally (flag/param enabled) finds the most recent
> snapshot and uses that as the initialSavepointPath to restore and run the
> Flink job.
>
> On Wed, Mar 6, 2024 at 2:07 PM Kevin Lam  wrote:
>
> > Hi there,
> >
> > We use the Flink Kubernetes Operator, and I am investigating how we can
> > easily support failing over a FlinkDeployment from one Kubernetes Cluster
> > to another in the case of an outage that requires us to migrate a large
> > number of FlinkDeployments from one K8s cluster to another.
> >
> > I understand one way to do this is to set `initialSavepointPath` on all the
> > FlinkDeployments to the most recent/appropriate snapshot so the jobs
> > continue from where they left off, but for a large number of jobs, this
> > would be quite a bit of manual labor.
> >
> > Do others have an approach they are using? Any advice?
> >
> > Could this be something addressed in a future FLIP? Perhaps we could store
> > some kind of metadata in object storage so that the Flink Kubernetes
> > Operator can restore a FlinkDeployment from where it left off, even if the
> > job is shifted to another Kubernetes Cluster.
> >
> > Looking forward to hearing folks' thoughts!
> >


Re: [DISCUSS] Add "Special Thanks" Page on the Flink Website

2024-03-11 Thread Maximilian Michels
"Special Thanks" Page on the Flink Website
> > > > >
> > > > > Hi Max,
> > > > >
> > > > > Thank you for your input.
> > > > >
> > > > > According to ASF policy[1], the Thanks page is intended to thank third
> > > > > parties
> > > > > that provide physical resources like machines, services, and software
> > > > that
> > > > > the committers
> > > > >  or the project truly needs. I agree with Tison, such donation is
> > > > countable
> > > > > and that's why
> > > > > I started this discussion to collect the full list. The Thanks page is
> > > not
> > > > > intended to thank working
> > > > > hours or contributions from individual volunteers which I think
> > > > > is recognized in other ways
> > > > > (e.g., credit of committer and PMC member).
> > > > >
> > > > > Best,
> > > > > Jark
> > > > >
> > > > > [1]: https://www.apache.org/foundation/marks/linking#projectthanks
> > > > >
> > > > > On Wed, 6 Mar 2024 at 01:14, tison  wrote:
> > > > >
> > > > > > Hi Max,
> > > > > >
> > > > > > Thanks for sharing your concerns :D
> > > > > >
> > > > > > I'd elaborate a bit on this topic with an example, that Apache
> > > Airflow
> > > > > > has a small section for its special sponsor who offers machines for
> > > CI
> > > > > > also [1].
> > > > > >
> > > > > > In my understanding, companies employ developers to invest time in
> > > > > > the development of Flink, and that contribution is large, vague, and
> > > > > > hard to attribute fairly across all of the companies.
> > > > > >
> > > > > > However, physical resources like CI machines are countable and they
> > > > > > help the sustainability of our project in a rare way different than
> > > > > > individuals (few individuals can donate such resources). We can
> > > > > > maintain such a section or page for those sponsors so that it also
> > > > > > decreases the friction when the company asks "what we can gain"
> > (for
> > > > > > explicit credits, at least, and easy understanding).
> > > > > >
> > > > > > Any entity is welcome to add themselves as long as it's valid.
> > > > > >
> > > > > > For the fair part, I'm not an employee of both companies listed on
> > > the
> > > > > > demo page and I don't feel uncomfortable. Those companies do
> > invest a
> > > > > > lot on our project and I'd regard it as a chance to encourage other
> > > > > > companies to follow.
> > > > > >
> > > > > > Best,
> > > > > > tison.
> > > > > >
> > > > > > [1] https://github.com/apache/airflow?tab=readme-ov-file#sponsors
> > > > > >
> > > > > > Maximilian Michels  于2024年3月6日周三 00:49写道:
> > > > > > >
> > > > > > > I'm a bit torn on this idea. On the one hand, it makes sense to
> > > thank
> > > > > > > sponsors and entities who have supported Flink in the past. On
> > > other
> > > > > > > hand, this list is bound to be incomplete and maybe also biased,
> > > even
> > > > > > > if not intended to be so. I think the power of open-source comes
> > > from
> > > > > > > the unconditional donation of code and knowledge. Infrastructure
> > > > costs
> > > > > > > are a reality and donations in that area are meaningful, but they
> > > are
> > > > > > > just one piece of the total sum which consists of many volunteers
> > > and
> > > > > > > working hours. In my eyes, a Thank You page would have to display
> > > > each
> > > > > > > entity fairly which is going to be hard to achieve.
> > > > > > >
> > > > > > > -Max
> > > > > > >
> > > > > > > On Tue, Mar 5, 2024 at 2:30 PM Jingsong Li <
> > jingsongl...@gmail.com
> > > >
> > > > > > wrote:
> > > > >

Re: [DISCUSS] FLIP-433: State Access on DataStream API V2

2024-03-11 Thread Maximilian Michels
The FLIP mentions: "The contents described in this FLIP are all new
APIs and do not involve compatibility issues."

In this thread it looks like the plan is to remove the old state
declaration API. I think we should consider keeping the old APIs to
avoid breaking too many jobs. The new APIs will still be beneficial
for new jobs, e.g. for SQL jobs.

-Max

On Fri, Mar 8, 2024 at 4:39 AM Zakelly Lan  wrote:
>
> Hi Weijie,
>
> Thanks for your answer! Well I get your point. Since partitions are
> first-class citizens, and redistribution means how states migrate when
> partitions change, I'd be fine with deemphasizing the concept of
> keyed/operator state if we highlight the definition of partition in the
> document. Keeping `RedistributionMode` under `StateDeclaration` is also
> fine with me, as I guess it is only for internal usage.
> But still, from a user's point of view,  state can be characterized along
> two relatively independent dimensions, how states redistribute and the data
> structure. Thus I still suggest a chained-like configuration API that
> configures one aspect on each call, such as:
> ```
> # Keyed stream, no redistribution mode specified, the state will go with
> partition (no redistribution).  Keyed state
> StateDeclaration a = States.declare(name).listState(type);
>
> # Keyed stream, redistribution strategy specified, the state follows the
> specified redistribute strategy.   Operator state
> StateDeclaration b =
> States.declare(name).listState(type).redistributeBy(strategy);
>
> # Non-keyed stream, redistribution strategy *must be* specified.
> StateDeclaration c =
> States.declare(name).listState(type).redistributeBy(strategy);
>
> # Broadcast stream and state
> StateDeclaration d = States.declare(name).mapState(typeK,
> typeV).broadcast();
> ```
> It can drive users to think about redistribution issues when needed. And it
> also provides more flexibility to add more combinations such as
> broadcasting list state, or chain more configurable aspects such as adding
> `withTtl()` in future. WDYT?
>
>
> Best,
> Zakelly
>
> On Thu, Mar 7, 2024 at 6:04 PM weijie guo  wrote:
>
> > Hi Jinzhong,
> >
> > Thanks for the reply!
> >
> > > Overall, I think that the “Eager State Declaration” is a good proposal,
> > which can enhance Flink's state management capabilities and provide
> > possibilities for subsequent state optimizations.
> >
> > It's nice to see that people who are familiar with the state stuff like
> > this proposal. :)
> >
> > >  When the user attempts to access an undeclared state at runtime, it is
> > more reasonable to throw an exception rather than returning Option#empty,
> > as Gyula mentioned above.
> >
> > Yes, I agree that this is better then a confused empty, and I have modified
> > the corresponding part of this FLIP.
> >
> > > In addition, I'm not quite sure whether all of the existing usage in
> > which states are registered at runtime dynamically can be migrated to the
> > "Eager State Declaration" style with minimal cost?
> >
> > I think for most user functions, this is fairly straightforward to migrate.
> > But states whose declarations depend on runtime information(e.g.
> > RuntimeContext) are, in principle, not supported in the new API. Anyway,
> > the old and new apis are completely incompatible, so rewriting jobs is
> > inevitable. User can think about how to write a good process function that
> > conforms to the eager declaration style.
> >
> > > For state TTL, should StateDeclaration also provide interfaces for users
> > to declare state ttl?
> >
> > Of course, we can and we need to provide this one. But whether or not it's
> > in this FLIP isn't very important for me, because we're mainly talking
> > about the general principles and ways of declaring and accessing state in
> > this FLIP. I promise we won't leave it out in the end :D.
> >
> >
> >
> > Best regards,
> >
> > Weijie
> >
> >
> > Jinzhong Li  于2024年3月7日周四 17:34写道:
> >
> > > Hi Weijie,
> > >
> > > Thanks for driving this!
> > >
> > > 1. Overall, I think that the “Eager State Declaration” is a good
> > proposal,
> > > which can enhance Flink's state management capabilities and provide
> > > possibilities for subsequent state optimizations.
> > >
> > > 2. When the user attempts to access an undeclared state at runtime, it is
> > > more reasonable to throw an exception rather than returning Option#empty,
> > > as Gyula mentioned above.
> > > In addition, I'm not quite sure whether all of the existing usage in
> > which
> > > states are registered at runtime dynamically can be migrated to the
> > "Eager
> > > State Declaration" style with minimal cost?
> > >
> > > 3. For state TTL, should StateDeclaration also provide interfaces for
> > users
> > > to declare state ttl?
> > >
> > > Best,
> > > Jinzhong Li
> > >
> > >
> > > On Thu, Mar 7, 2024 at 5:08 PM weijie guo 
> > > wrote:
> > >
> > > > Hi Hangxiang,
> > > >
> > > > Thanks for your reply!
> > > >
> > > > > We have also discussed in FLIP-359/FLINK

Re: [VOTE] FLIP-314: Support Customized Job Lineage Listener

2024-03-11 Thread Maximilian Michels
+1 (binding)

Max


On Thu, Feb 29, 2024 at 4:24 AM Hang Ruan  wrote:
>
> +1 (non-binding)
>
> Best,
> Hang
>
> weijie guo  于2024年2月29日周四 09:55写道:
>
> > +1 (binding)
> >
> > Best regards,
> >
> > Weijie
> >
> >
> > Feng Jin  于2024年2月29日周四 09:37写道:
> >
> > > +1 (non-binding)
> > >
> > > Best,
> > > Feng Jin
> > >
> > > > On Thu, Feb 29, 2024 at 4:41 AM Márton Balassi wrote:
> > >
> > > > +1 (binding)
> > > >
> > > > Marton
> > > >
> > > > On Wed, Feb 28, 2024 at 5:14 PM Gyula Fóra 
> > wrote:
> > > >
> > > > > +1 (binding)
> > > > >
> > > > > Gyula
> > > > >
> > > > > On Wed, Feb 28, 2024 at 11:10 AM Maciej Obuchowski <
> > > > mobuchow...@apache.org
> > > > > >
> > > > > wrote:
> > > > >
> > > > > > +1 (non-binding)
> > > > > >
> > > > > > Best,
> > > > > > Maciej Obuchowski
> > > > > >
> > > > > > On Wed, 28 Feb 2024 at 10:29, Zhanghao Chen wrote:
> > > > > >
> > > > > > > +1 (non-binding)
> > > > > > >
> > > > > > > Best,
> > > > > > > Zhanghao Chen
> > > > > > > 
> > > > > > > From: Yong Fang 
> > > > > > > Sent: Wednesday, February 28, 2024 10:12
> > > > > > > To: dev 
> > > > > > > Subject: [VOTE] FLIP-314: Support Customized Job Lineage Listener
> > > > > > >
> > > > > > > Hi devs,
> > > > > > >
> > > > > > > I would like to restart a vote about FLIP-314: Support Customized
> > > Job
> > > > > > > Lineage Listener[1].
> > > > > > >
> > > > > > > Previously, we added lineage related interfaces in FLIP-314.
> > Before
> > > > the
> > > > > > > interfaces were developed and merged into the master, @Maciej and
> > > > > > > @Zhenqiu provided valuable suggestions for the interface from the
> > > > > > > perspective of the lineage system. So we updated the interfaces
> > of
> > > > > > FLIP-314
> > > > > > > and discussed them again in the discussion thread [2].
> > > > > > >
> > > > > > > So I am here to initiate a new vote on FLIP-314, the vote will be
> > > > open
> > > > > > for
> > > > > > > at least 72 hours unless there is an objection or insufficient
> > > votes
> > > > > > >
> > > > > > > [1]
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-314%3A+Support+Customized+Job+Lineage+Listener
> > > > > > > [2]
> > > https://lists.apache.org/thread/wopprvp3ww243mtw23nj59p57cghh7mc
> > > > > > >
> > > > > > > Best,
> > > > > > > Fang Yong
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >


Re: [DISCUSS] Add "Special Thanks" Page on the Flink Website

2024-03-05 Thread Maximilian Michels
I'm a bit torn on this idea. On the one hand, it makes sense to thank
sponsors and entities who have supported Flink in the past. On the
other hand, this list is bound to be incomplete and maybe also biased, even
if not intended to be so. I think the power of open-source comes from
the unconditional donation of code and knowledge. Infrastructure costs
are a reality and donations in that area are meaningful, but they are
just one piece of the total sum which consists of many volunteers and
working hours. In my eyes, a Thank You page would have to display each
entity fairly which is going to be hard to achieve.

-Max

On Tue, Mar 5, 2024 at 2:30 PM Jingsong Li  wrote:
>
> +1 for setting up
>
> On Tue, Mar 5, 2024 at 5:39 PM Jing Ge  wrote:
> >
> > +1 and thanks for the proposal!
> >
> > Best regards,
> > Jing
> >
> > On Tue, Mar 5, 2024 at 10:26 AM tison  wrote:
> >
> > > I like this idea, so +1 for setting up.
> > >
> > > For anyone who have the access, this is a related thread about
> > > project-wise sponsor in the foundation level [1].
> > >
> > > Best,
> > > tison.
> > >
> > > [1] https://lists.apache.org/thread/2nv0x9gfk9lfnpb2315xgywyx84y97v6
> > >
> > > > Jark Wu wrote on Tue, Mar 5, 2024 at 17:17:
> > > >
> > > > Sorry, I posted the wrong [7] link. The Flink benchmark ML link is:
> > > > https://lists.apache.org/thread/bkw6ozoflgltwfwmzjtgx522hyssfko6
> > > >
> > > >
> > > > On Tue, 5 Mar 2024 at 16:56, Jark Wu  wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > >
> > > > >
> > > > > I want to propose adding a "Special Thanks" page to our Apache Flink
> > > > > website [1] to honor and appreciate the companies and organizations
> > > > > that have sponsored machines or services for our project. The
> > > > > establishment of such a page serves as a public acknowledgment of our
> > > > > sponsors' contributions and simultaneously acts as a positive
> > > > > encouragement for other entities to consider supporting our project.
> > > > >
> > > > > Adding Per-Project Thanks pages is allowed by ASF policy [2], which
> > > > > says "PMCs may wish to provide recognition for third parties that
> > > > > provide software or services to the project's committers to further
> > > > > the goals of the project. These are typically called Per-Project
> > > > > Thanks pages". Many Apache projects have added such pages, for
> > > > > example, Apache HBase [3] and Apache Mina [4].
> > > > >
> > > > > To initiate this idea, I have drafted a preliminary page under the
> > > > > "About" menu on the Flink website to specifically thank Alibaba and
> > > > > Ververica, by following the ASF guidelines and the Apache Mina
> > > > > project.
> > > > >
> > > > > page image:
> > > > > https://github.com/apache/flink/assets/5378924/e51aaffe-565e-46d1-90af-3900904afcc0
> > > > >
> > > > > The companies below are on the thanks list for their donations to
> > > > > Flink testing infrastructure:
> > > > >
> > > > > - Alibaba donated 8 machines (32vCPU, 64GB) for running Flink CI
> > > > > builds [5].
> > > > > - Ververica donated 2 machines for hosting flink-ci repositories [6]
> > > > > and running Flink benchmarks [7].
> > > > >
> > > > > I may have missed some other donations or companies; please add them
> > > > > if you know of any.
> > > > >
> > > > > Looking forward to your feedback about this proposal!
> > > > >
> > > > >
> > > > > Best,
> > > > >
> > > > > Jark
> > > > >
> > > > >
> > > > > [1]: https://flink.apache.org/
> > > > >
> > > > > [2]: https://www.apache.org/foundation/marks/linking#projectthanks
> > > > >
> > > > > [3]: https://hbase.apache.org/sponsors.html
> > > > >
> > > > > [4]: https://mina.apache.org/special-thanks.html
> > > > >
> > > > > [5]:
> > > > >
> > > https://cwiki.apache.org/confluence/display/FLINK/Azure+Pipelines#AzurePipelines-AvailableCustomBuildMachines
> > > > >
> > > > > [6]:
> > > > >
> > > https://cwiki.apache.org/confluence/display/FLINK/Continuous+Integration
> > > > >
> > > > > [7]:
> > > > >
> > > https://lists.apache.org/thread.html/41a68c775753a7841896690c75438e0a497634102e676db880f30225@%3Cdev.flink.apache.org%3E
> > > > >
> > >


[jira] [Created] (FLINK-34540) Tune number of task slots

2024-02-28 Thread Maximilian Michels (Jira)
Maximilian Michels created FLINK-34540:
--

 Summary: Tune number of task slots
 Key: FLINK-34540
 URL: https://issues.apache.org/jira/browse/FLINK-34540
 Project: Flink
  Issue Type: Sub-task
  Components: Autoscaler, Kubernetes Operator
Reporter: Maximilian Michels
Assignee: Maximilian Michels
 Fix For: kubernetes-operator-1.8.0


Adjustments similar to FLINK-34152, but simpler because we only need to adjust 
heap memory and metaspace for the JobManager.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-34539) Tune JobManager memory of autoscaled jobs

2024-02-28 Thread Maximilian Michels (Jira)
Maximilian Michels created FLINK-34539:
--

 Summary: Tune JobManager memory of autoscaled jobs
 Key: FLINK-34539
 URL: https://issues.apache.org/jira/browse/FLINK-34539
 Project: Flink
  Issue Type: Sub-task
  Components: Autoscaler, Kubernetes Operator
Reporter: Maximilian Michels
Assignee: Maximilian Michels
 Fix For: kubernetes-operator-1.8.0


The current autoscaling algorithm adjusts the parallelism of the job task 
vertices according to the processing needs. By adjusting the parallelism, we 
systematically scale the amount of CPU for a task. At the same time, we also 
indirectly change the amount of memory tasks have at their disposal. However, 
there are some problems with this.
 # Memory is overprovisioned: On scale up we may add more memory than we 
actually need. Even on scale down, the memory / cpu ratio can still be off and 
too much memory is used.
 # Memory is underprovisioned: For stateful jobs, we risk running into 
OutOfMemoryErrors on scale down. Even before running out of memory, too little 
memory can have a negative impact on the effectiveness of the scaling.

We lack the capability to tune memory proportionally to the processing needs. 
In the same way that we measure CPU usage and size the tasks accordingly, we 
need to evaluate memory usage and adjust the heap memory size.

https://docs.google.com/document/d/19GXHGL_FvN6WBgFvLeXpDABog2H_qqkw1_wrpamkFSc/edit

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-34538) Tune memory of autoscaled jobs

2024-02-28 Thread Maximilian Michels (Jira)
Maximilian Michels created FLINK-34538:
--

 Summary: Tune memory of autoscaled jobs
 Key: FLINK-34538
 URL: https://issues.apache.org/jira/browse/FLINK-34538
 Project: Flink
  Issue Type: New Feature
  Components: Autoscaler, Kubernetes Operator
Reporter: Maximilian Michels
Assignee: Maximilian Michels
 Fix For: kubernetes-operator-1.8.0


The current autoscaling algorithm adjusts the parallelism of the job task 
vertices according to the processing needs. By adjusting the parallelism, we 
systematically scale the amount of CPU for a task. At the same time, we also 
indirectly change the amount of memory tasks have at their disposal. However, 
there are some problems with this.
 # Memory is overprovisioned: On scale up we may add more memory than we 
actually need. Even on scale down, the memory / cpu ratio can still be off and 
too much memory is used.
 # Memory is underprovisioned: For stateful jobs, we risk running into 
OutOfMemoryErrors on scale down. Even before running out of memory, too little 
memory can have a negative impact on the effectiveness of the scaling.

We lack the capability to tune memory proportionally to the processing needs. 
In the same way that we measure CPU usage and size the tasks accordingly, we 
need to evaluate memory usage and adjust the heap memory size.

https://docs.google.com/document/d/19GXHGL_FvN6WBgFvLeXpDABog2H_qqkw1_wrpamkFSc/edit

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] Kubernetes Operator 1.8.0 release planning

2024-02-20 Thread Maximilian Michels
Hey Rui, hey Ryan,

Good points. Non-committers can't directly release but they can assist
with the release. It would be great to get help from both of you in
the release process.

I'd be happy to be the release manager for the 1.8 release. As for the
timing, I think we need to reach consensus in which form to include
the new memory tuning. Also, considering that Gyula just merged a
pretty big improvement / refactor of the metric collection code, we
might want to give it another week. I would target the end of February
to begin with the release process.

Cheers,
Max

On Sun, Feb 18, 2024 at 4:48 AM Rui Fan <1996fan...@gmail.com> wrote:
>
> Thanks Max and Ryan for the volunteering.
>
> To Ryan:
>
> I'm not sure whether non-Flink-committers have permission to release.
> If I remember correctly, multiple steps of the release process [1] require
> an Apache account, e.g. for the Apache GPG key and Apache Nexus.
>
> If the release process requires committer permissions, feel free to
> assist with this release, thanks~
>
> To all:
>
> Max is one of the very active contributors to the
> flink-kubernetes-operator
> project, and he hasn't managed a release before. So Max as the release manager
> makes sense to me.
>
> I can assist with this release if none of you mind. In particular,
> Autoscaler Standalone 1.8.0 is much improved compared to 1.7.0,
> and I can help write the related release notes. Besides, I can help
> check and test this release.
>
> [1]
> https://cwiki.apache.org/confluence/display/FLINK/Creating+a+Flink+Kubernetes+Operator+Release
>
> Best,
> Rui
>
> On Wed, Feb 7, 2024 at 11:01 PM Ryan van Huuksloot <
> ryan.vanhuuksl...@shopify.com> wrote:
>
> > I can volunteer to be a release manager. I haven't done it for
> > Apache/Flink or the operator before so I may be a good candidate.
> >
> > Ryan van Huuksloot
> > Sr. Production Engineer | Streaming Platform
> >
> >
> > On Wed, Feb 7, 2024 at 6:06 AM Maximilian Michels  wrote:
> >
> >> It's very considerate that you want to volunteer to be the release
> >> manager, but given that you have already managed one release, I would
> >> ideally like somebody else to do it. Personally, I haven't managed an
> >> operator release, although I've done it for Flink itself in the past.
> >> Nevertheless, it would be nice to have somebody new to the process.
> >>
> >> Anyone reading this who wants to try being a release manager, please
> >> don't be afraid to volunteer. Of course we'll be able to assist. That
> >> would also be a good opportunity for us to update the docs regarding
> >> the release process.
> >>
> >> Cheers,
> >> Max
> >>
> >>
> >> On Wed, Feb 7, 2024 at 10:08 AM Rui Fan <1996fan...@gmail.com> wrote:
> >> >
> >> > If the release is postponed 1-2 more weeks, I could volunteer
> >> > as the one of the release managers.
> >> >
> >> > Best,
> >> > Rui
> >> >
> >> > On Wed, Feb 7, 2024 at 4:54 AM Gyula Fóra  wrote:
> >> >>
> >> >> Given the proposed timeline was a bit short / rushed I agree with Max
> >> that
> >> >> we could wait 1-2 more weeks to wrap up the current outstanding bigger
> >> >> features around memory tuning and the JDBC state store.
> >> >>
> >> >> In the meantime it would be great to involve 1-2 new committers (or
> >> other
> >> >> contributors) in the operator release process so that we have some
> >> fresh
> >> >> eyes on the process.
> >> >> Would anyone be interested in volunteering to help with the next
> >> release?
> >> >>
> >> >> Cheers,
> >> >> Gyula
> >> >>
> >> >> On Tue, Feb 6, 2024 at 4:35 PM Maximilian Michels 
> >> wrote:
> >> >>
> >> >> > Thanks for starting the discussion Gyula!
> >> >> >
> >> >> > It comes down to how important the outstanding changes are for the
> >> >> > release. Both the memory tuning as well as the JDBC changes probably
> >> >> > need 1-2 weeks realistically to complete the initial spec. For the
> >> >> > memory tuning, I would prefer merging it in the current state as an
> >> >> > experimental feature for the release which comes disabled out of the
> >> >> > box. [...]

Re: Re: [VOTE] FLIP-417: Expose JobManagerOperatorMetrics via REST API

2024-02-14 Thread Maximilian Michels
No objections.

-Max

On Wed, Feb 14, 2024 at 1:49 PM Alexander Fedulov
 wrote:
>
> Hi Mason,
>
> the adjustments of not requiring the operator ID make sense to me since
> this is the more prevalent expected usage pattern. I guess one small
> clarification that might be needed is the mention of the return data type.
> I assume since there are potentially multiple coordinators that can belong
> to the same vertex, the proposed new endpoint will return an array. Is that
> the plan?
>
> Overall +1 from my side.
>
> Best,
> Alex
>
> On Tue, 13 Feb 2024 at 22:49, Mason Chen  wrote:
>
> > Hi voters and devs,
> >
> > I'm inclined to close the voting thread with the additional minor details
> > to the FLIP. Please chime in if there are any objections!
> >
> > Best,
> > Mason
> >
> > On Wed, Feb 7, 2024 at 11:49 AM Mason Chen  wrote:
> >
> > > Hi Voters,
> > >
> > > JFYI, I have modified the proposed REST API path and added changes to the
> > > metric scope configuration--you can find the reasoning and discussion in
> > > the `[DISCUSS]` thread and FLIP doc. Please let me know if there are any
> > > concerns.
> > >
> > > Best,
> > > Mason
> > >
> > > On Mon, Jan 29, 2024 at 5:32 AM Thomas Weise  wrote:
> > >
> > >> +1 (binding)
> > >>
> > >>
> > >> On Mon, Jan 29, 2024 at 5:45 AM Maximilian Michels 
> > >> wrote:
> > >>
> > >> > +1 (binding)
> > >> >
> > >> > On Fri, Jan 26, 2024 at 6:03 AM Rui Fan <1996fan...@gmail.com> wrote:
> > >> > >
> > >> > > +1(binding)
> > >> > >
> > >> > > Best,
> > >> > > Rui
> > >> > >
> > >> > > On Fri, Jan 26, 2024 at 11:55 AM Xuyang  wrote:
> > >> > >
> > >> > > > +1 (non-binding)
> > >> > > >
> > >> > > >
> > >> > > > --
> > >> > > >
> > >> > > > Best!
> > >> > > > Xuyang
> > >> > > >
> > >> > > >
> > >> > > >
> > >> > > >
> > >> > > >
> > >> > > > On 2024-01-26 10:12:34, "Hang Ruan" wrote:
> > >> > > > >Thanks for the FLIP.
> > >> > > > >
> > >> > > > >+1 (non-binding)
> > >> > > > >
> > >> > > > >Best,
> > >> > > > >Hang
> > >> > > > >
> > >> > > > >Mason Chen wrote on Fri, Jan 26, 2024 at 04:51:
> > >> > > > >
> > >> > > > >> Hi Devs,
> > >> > > > >>
> > >> > > > >> I would like to start a vote on FLIP-417: Expose
> > >> > > > JobManagerOperatorMetrics
> > >> > > > >> via REST API [1] which has been discussed in this thread [2].
> > >> > > > >>
> > >> > > > >> The vote will be open for at least 72 hours unless there is an
> > >> > > > objection or
> > >> > > > >> not enough votes.
> > >> > > > >>
> > >> > > > >> [1]
> > >> > > > >>
> > >> > > > >>
> > >> > > >
> > >> >
> > >>
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-417%3A+Expose+JobManagerOperatorMetrics+via+REST+API
> > >> > > > >> [2]
> > >> > https://lists.apache.org/thread/tt0hf6kf5lcxd7g62v9dhpn3z978pxw0
> > >> > > > >>
> > >> > > > >> Best,
> > >> > > > >> Mason
> > >> > > > >>
> > >> > > >
> > >> >
> > >>
> > >
> >


Re: [DISCUSS] Kubernetes Operator 1.8.0 release planning

2024-02-07 Thread Maximilian Michels
It's very considerate that you want to volunteer to be the release
manager, but given that you have already managed one release, I would
ideally like somebody else to do it. Personally, I haven't managed an
operator release, although I've done it for Flink itself in the past.
Nevertheless, it would be nice to have somebody new to the process.

Anyone reading this who wants to try being a release manager, please
don't be afraid to volunteer. Of course we'll be able to assist. That
would also be a good opportunity for us to update the docs regarding
the release process.

Cheers,
Max


On Wed, Feb 7, 2024 at 10:08 AM Rui Fan <1996fan...@gmail.com> wrote:
>
> If the release is postponed 1-2 more weeks, I could volunteer
> as the one of the release managers.
>
> Best,
> Rui
>
> On Wed, Feb 7, 2024 at 4:54 AM Gyula Fóra  wrote:
>>
>> Given the proposed timeline was a bit short / rushed I agree with Max that
>> we could wait 1-2 more weeks to wrap up the current outstanding bigger
>> features around memory tuning and the JDBC state store.
>>
>> In the meantime it would be great to involve 1-2 new committers (or other
>> contributors) in the operator release process so that we have some fresh
>> eyes on the process.
>> Would anyone be interested in volunteering to help with the next release?
>>
>> Cheers,
>> Gyula
>>
>> On Tue, Feb 6, 2024 at 4:35 PM Maximilian Michels  wrote:
>>
>> > Thanks for starting the discussion Gyula!
>> >
>> > It comes down to how important the outstanding changes are for the
>> > release. Both the memory tuning as well as the JDBC changes probably
>> > need 1-2 weeks realistically to complete the initial spec. For the
>> > memory tuning, I would prefer merging it in the current state as an
>> > experimental feature for the release which comes disabled out of the
>> > box. The reason is that it can already be useful to users who want to
>> > try it out; we have seen some interest in it. Then for the next
>> > release we will offer a richer feature set and might enable it by
>> > default.
>> >
>> > Cheers,
>> > Max
>> >
>> > On Tue, Feb 6, 2024 at 10:53 AM Rui Fan <1996fan...@gmail.com> wrote:
>> > >
>> > > Thanks Gyula for driving this release!
>> > >
> > Releasing 1.8.0 makes sense to me.
>> > >
>> > > As you said, I'm developing the JDBC event handler.
> > I'm going on vacation starting this Friday, and I have some other
> > work to finish before then. After evaluating my time today, I found
> > that I cannot complete the development, testing, and merging of the
> > JDBC event handler this week. So I tend to put the JDBC event
> > handler into the next version.
>> > >
>> > > Best,
>> > > Rui
>> > >
>> > > On Mon, Feb 5, 2024 at 11:42 PM Gyula Fóra  wrote:
>> > >
>> > > > Hi all!
>> > > >
>> > > > I would like to kick off the release planning for the operator 1.8.0
>> > > > release. The last operator release was November 22 last year. Since
>> > then we
>> > > > have added a number of fixes and improvements to both the operator and
>> > the
>> > > > autoscaler logic.
>> > > >
>> > > > There are a few outstanding PRs currently, including some larger
>> > features
>> > > > for the Autoscaler (JDBC event handler, Heap tuning), we have to make a
>> > > > decision regarding those as well whether to include in the release or
>> > not. @Maximilian
>> > > > Michels  , @Rui Fan <1996fan...@gmail.com> what's your
>> > > > take regarding those PRs? I generally like to be a bit more
>> > conservative
>> > > > with large new features to avoid introducing last minute instabilities.
>> > > >
>> > > > My proposal would be to aim for the end of this week as the freeze date
>> > > > (Feb 9) and then we can prepare RC1 on monday.
>> > > >
>> > > > I am happy to volunteer as a release manager but I am of course open to
>> > > > working together with someone on this.
>> > > >
>> > > > What do you think?
>> > > >
>> > > > Cheers,
>> > > > Gyula
>> > > >
>> >


Re: [DISCUSS] Kubernetes Operator 1.8.0 release planning

2024-02-06 Thread Maximilian Michels
Thanks for starting the discussion Gyula!

It comes down to how important the outstanding changes are for the
release. Both the memory tuning as well as the JDBC changes probably
need 1-2 weeks realistically to complete the initial spec. For the
memory tuning, I would prefer merging it in the current state as an
experimental feature for the release which comes disabled out of the
box. The reason is that it can already be useful to users who want to
try it out; we have seen some interest in it. Then for the next
release we will offer a richer feature set and might enable it by
default.

Cheers,
Max

On Tue, Feb 6, 2024 at 10:53 AM Rui Fan <1996fan...@gmail.com> wrote:
>
> Thanks Gyula for driving this release!
>
> Releasing 1.8.0 makes sense to me.
>
> As you said, I'm developing the JDBC event handler.
> I'm going on vacation starting this Friday, and I have some other
> work to finish before then. After evaluating my time today, I found
> that I cannot complete the development, testing, and merging of the
> JDBC event handler this week. So I tend to put the JDBC event
> handler into the next version.
>
> Best,
> Rui
>
> On Mon, Feb 5, 2024 at 11:42 PM Gyula Fóra  wrote:
>
> > Hi all!
> >
> > I would like to kick off the release planning for the operator 1.8.0
> > release. The last operator release was November 22 last year. Since then we
> > have added a number of fixes and improvements to both the operator and the
> > autoscaler logic.
> >
> > There are a few outstanding PRs currently, including some larger features
> > for the Autoscaler (JDBC event handler, Heap tuning), we have to make a
> > decision regarding those as well whether to include in the release or not. 
> > @Maximilian
> > Michels  , @Rui Fan <1996fan...@gmail.com> what's your
> > take regarding those PRs? I generally like to be a bit more conservative
> > with large new features to avoid introducing last minute instabilities.
> >
> > My proposal would be to aim for the end of this week as the freeze date
> > (Feb 9) and then we can prepare RC1 on monday.
> >
> > I am happy to volunteer as a release manager but I am of course open to
> > working together with someone on this.
> >
> > What do you think?
> >
> > Cheers,
> > Gyula
> >


Re: [VOTE] Release flink-connector-kafka v3.1.0, release candidate #1

2024-02-06 Thread Maximilian Michels
+1 (binding)

- Inspected source release (checked license, headers, no binaries)
- Verified checksums and signature

Cheers,
Max

On Sun, Feb 4, 2024 at 5:41 AM Qingsheng Ren  wrote:
>
> Thanks for driving this, Martijn!
>
> +1 (binding)
>
> - Verified checksum and signature
> - Verified no binaries in source
> - Built from source with Java 8
> - Reviewed web PRs
> - Run a Flink SQL job reading and writing Kafka on 1.18.1 cluster. Results
> are as expected.
>
> Best,
> Qingsheng
>
> On Tue, Jan 30, 2024 at 3:50 PM Mason Chen  wrote:
>
> > +1 (non-binding)
> >
> > * Verified LICENSE and NOTICE files (this RC has a NOTICE file that points
> > to 2023 that has since been updated on the main branch by Hang)
> > * Verified hashes and signatures
> > * Verified no binaries
> > * Verified poms point to 3.1.0
> > * Reviewed web PR
> > * Built from source
> > * Verified git tag
> >
> > In the same vein as the web PR, do we want to prepare the PR to update the
> > shortcode in the connector docs now [1]? Same for the Chinese version. I
> > wonder if that should be included in the connector release instructions.
> >
> > [1]
> >
> > https://github.com/apache/flink-connector-kafka/blob/d89a082180232bb79e3c764228c4e7dbb9eb6b8b/docs/content/docs/connectors/datastream/kafka.md#L39
> >
> > Best,
> > Mason
> >
> > On Sun, Jan 28, 2024 at 11:41 PM Hang Ruan  wrote:
> >
> > > +1 (non-binding)
> > >
> > > - Validated checksum hash
> > > - Verified signature
> > > - Verified that no binaries exist in the source archive
> > > - Build the source with Maven and jdk11
> > > - Verified web PR
> > > - Check that the jar is built by jdk8
> > >
> > > Best,
> > > Hang
> > >
> > > Martijn Visser wrote on Fri, Jan 26, 2024 at 21:05:
> > >
> > > > Hi everyone,
> > > > Please review and vote on the release candidate #1 for the Flink Kafka
> > > > connector version 3.1.0, as follows:
> > > > [ ] +1, Approve the release
> > > > [ ] -1, Do not approve the release (please provide specific comments)
> > > >
> > > > This release is compatible with Flink 1.17.* and Flink 1.18.*
> > > >
> > > > The complete staging area is available for your review, which includes:
> > > > * JIRA release notes [1],
> > > > * the official Apache source release to be deployed to dist.apache.org
> > > > [2],
> > > > which are signed with the key with fingerprint
> > > > A5F3BCE4CBE993573EC5966A65321B8382B219AF [3],
> > > > * all artifacts to be deployed to the Maven Central Repository [4],
> > > > * source code tag v3.1.0-rc1 [5],
> > > > * website pull request listing the new release [6].
> > > >
> > > > The vote will be open for at least 72 hours. It is adopted by majority
> > > > approval, with at least 3 PMC affirmative votes.
> > > >
> > > > Thanks,
> > > > Release Manager
> > > >
> > > > [1]
> > > >
> > > >
> > >
> > https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12353135
> > > > [2]
> > > >
> > > >
> > >
> > https://dist.apache.org/repos/dist/dev/flink/flink-connector-kafka-3.1.0-rc1
> > > > [3] https://dist.apache.org/repos/dist/release/flink/KEYS
> > > > [4]
> > > https://repository.apache.org/content/repositories/orgapacheflink-1700
> > > > [5]
> > > >
> > https://github.com/apache/flink-connector-kafka/releases/tag/v3.1.0-rc1
> > > > [6] https://github.com/apache/flink-web/pull/718
> > > >
> > >
> >


Re: [VOTE] Release flink-connector-parent, release candidate #1

2024-01-29 Thread Maximilian Michels
- Inspected the source for licenses and corresponding headers
- Checksums and signature OK

+1 (binding)

On Tue, Jan 23, 2024 at 4:08 PM Etienne Chauchot  wrote:
>
> Hi everyone,
>
> Please review and vote on the release candidate #1 for the version
> 1.1.0, as follows:
>
> [ ] +1, Approve the release
> [ ] -1, Do not approve the release (please provide specific comments)
>
> The complete staging area is available for your review, which includes:
> * JIRA release notes [1],
> * the official Apache source release to be deployed to dist.apache.org
> [2], which are signed with the key with fingerprint
> D1A76BA19D6294DD0033F6843A019F0B8DD163EA [3],
> * all artifacts to be deployed to the Maven Central Repository [4],
> * source code tag v1.1.0-rc1 [5],
> * website pull request listing the new release [6]
>
> * confluence wiki: connector parent upgrade to version 1.1.0 that will
> be validated after the artifact is released (there is no PR mechanism on
> the wiki) [7]
>
> The vote will be open for at least 72 hours. It is adopted by majority
> approval, with at least 3 PMC affirmative votes.
>
> Thanks,
>
> Etienne
>
> [1]
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12353442
> [2]
> https://dist.apache.org/repos/dist/dev/flink/flink-connector-parent-1.1.0-rc1
> [3] https://dist.apache.org/repos/dist/release/flink/KEYS
> [4] https://repository.apache.org/content/repositories/orgapacheflink-1698/
> [5]
> https://github.com/apache/flink-connector-shared-utils/releases/tag/v1.1.0-rc1
> [6] https://github.com/apache/flink-web/pull/717
>
> [7]
> https://cwiki.apache.org/confluence/display/FLINK/Externalized+Connector+development


Re: Re: [VOTE] FLIP-417: Expose JobManagerOperatorMetrics via REST API

2024-01-29 Thread Maximilian Michels
+1 (binding)

On Fri, Jan 26, 2024 at 6:03 AM Rui Fan <1996fan...@gmail.com> wrote:
>
> +1(binding)
>
> Best,
> Rui
>
> On Fri, Jan 26, 2024 at 11:55 AM Xuyang  wrote:
>
> > +1 (non-binding)
> >
> >
> > --
> >
> > Best!
> > Xuyang
> >
> >
> >
> >
> >
> > On 2024-01-26 10:12:34, "Hang Ruan" wrote:
> > >Thanks for the FLIP.
> > >
> > >+1 (non-binding)
> > >
> > >Best,
> > >Hang
> > >
> > >Mason Chen wrote on Fri, Jan 26, 2024 at 04:51:
> > >
> > >> Hi Devs,
> > >>
> > >> I would like to start a vote on FLIP-417: Expose
> > JobManagerOperatorMetrics
> > >> via REST API [1] which has been discussed in this thread [2].
> > >>
> > >> The vote will be open for at least 72 hours unless there is an
> > objection or
> > >> not enough votes.
> > >>
> > >> [1]
> > >>
> > >>
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-417%3A+Expose+JobManagerOperatorMetrics+via+REST+API
> > >> [2] https://lists.apache.org/thread/tt0hf6kf5lcxd7g62v9dhpn3z978pxw0
> > >>
> > >> Best,
> > >> Mason
> > >>
> >


[jira] [Created] (FLINK-34213) Consider using accumulated busy time instead of busyMsPerSecond

2024-01-23 Thread Maximilian Michels (Jira)
Maximilian Michels created FLINK-34213:
--

 Summary: Consider using accumulated busy time instead of 
busyMsPerSecond
 Key: FLINK-34213
 URL: https://issues.apache.org/jira/browse/FLINK-34213
 Project: Flink
  Issue Type: Improvement
  Components: Autoscaler, Kubernetes Operator
Reporter: Maximilian Michels


We might achieve much better accuracy if we used the accumulated busy time 
metrics from Flink, instead of the momentarily collected ones.

We would use the diff between the last accumulated and the current accumulated 
busy time.
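As a rough illustration of the idea — the helper below is hypothetical, not an existing autoscaler API; names are made up — the average busy fraction over a collection interval would be derived from the diff of two accumulated readings instead of a single sampled busyMsPerSecond value:

```python
def busy_fraction(prev_busy_ms, curr_busy_ms, prev_ts_ms, curr_ts_ms):
    """Average busy fraction over [prev_ts, curr_ts], computed from the
    diff of the accumulated busy-time metric rather than from a
    momentary busyMsPerSecond sample."""
    return (curr_busy_ms - prev_busy_ms) / (curr_ts_ms - prev_ts_ms)

# A task that accumulated 30s of busy time over a 60s window was on
# average 50% busy, regardless of momentary spikes at sampling instants.
print(busy_fraction(10_000, 40_000, 0, 60_000))  # 0.5
```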



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-34152) Tune memory of autoscaled jobs

2024-01-18 Thread Maximilian Michels (Jira)
Maximilian Michels created FLINK-34152:
--

 Summary: Tune memory of autoscaled jobs
 Key: FLINK-34152
 URL: https://issues.apache.org/jira/browse/FLINK-34152
 Project: Flink
  Issue Type: New Feature
  Components: Autoscaler, Kubernetes Operator
Reporter: Maximilian Michels
Assignee: Maximilian Michels
 Fix For: kubernetes-operator-1.8.0


The current autoscaling algorithm adjusts the parallelism of the job task 
vertices according to the processing needs. By adjusting the parallelism, we 
systematically scale the amount of CPU for a task. At the same time, we also 
indirectly change the amount of memory tasks have at their disposal. However, 
there are some problems with this.
 # Memory is overprovisioned: On scale up we may add more memory than we 
actually need. Even on scale down, the memory / cpu ratio can still be off and 
too much memory is used.
 # Memory is underprovisioned: For stateful jobs, we risk running into 
OutOfMemoryErrors on scale down. Even before running out of memory, too little 
memory can have a negative impact on the effectiveness of the scaling.

We lack the capability to tune memory proportionally to the processing needs. 
In the same way that we measure CPU usage and size the tasks accordingly, we 
need to evaluate memory usage and adjust the heap memory size.

A tuning algorithm could look like this:
h2. 1. Establish a memory baseline

We observe the average heap memory usage at task managers.
h2. 2. Calculate memory usage per record

The memory requirements per record can be estimated by calculating this ratio:
{noformat}
memory_per_rec = sum(heap_usage) / sum(records_processed)
{noformat}
This ratio is surprisingly constant based on empirical data.
h2. 3. Scale memory proportionally to the per-record memory needs
{noformat}
memory_per_tm = expected_records_per_sec * memory_per_rec / num_task_managers 
{noformat}
A minimum memory limit needs to be added to avoid scaling down memory too much. 
The max memory per TM should be equal to the initially defined user-specified 
limit from the ResourceSpec. 
{noformat}
memory_per_tm = max(min_limit, memory_per_tm)
memory_per_tm = min(max_limit, memory_per_tm) {noformat}
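The three steps above can be sketched end to end (an illustrative Python sketch with assumed parameter names, not the operator's actual implementation):

```python
def tune_tm_heap(heap_usage_samples, records_processed,
                 expected_records_per_sec, num_task_managers,
                 min_limit, max_limit):
    # 1. Baseline: observed heap usage per processed record
    memory_per_rec = sum(heap_usage_samples) / sum(records_processed)
    # 2./3. Scale memory proportionally to the expected throughput per TM
    memory_per_tm = expected_records_per_sec * memory_per_rec / num_task_managers
    # Clamp between the minimum and the user-specified ResourceSpec limit
    return min(max_limit, max(min_limit, memory_per_tm))

# Example: 10 bytes/record baseline, 50 rec/s expected across 2 TMs.
print(tune_tm_heap([1000, 2000], [100, 200], 50, 2,
                   min_limit=100, max_limit=1000))  # 250.0
```

The clamping at the end mirrors the two `{noformat}` formulas above: the minimum guards against scaling memory down too far, the maximum is the user-defined limit.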



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-34151) Integrate Karpenter resource limits into cluster capacity check

2024-01-18 Thread Maximilian Michels (Jira)
Maximilian Michels created FLINK-34151:
--

 Summary: Integrate Karpenter resource limits into cluster capacity 
check
 Key: FLINK-34151
 URL: https://issues.apache.org/jira/browse/FLINK-34151
 Project: Flink
  Issue Type: Improvement
  Components: Autoscaler, Kubernetes Operator
Reporter: Maximilian Michels
Assignee: Maximilian Michels
 Fix For: kubernetes-operator-1.8.0


FLINK-33771 added cluster capacity checking for Flink Autoscaling decisions. 
The checks respect the scaling limits of the Kubernetes Cluster Autoscaler. 

We should also support Karpenter-based resource checks, as Karpenter is the 
preferred method of expanding the cluster size in some environments.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: Re: [VOTE] Accept Flink CDC into Apache Flink

2024-01-10 Thread Maximilian Michels
+1 (binding)

On Wed, Jan 10, 2024 at 11:22 AM Martijn Visser
 wrote:
>
> +1 (binding)
>
> On Wed, Jan 10, 2024 at 4:43 AM Xingbo Huang  wrote:
> >
> > +1 (binding)
> >
> > Best,
> > Xingbo
> >
> > Dian Fu  于2024年1月10日周三 11:35写道:
> >
> > > +1 (binding)
> > >
> > > Regards,
> > > Dian
> > >
> > > On Wed, Jan 10, 2024 at 5:09 AM Sharath  wrote:
> > > >
> > > > +1 (non-binding)
> > > >
> > > > Best,
> > > > Sharath
> > > >
> > > > On Tue, Jan 9, 2024 at 1:02 PM Venkata Sanath Muppalla <
> > > sanath...@gmail.com>
> > > > wrote:
> > > >
> > > > > +1 (non-binding)
> > > > >
> > > > > Thanks,
> > > > > Sanath
> > > > >
> > > > > On Tue, Jan 9, 2024 at 11:16 AM Peter Huang <
> > > huangzhenqiu0...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > +1 (non-binding)
> > > > > >
> > > > > >
> > > > > > Best Regards
> > > > > > Peter Huang
> > > > > >
> > > > > >
> > > > > > On Tue, Jan 9, 2024 at 5:26 AM Jane Chan 
> > > wrote:
> > > > > >
> > > > > > > +1 (non-binding)
> > > > > > >
> > > > > > > Best,
> > > > > > > Jane
> > > > > > >
> > > > > > > On Tue, Jan 9, 2024 at 8:41 PM Lijie Wang <
> > > wangdachui9...@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > +1 (non-binding)
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Lijie
> > > > > > > >
> > > > > > > > Jiabao Sun  于2024年1月9日周二
> > > 19:28写道:
> > > > > > > >
> > > > > > > > > +1 (non-binding)
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > > Jiabao
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On 2024/01/09 09:58:04 xiangyu feng wrote:
> > > > > > > > > > +1 (non-binding)
> > > > > > > > > >
> > > > > > > > > > Regards,
> > > > > > > > > > Xiangyu Feng
> > > > > > > > > >
> > > > > > > > > > Danny Cranmer  于2024年1月9日周二 17:50写道:
> > > > > > > > > >
> > > > > > > > > > > +1 (binding)
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > > Danny
> > > > > > > > > > >
> > > > > > > > > > > On Tue, Jan 9, 2024 at 9:31 AM Feng Jin 
> > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > +1 (non-binding)
> > > > > > > > > > > >
> > > > > > > > > > > > Best,
> > > > > > > > > > > > Feng Jin
> > > > > > > > > > > >
> > > > > > > > > > > > On Tue, Jan 9, 2024 at 5:29 PM Yuxin Tan <
> > > ta...@gmail.com>
> > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > +1 (non-binding)
> > > > > > > > > > > > >
> > > > > > > > > > > > > Best,
> > > > > > > > > > > > > Yuxin
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > Márton Balassi  于2024年1月9日周二 17:25写道:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > +1 (binding)
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Tue, Jan 9, 2024 at 10:15 AM Leonard Xu <
> > > > > > xb...@gmail.com>
> > > > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > +1(binding)
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Best,
> > > > > > > > > > > > > > > Leonard
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 2024年1月9日 下午5:08,Yangze Guo 
> > > 写道:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > +1 (non-binding)
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Best,
> > > > > > > > > > > > > > > > Yangze Guo
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Tue, Jan 9, 2024 at 5:06 PM Robert Metzger <
> > > > > > > > > > > rmetz...@apache.org
> > > > > > > > > > > > >
> > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > >> +1 (binding)
> > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > >> On Tue, Jan 9, 2024 at 9:54 AM Guowei Ma <
> > > > > > > gu...@gmail.com
> > > > > > > > >
> > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > >>> +1 (binding)
> > > > > > > > > > > > > > > >>> Best,
> > > > > > > > > > > > > > > >>> Guowei
> > > > > > > > > > > > > > > >>>
> > > > > > > > > > > > > > > >>>
> > > > > > > > > > > > > > > >>> On Tue, Jan 9, 2024 at 4:49 PM Rui Fan <
> > > > > > > 19...@gmail.com>
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > >>>
> > > > > > > > > > > > > > >  +1 (non-binding)
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > >  Best,
> > > > > > > > > > > > > > >  Rui
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > >  On Tue, Jan 9, 2024 at 4:41 PM Hang Ruan <
> > > > > > > > > > > > ruanhang1...@gmail.com>
> > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > +1 (non-binding)
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Best,
> > > > > > > > > > > > > > > > Hang
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > gongzhongqiang 
> > > 于2024年1月9日周二
> > > > > > > > > > > > 16:25写道:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >> +1 non-binding
> > > > > > > > > > > > > > > >>
> > > >

[jira] [Created] (FLINK-33993) Ineffective scaling detection events are misleading

2024-01-04 Thread Maximilian Michels (Jira)
Maximilian Michels created FLINK-33993:
--

 Summary: Ineffective scaling detection events are misleading
 Key: FLINK-33993
 URL: https://issues.apache.org/jira/browse/FLINK-33993
 Project: Flink
  Issue Type: Bug
  Components: Autoscaler, Kubernetes Operator
Affects Versions: kubernetes-operator-1.7.0
Reporter: Maximilian Michels
Assignee: Maximilian Michels
 Fix For: kubernetes-operator-1.8.0


When the ineffective scaling detection feature is turned off, events are 
still generated which look like this:

{noformat}
Skipping further scale up after ineffective previous scale up for 
65c763af14a952c064c400d516c25529
{noformat}

This is misleading because no action will be taken. It is fair to inform users 
about ineffective scale-ups even when the feature is disabled, but a different 
message should be printed to convey that no action will be taken.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [Discuss][Flink-31326] Flink autoscaler code

2024-01-04 Thread Maximilian Michels
We discussed in the PR that it's actually a feature, but thanks Yang
for bringing it up and improving the docs around this piece of code!

-Max

On Tue, Jan 2, 2024 at 10:06 PM Yang LI  wrote:
>
> Hello Rui,
>
> Here is the jira ticket https://issues.apache.org/jira/browse/FLINK-33966, I 
> have pushed a tiny pr for this ticket.
>
> Regards,
> Yang
>
> On Tue, 2 Jan 2024 at 16:15, Rui Fan <1996fan...@gmail.com> wrote:
>>
>> Thanks Yang for reporting this issue!
>>
>> You are right, these 2 conditions are indeed the same. It's unexpected IIUC.
>> Would you like to fix it?
>>
>> Feel free to create a FLINK JIRA to fix it if you would like to, and I'm
>> happy to
>> review!
>>
>> And cc @Maximilian Michels 
>>
>> Best,
>> Rui
>>
>> On Tue, Jan 2, 2024 at 11:03 PM Yang LI  wrote:
>>
>> > Hello,
>> >
>> > I see we have 2 times the same condition check in the
>> > function getNumRecordsInPerSecond (L220
>> > <
>> > https://github.com/apache/flink-kubernetes-operator/blob/main/flink-autoscaler/src/main/java/org/apache/flink/autoscaler/metrics/ScalingMetrics.java#L220
>> > >
>> > and
>> > L224
>> > <
>> > https://github.com/apache/flink-kubernetes-operator/blob/main/flink-autoscaler/src/main/java/org/apache/flink/autoscaler/metrics/ScalingMetrics.java#L224
>> > >).
>> > I imagine you want to use SOURCE_TASK_NUM_RECORDS_OUT_PER_SEC when the
>> > operator is not the source. Can you confirm this, and is there a JIRA
>> > ticket to fix it?
>> >
>> > Regards,
>> > Yang LI
>> >


[ANNOUNCE] New Apache Flink Committer - Alexander Fedulov

2024-01-02 Thread Maximilian Michels
Happy New Year everyone,

I'd like to start the year off by announcing Alexander Fedulov as a
new Flink committer.

Alex has been active in the Flink community since 2019. He has
contributed more than 100 commits to Flink, its Kubernetes operator,
and various connectors [1][2].

Especially noteworthy are his contributions on deprecating and
migrating the old Source API functions and test harnesses, the
enhancement to flame graphs, the dynamic rescale time computation in
Flink Autoscaling, as well as all the small enhancements Alex has
contributed which make a huge difference.

Beyond code contributions, Alex has been an active community member
with his activity on the mailing lists [3][4], as well as various
talks and blog posts about Apache Flink [5][6].

Congratulations Alex! The Flink community is proud to have you.

Best,
The Flink PMC

[1] https://github.com/search?type=commits&q=author%3Aafedulov+org%3Aapache
[2] 
https://issues.apache.org/jira/browse/FLINK-28229?jql=status%20in%20(Resolved%2C%20Closed)%20AND%20assignee%20in%20(afedulov)%20ORDER%20BY%20resolved%20DESC%2C%20created%20DESC
[3] https://lists.apache.org/list?dev@flink.apache.org:lte=100M:Fedulov
[4] https://lists.apache.org/list?u...@flink.apache.org:lte=100M:Fedulov
[5] 
https://flink.apache.org/2020/01/15/advanced-flink-application-patterns-vol.1-case-study-of-a-fraud-detection-system/
[6] 
https://www.ververica.com/blog/presenting-our-streaming-concepts-introduction-to-flink-video-series


Re: [DISCUSS] Release flink-connector-parent v1.01

2023-12-21 Thread Maximilian Michels
> Anyone for pushing my pub key to apache dist ?

Done.

On Thu, Dec 21, 2023 at 2:36 PM Etienne Chauchot  wrote:
>
> Hello,
>
> All the ongoing PRs on this repo were merged. But I'd like to leave
> a few more days until feature freeze in case someone has a feature ready
> to integrate.
>
> Let's put the feature freeze to 00:00:00 UTC on December 27th.
>
> Best
>
> Etienne
>
> Le 15/12/2023 à 16:41, Ryan Skraba a écrit :
> > Hello!  I've been following this discussion (while looking and
> > building a lot of the connectors):
> >
> > +1 (non-binding) to doing a 1.1.0 release adding the configurability
> > of surefire and jvm flags.
> >
> > Thanks for driving this!
> >
> > Ryan
> >
> > On Fri, Dec 15, 2023 at 2:06 PM Etienne Chauchot  
> > wrote:
> >> Hi PMC members,
> >>
> >> Version will be 1.1.0 and not 1.0.1 as one of the PMC members already
> >> created this version tag in jira and tickets are targeted to this version.
> >>
> >> Anyone for pushing my pub key to apache dist ?
> >>
> >> Thanks
> >>
> >> Etienne
> >>
> >> Le 14/12/2023 à 17:51, Etienne Chauchot a écrit :
> >>> Hi all,
> >>>
> >>> It has been 2 weeks since the start of this release discussion. For
> >>> now only Sergey agreed to release. On a lazy consensus basis, let's
> >>> say that we leave until Monday for people to express concerns about
> >>> releasing connector-parent.
> >>>
> >>> In the meantime, I'm doing my environment setup and I miss the rights
> >>> to upload my GPG pub key to flink apache dist repo. Can one of the PMC
> >>> members push it ?
> >>>
> >>> Joint to this email is the updated KEYS file with my pub key added.
> >>>
> >>> Thanks
> >>>
> >>> Best
> >>>
> >>> Etienne
> >>>
> >>> Le 05/12/2023 à 16:30, Etienne Chauchot a écrit :
>  Hi Péter,
> 
>  My answers are inline
> 
> 
>  Best
> 
>  Etienne
> 
> 
>  Le 05/12/2023 à 05:27, Péter Váry a écrit :
> > Hi Etienne,
> >
> > Which branch would you cut the release from?
>  the parent_pom branch (consisting of a single maven pom file)
> > I find the flink-connector-parent branches confusing.
> >
> > If I merge a PR to the ci_utils branch, would it immediately change the 
> > CI
> > workflow of all of the connectors?
>  The ci_utils branch is basically one ci.yml workflow. _testing.yml
>  and the maven test-project are both there to test the ci.yml workflow and
>  demonstrate what it can do to connector authors.
> 
>  As the connectors' workflows reference ci.yml like this:
>  apache/flink-connector-shared-utils/.github/workflows/ci.yml@ci_utils,
>  if we merge changes to ci.yml, the CI of all the connectors' repos
>  will change.
> 
> > If I merge something to the release_utils branch, would it immediately
> > change the release process of all of the connectors?
>  I don't know how release-utils scripts are integrated with the
>  connectors' code yet
> > I would like to add the possibility of creating Python packages for the
> > connectors [1]. This would consist of some common code, which should 
> > reside
> > in flink-connector-parent, like:
> > - scripts for running Python test - test infra. I expect that this would
> > evolve in time
> > - ci workflow - this would be slower moving, but might change if the
> > infra is changing
> > - release scripts - this would be slow moving, but might change too.
> >
> > I think we should have a release for all of the above components, so the
> > connectors could move forward on their own pace.
> 
>  I think this is quite out of scope for this release: here we are
>  only talking about releasing a parent pom maven file for the connectors.
> 
> > What do you think?
> >
> > Thanks,
> > Péter
> >
> > [1]https://issues.apache.org/jira/browse/FLINK-33528
> >
> > On Thu, Nov 30, 2023, 16:55 Etienne Chauchot   
> > wrote:
> >
> >> Thanks Sergey for your vote. Indeed I have listed only the PRs merged
> >> since last release but there are these 2 open PRs that could be worth
> >> reviewing/merging before release.
> >>
> >> https://github.com/apache/flink-connector-shared-utils/pull/25
> >>
> >> https://github.com/apache/flink-connector-shared-utils/pull/20
> >>
> >> Best
> >>
> >> Etienne
> >>
> >>
> >> Le 30/11/2023 à 11:12, Sergey Nuyanzin a écrit :
> >>> thanks for volunteering Etienne
> >>>
> >>> +1 for releasing
> >>> however there is one more PR to enable custom jvm flags for connectors
> >>> in similar way it is done in Flink main repo for modules
> >>> It will simplify a bit support for java 17
> >>>
> >>> could we have this as well in the coming release?
> >>>
> >>>
> >>>
> >>> On Wed, Nov 29, 2023 at 11:40 AM Etienne 
> >>> Chauchot
> >>> wrote:
> >>>
>  Hi all,
> 
>  I would li

Re: [DISCUSS] Should Configuration support getting value based on String key?

2023-12-13 Thread Maximilian Michels
Hi Rui,

+1 for removing the @Deprecated annotation from `getString(String key,
String defaultValue)`. I would remove the other typed variants with
default values but I'm ok with keeping them if they are still used.

-Max

On Wed, Dec 13, 2023 at 4:59 AM Rui Fan <1996fan...@gmail.com> wrote:
>
> Hi devs,
>
> I'd like to start a discussion to discuss whether Configuration supports
> getting value based on the String key.
>
> In the FLIP-77[1] and FLINK-14493[2], a series of methods of Configuration
> are marked as @Deprecated, for example:
> - public String getString(String key, String defaultValue)
> - public long getLong(String key, long defaultValue)
> - public boolean getBoolean(String key, boolean defaultValue)
> - public int getInteger(String key, int defaultValue)
>
> The java doc suggests using getString(ConfigOption, String) or
> getOptional(ConfigOption), it means using ConfigOption as key
> instead of String.
>
> They have been deprecated since Flink 1.10, but these methods are still
> used in a lot of code. I think getString(String key, String
> defaultValue)
> shouldn't be deprecated, for 2 reasons:
>
> 1. A lot of scenarios don't define a ConfigOption; they use
> String as the key and value directly, such as: StreamConfig,
> TaskConfig, DistributedCache, etc.
>
> 2. Some code wants to convert all keys or values; this conversion
> is generic, so the getString(String key, String defaultValue) is needed.
> Such as: kubernetes-operator [3].
>
> Based on it, I have 2 solutions:
>
> 1. Removing the @Deprecated for these methods.
>
> 2. Only removing the @Deprecated for `public String getString(String key,
> String defaultValue)`
> and delete other getXxx(String key, Xxx defaultValue) directly.
> They were deprecated 8 minor versions ago. In general, the
> getString can replace getInteger, getBoolean, etc.
>
> I prefer solution1, because these getXxx methods are used for now,
> they are easy to use and don't bring large maintenance costs.
>
> Note: The alternative to public String getString(String key, String
> defaultValue)
> is Configuration.toMap, but it is not as convenient to use.
>
> Looking forward to hear more thoughts about it! Thank you~
> Also, very much looking forward to feedback from Dawid, the author of
> FLIP-77.
>
> [1] https://cwiki.apache.org/confluence/x/_RPABw
> [2] https://issues.apache.org/jira/browse/FLINK-14493
> [3]
> https://github.com/apache/flink-kubernetes-operator/pull/729/files#r1424811105
>
> Best,
> Rui


Re: [VOTE] FLIP-401: REST API JSON response deserialization unknown field tolerance

2023-12-12 Thread Maximilian Michels
+1 (binding)

On Tue, Dec 12, 2023 at 2:23 PM Peter Huang  wrote:
>
> +1 Non-binding
>
>
> Peter Huang
>
> Őrhidi Mátyás 于2023年12月12日 周二下午9:14写道:
>
> > +1
> > Matyas
> >
> > On Mon, Dec 11, 2023 at 10:26 PM Gyula Fóra  wrote:
> >
> > > +1
> > >
> > > Gyula
> > >
> > > On Mon, Dec 11, 2023 at 1:26 PM Gabor Somogyi  > >
> > > wrote:
> > >
> > > > Hi All,
> > > >
> > > > I'd like to start a vote on FLIP-401: REST API JSON response
> > > > deserialization unknown field tolerance [1] which has been discussed in
> > > > this thread [2].
> > > >
> > > > The vote will be open for at least 72 hours unless there is an
> > objection
> > > or
> > > > not enough votes.
> > > >
> > > > BR,
> > > > G
> > > >
> > > > [1]
> > > >
> > > >
> > >
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-401%3A+REST+API+JSON+response+deserialization+unknown+field+tolerance
> > > > [2] https://lists.apache.org/thread/s52w9cf60d6s10bpzv9qjczpl6m394rz
> > > >
> > >
> >


Re: [DISCUSS] Change the default restart-strategy to exponential-delay

2023-12-12 Thread Maximilian Michels
Thank you Rui! I think a 1.5 multiplier is a reasonable tradeoff
between restarting fast but not putting too much pressure on the
cluster due to restarts.

-Max

On Tue, Dec 12, 2023 at 8:19 AM Rui Fan <1996fan...@gmail.com> wrote:
>
> Hi Maximilian and Mason,
>
> Thanks a lot for your feedback!
>
> After an offline consultation with Max, I think I understand your
> concern now: when a Flink job restarts, it makes a bunch of
> calls to the Kubernetes API, e.g. read/write to config maps, create
> task managers. Currently, the default restart strategy is fixed-delay
> with a 1s delay time, so Flink will restart jobs with high frequency
> even if the jobs cannot be started. It can cause the Kubernetes
> cluster to become unstable.
>
> That's why I propose changing the default restart strategy to
> exponential-delay. It can achieve: restarts happen quickly
> enough unless there are consecutive failures. It is helpful for
> the stability of external components.
>
> After discussing with Max and Zhu Zhu at the PR comment[1],
> Max suggested using 1.5 as the default value of backoff-multiplier
> instead of 1.2. The 1.2 is a little small(delay time is too short).
> This picture[2] is the relationship between restart-attempts and
> retry-delay-time when backoff-multiplier is 1.2 and 1.5:
>
> - The delay-time will reach 1 min after 12 attempts when backoff-multiplier 
> is 1.5
> - The delay-time will reach 1 min after 24 attempts when backoff-multiplier 
> is 1.2
>
> Is there any other suggestion? Looking forward to more feedback, thanks~
>
> BTW, as Zhu said in the comment[1], if we update the default value,
> a new vote is needed for this default value. So I will pause
> FLINK-33736[1] first, and the rest of the JIRAs of FLIP-364 will be
> continued.
>
> To Mason:
>
> If I understand your concerns correctly, I still don't know how
> to benchmark this. The Kubernetes cluster instability only happens
> when a cluster has a lot of jobs. In general, a test cannot
> reproduce that pressure. Could you elaborate on how to
> benchmark this?
>
> After this FLIP, the default restart frequency will be reduced
> significantly. Especially when a job fails consecutively.
> Do you think the benchmark is necessary?
>
> Looking forward to your feedback, thanks~
>
> [1] https://github.com/apache/flink/pull/23247#discussion_r1422626734
> [2] 
> https://github.com/apache/flink/assets/38427477/642c57e0-b415-4326-af05-8b506c5fbb3a
> [3] https://issues.apache.org/jira/browse/FLINK-33736
>
> Best,
> Rui
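The attempt counts Rui quotes (delay reaching 1 min after 12 attempts at multiplier 1.5, after 24 at 1.2) can be checked with a short sketch, assuming the usual exponential-backoff form delay(n) = initial_backoff * multiplier**(n - 1), ignoring max-backoff and jitter:

```python
def attempts_until(target_delay_s: float, multiplier: float,
                   initial_s: float = 1.0) -> int:
    """Number of restart attempts until the delay first reaches the target."""
    n, delay = 1, initial_s
    while delay < target_delay_s:
        n += 1
        delay = initial_s * multiplier ** (n - 1)
    return n

print(attempts_until(60, 1.5))  # 12 attempts to reach a 1 min delay
print(attempts_until(60, 1.2))  # 24 attempts
```

This matches the numbers in the thread: the 1.5 multiplier backs off to a one-minute delay in half the attempts the 1.2 multiplier needs.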
>
> On Thu, Dec 7, 2023 at 10:57 PM Maximilian Michels  wrote:
>>
>> Hey Rui,
>>
>> +1 for changing the default restart strategy to exponential-delay.
>> This is something all users eventually run into. They end up changing
>> the restart strategy to exponential-delay. I think the current
>> defaults are quite balanced. Restarts happen quickly enough unless
>> there are consecutive failures where I think it makes sense to double
>> the waiting time up till the max.
>>
>> -Max
>>
>>
>> On Wed, Dec 6, 2023 at 12:51 AM Mason Chen  wrote:
>> >
>> > Hi Rui,
>> >
>> > Sorry for the late reply. I was suggesting that perhaps we could do some
>> > testing with Kubernetes wrt configuring values for the exponential restart
>> > strategy. We've noticed that the default strategy in 1.17 caused a lot of
>> > requests to the K8s API server for unstable deployments.
>> >
>> > However, people in different Kubernetes setups will have different limits
>> > so it would be challenging to provide a general benchmark. Another thing I
>> > found helpful in the past is to refer to Kubernetes--for example, the
>> > default strategy is exponential for pod restarts and we could draw
>> > inspiration from what they have set as a general purpose default config.
>> >
>> > Best,
>> > Mason
>> >
>> > On Sun, Nov 19, 2023 at 9:43 PM Rui Fan <1996fan...@gmail.com> wrote:
>> >
>> > > Hi David and Mason,
>> > >
>> > > Thanks for your feedback!
>> > >
>> > > To David:
>> > >
>> > > > Given that the new default feels more complex than the current 
>> > > > behavior,
>> > > if we decide to do this I think it will be important to include the
>> > > rationale you've shared in the documentation.
>> > >
>> > > That makes sense to me; I will add the related doc if we
>> > > update the default strategy.
>> > >
>> > > To Mason

Re: [DISCUSS] Change the default restart-strategy to exponential-delay

2023-12-07 Thread Maximilian Michels
Hey Rui,

+1 for changing the default restart strategy to exponential-delay.
This is something all users eventually run into. They end up changing
the restart strategy to exponential-delay. I think the current
defaults are quite balanced. Restarts happen quickly enough unless
there are consecutive failures where I think it makes sense to double
the waiting time up till the max.

-Max


On Wed, Dec 6, 2023 at 12:51 AM Mason Chen  wrote:
>
> Hi Rui,
>
> Sorry for the late reply. I was suggesting that perhaps we could do some
> testing with Kubernetes wrt configuring values for the exponential restart
> strategy. We've noticed that the default strategy in 1.17 caused a lot of
> requests to the K8s API server for unstable deployments.
>
> However, people in different Kubernetes setups will have different limits
> so it would be challenging to provide a general benchmark. Another thing I
> found helpful in the past is to refer to Kubernetes--for example, the
> default strategy is exponential for pod restarts and we could draw
> inspiration from what they have set as a general purpose default config.
>
> Best,
> Mason
>
> On Sun, Nov 19, 2023 at 9:43 PM Rui Fan <1996fan...@gmail.com> wrote:
>
> > Hi David and Mason,
> >
> > Thanks for your feedback!
> >
> > To David:
> >
> > > Given that the new default feels more complex than the current behavior,
> > if we decide to do this I think it will be important to include the
> > rationale you've shared in the documentation.
> >
> > That makes sense to me; I will add the related doc if we
> > update the default strategy.
> >
> > To Mason:
> >
> > > I suppose we could do some benchmarking on what works well for the
> > resource providers that Flink relies on e.g. Kubernetes. Based on
> > conferences and blogs,
> > > it seems most people are relying on Kubernetes to deploy Flink and the
> > restart strategy has a large dependency on how well Kubernetes can scale to
> > requests to redeploy the job.
> >
> > Sorry, I didn't understand what type of benchmarking
> > we should do, could you elaborate on it? Thanks a lot.
> >
> > Best,
> > Rui
> >
> > On Sat, Nov 18, 2023 at 3:32 AM Mason Chen  wrote:
> >
> >> Hi Rui,
> >>
> >> I suppose we could do some benchmarking on what works well for the
> >> resource providers that Flink relies on e.g. Kubernetes. Based on
> >> conferences and blogs, it seems most people are relying on Kubernetes to
> >> deploy Flink and the restart strategy has a large dependency on how well
> >> Kubernetes can scale to requests to redeploy the job.
> >>
> >> Best,
> >> Mason
> >>
> >> On Fri, Nov 17, 2023 at 10:07 AM David Anderson 
> >> wrote:
> >>
> >>> Rui,
> >>>
> >>> I don't have any direct experience with this topic, but given the
> >>> motivation you shared, the proposal makes sense to me. Given that the new
> >>> default feels more complex than the current behavior, if we decide to do
> >>> this I think it will be important to include the rationale you've shared 
> >>> in
> >>> the documentation.
> >>>
> >>> David
> >>>
> >>> On Wed, Nov 15, 2023 at 10:17 PM Rui Fan <1996fan...@gmail.com> wrote:
> >>>
>  Hi dear flink users and devs:
> 
>  FLIP-364[1] intends to make some improvements to restart-strategy
>  and discuss updating some of the default values of exponential-delay,
>  and whether exponential-delay can be used as the default
>  restart-strategy.
>  After discussing at dev mail list[2], we hope to collect more feedback
>  from Flink users.
> 
>  # Why does the default restart-strategy need to be updated?
> 
>  If checkpointing is enabled, the default value is fixed-delay with
>  Integer.MAX_VALUE restart attempts and '1 s' delay[3]. It means
>  the job will restart infinitely with high frequency when a job
>  continues to fail.
> 
>  When the Kafka cluster fails, a large number of flink jobs will be
>  restarted frequently. After the kafka cluster is recovered, a large
>  number of high-frequency restarts of flink jobs may cause the
>  kafka cluster to avalanche again.
> 
>  Considering the exponential-delay as the default strategy with
>  a couple of reasons:
> 
>  - The exponential-delay can reduce the restart frequency when
>    a job continues to fail.
>  - It can restart a job quickly when a job fails occasionally.
>  - The restart-strategy.exponential-delay.jitter-factor can avoid
>    restarting multiple jobs at the same time. It's useful to prevent
>    avalanches.
> 
>  # What are the current default values[4] of exponential-delay?
> 
>  restart-strategy.exponential-delay.initial-backoff : 1s
>  restart-strategy.exponential-delay.backoff-multiplier : 2.0
>  restart-strategy.exponential-delay.jitter-factor : 0.1
>  restart-strategy.exponential-delay.max-backoff : 5 min
>  restart-strategy.exponential-delay.reset-backoff-threshold : 1h
> 
>  backoff-multiplier=2 means 

[jira] [Created] (FLINK-33773) Add fairness to scaling decisions

2023-12-07 Thread Maximilian Michels (Jira)
Maximilian Michels created FLINK-33773:
--

 Summary: Add fairness to scaling decisions
 Key: FLINK-33773
 URL: https://issues.apache.org/jira/browse/FLINK-33773
 Project: Flink
  Issue Type: New Feature
  Components: Autoscaler, Deployment / Kubernetes
Reporter: Maximilian Michels
Assignee: Maximilian Michels
 Fix For: kubernetes-operator-1.8.0


The current scaling logic is inherently unfair. In a scenario of heavy backlog, 
whichever pipelines come first will end up taking most of the resources. 
Some kind of fairness should be introduced, for example:

* Cap the max number of resulting pods at a % of the cluster resources
* Allow scale up round-robin across all pipelines
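The first idea above can be sketched like this (a hypothetical helper with illustrative names, not the operator's API): cap each pipeline's scale-up target at a fixed share of total cluster capacity so no single pipeline can monopolize resources during a backlog.

```python
def capped_target(requested_pods: int, cluster_capacity_pods: int,
                  max_share: float = 0.25) -> int:
    """Limit one pipeline's pods to a share of the cluster, but never below 1."""
    cap = int(cluster_capacity_pods * max_share)
    return min(requested_pods, max(cap, 1))

# A pipeline asking for 40 pods on a 100-pod cluster gets at most 25.
print(capped_target(40, 100))  # 25
```

Round-robin scale-up (the second bullet) would then cycle through pipelines, granting each one increment per pass until its capped target is reached.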



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-33771) Add cluster capacity awareness to Autoscaler

2023-12-07 Thread Maximilian Michels (Jira)
Maximilian Michels created FLINK-33771:
--

 Summary: Add cluster capacity awareness to Autoscaler
 Key: FLINK-33771
 URL: https://issues.apache.org/jira/browse/FLINK-33771
 Project: Flink
  Issue Type: New Feature
  Components: Autoscaler, Kubernetes Operator
Reporter: Maximilian Michels
Assignee: Maximilian Michels
 Fix For: kubernetes-operator-1.8.0


To avoid starvation of pipelines when the Kubernetes cluster runs out of 
resources, new scaling attempts should be stopped. 

The Rescaling API will probably prevent most of these cases, but we will also 
have to double-check there. 

For the config-based parallelism overrides, we have pretty good heuristics in 
the operator to check in Kubernetes for the approximate number of free cluster 
resources and the required scaling costs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-33770) Autoscaler logs are full of deprecated key warnings

2023-12-07 Thread Maximilian Michels (Jira)
Maximilian Michels created FLINK-33770:
--

 Summary: Autoscaler logs are full of deprecated key warnings
 Key: FLINK-33770
 URL: https://issues.apache.org/jira/browse/FLINK-33770
 Project: Flink
  Issue Type: Bug
  Components: Autoscaler, Kubernetes Operator
Affects Versions: kubernetes-operator-1.7.0
Reporter: Maximilian Michels
Assignee: Maximilian Michels
 Fix For: kubernetes-operator-1.8.0


We moved all autoscaler configuration from 
{{kubernetes.operator.job.autoscaler.*}} to {{job.autoscaler.*}}. 

With the latest release, the logs are full of warnings like this:

{noformat}
level:  WARN 
logger:  org.apache.flink.configuration.Configuration 
message:  Config uses deprecated configuration key 
'kubernetes.operator.job.autoscaler.target.utilization' instead of proper key 
'job.autoscaler.target.utilization' 
{noformat}

The reason is that the configuration is loaded for every reconciliation.

This configuration is already widely adopted across hundreds of pipelines. I 
propose to remove the deprecation from the config keys and make them "fallback" 
keys instead, which removes the deprecation warning.
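Flink's ConfigOptions builder distinguishes deprecated keys (which log a warning whenever they are used) from fallback keys (which resolve silently). The effective difference can be sketched roughly like this — an illustrative Python model, not Flink's actual code:

```python
import logging

logger = logging.getLogger("configuration")

def get_config(config, key, fallback_keys=(), deprecated_keys=()):
    """Resolve `key`, falling back to older key names. Fallback keys resolve
    silently; deprecated keys emit a warning on every lookup, which is what
    floods the logs when the config is re-read on each reconciliation."""
    if key in config:
        return config[key]
    for old in fallback_keys:
        if old in config:
            return config[old]  # silent: no deprecation warning
    for old in deprecated_keys:
        if old in config:
            logger.warning("Config uses deprecated key '%s' instead of '%s'",
                           old, key)
            return config[old]
    return None

cfg = {"kubernetes.operator.job.autoscaler.target.utilization": "0.7"}
# As a fallback key, the old name resolves without any warning:
print(get_config(
    cfg, "job.autoscaler.target.utilization",
    fallback_keys=("kubernetes.operator.job.autoscaler.target.utilization",)))
```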





[jira] [Created] (FLINK-33710) Autoscaler redeploys pipeline for a NOOP parallelism change

2023-11-30 Thread Maximilian Michels (Jira)
Maximilian Michels created FLINK-33710:
--

 Summary: Autoscaler redeploys pipeline for a NOOP parallelism 
change
 Key: FLINK-33710
 URL: https://issues.apache.org/jira/browse/FLINK-33710
 Project: Flink
  Issue Type: Bug
  Components: Autoscaler, Kubernetes Operator
Affects Versions: kubernetes-operator-1.7.0, kubernetes-operator-1.6.0
Reporter: Maximilian Michels
Assignee: Maximilian Michels
 Fix For: kubernetes-operator-1.8.0


The operator supports two modes to apply autoscaler changes:

# Use the internal Flink config {{pipeline.jobvertex-parallelism-overrides}} 
# Make use of Flink's Rescale API 

For (1), a string containing the actual overrides has to be generated for the 
Flink config. This string has to be deterministic for a given map, but it is not.

Consider the following observed log:

{noformat}
  >>> Event  | Info| SPECCHANGED | SCALE change(s) detected (Diff: 
FlinkDeploymentSpec[flinkConfiguration.pipeline.jobvertex-parallelism-overrides 
: 
92542d1280187bd464274368a5f86977:3,9f979ed859083299d29f281832cb5be0:1,84881d7bda0dc3d44026e37403420039:1,1652184ffd0522859c7840a24936847c:1
 -> 
9f979ed859083299d29f281832cb5be0:1,84881d7bda0dc3d44026e37403420039:1,92542d1280187bd464274368a5f86977:3,1652184ffd0522859c7840a24936847c:1]),
 starting reconciliation. 
{noformat}

The overrides are identical, but the order is different, which triggers a 
redeploy. This does not seem to happen often, but deterministic string 
generation (e.g. sorting by key) is required to prevent such NOOP updates.
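For illustration, a deterministic rendering can be obtained by sorting the overrides by vertex ID before joining them into the config string — a sketch, not the operator's actual implementation:

```python
def render_overrides(overrides):
    """Render pipeline.jobvertex-parallelism-overrides deterministically:
    the same map always yields the same string, regardless of the map's
    insertion/iteration order."""
    return ",".join(f"{vertex}:{parallelism}"
                    for vertex, parallelism in sorted(overrides.items()))

a = {"92542d12": 3, "9f979ed8": 1}
b = {"9f979ed8": 1, "92542d12": 3}  # same entries, different insertion order
print(render_overrides(a) == render_overrides(b))  # True -> no spurious redeploy
```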





Re: [ANNOUNCE] Apache Flink 1.16.3 released

2023-11-30 Thread Maximilian Michels
Thank you Rui for driving this!

On Thu, Nov 30, 2023 at 3:01 AM Rui Fan <1996fan...@gmail.com> wrote:
>
> The Apache Flink community is very happy to announce the release of
> Apache Flink 1.16.3, which is the
> third bugfix release for the Apache Flink 1.16 series.
>
>
>
> Apache Flink® is an open-source stream processing framework for
> distributed, high-performing, always-available, and accurate data
> streaming applications.
>
>
>
> The release is available for download at:
>
> https://flink.apache.org/downloads.html
>
>
>
> Please check out the release blog post for an overview of the
> improvements for this bugfix release:
>
> https://flink.apache.org/2023/11/29/apache-flink-1.16.3-release-announcement/
>
>
>
> The full release notes are available in Jira:
>
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12353259
>
>
>
> We would like to thank all contributors of the Apache Flink community
> who made this release possible!
>
>
>
> Feel free to reach out to the release managers (or respond to this
> thread) with feedback on the release process. Our goal is to
> constantly improve the release process. Feedback on what could be
> improved or things that didn't go so well are appreciated.
>
>
>
> Regards,
>
> Release Manager


Re: [VOTE] FLIP-364: Improve the restart-strategy

2023-11-30 Thread Maximilian Michels
+1 (binding)

-Max

On Thu, Nov 30, 2023 at 9:15 AM Rui Fan <1996fan...@gmail.com> wrote:
>
> +1(binding)
>
> Best,
> Rui
>
> On Mon, Nov 13, 2023 at 11:01 AM Rui Fan <1996fan...@gmail.com> wrote:
>
> > Hi everyone,
> >
> > Thank you to everyone for the feedback on FLIP-364: Improve the
> > restart-strategy[1]
> > which has been discussed in this thread [2].
> >
> > I would like to start a vote for it. The vote will be open for at least 72
> > hours unless there is an objection or not enough votes.
> >
> > [1] https://cwiki.apache.org/confluence/x/uJqzDw
> > [2] https://lists.apache.org/thread/5cgrft73kgkzkgjozf9zfk0w2oj7rjym
> >
> > Best,
> > Rui
> >


Re: [DISCUSS] FLIP-395: Migration to GitHub Actions

2023-11-24 Thread Maximilian Michels
Thanks for reviving the efforts here Matthias! +1 for the transition
to GitHub Actions.

As for ASF Infra Jenkins, it works fine. Jenkins is extremely
feature-rich. Not sure about the spare capacity though. I know that
for Apache Beam, Google donated a bunch of servers to get additional
build capacity.

-Max


On Thu, Nov 23, 2023 at 10:30 AM Matthias Pohl
 wrote:
>
> Btw. even though we've been focusing on GitHub Actions with this FLIP, I'm
> curious whether somebody has experience with Apache Infra's Jenkins
> deployment. The discussion I found about Jenkins [1] is quite outdated
> (2014). I haven't worked with it myself but could imagine that there are
> some features provided through plugins which are missing in GitHub Actions.
>
> [1] https://lists.apache.org/thread/vs81xdhn3q777r7x9k7wd4dyl9kvoqn4
>
> On Tue, Nov 21, 2023 at 4:19 PM Matthias Pohl 
> wrote:
>
> > That's a valid point. I updated the FLIP accordingly:
> >
> >> Currently, the secrets (e.g. for S3 access tokens) are maintained by
> >> certain PMC members with access to the corresponding configuration in the
> >> Azure CI project. This responsibility will be moved to Apache Infra. They
> >> are in charge of handling secrets in the Apache organization. As a
> >> consequence, updating secrets is becoming a bit more complicated. This can
> >> be still considered an improvement from a legal standpoint because the
> >> responsibility is transferred from an individual company (i.e. Ververica
> >> who's the maintainer of the Azure CI project) to the Apache Foundation.
> >
> >
> > On Tue, Nov 21, 2023 at 3:37 PM Martijn Visser 
> > wrote:
> >
> >> Hi Matthias,
> >>
> >> Thanks for the write-up and for the efforts on this. I really hope
> >> that we can move away from Azure towards GHA for a better integration
> >> as well (directly seeing if a PR can be merged due to CI passing for
> >> example).
> >>
> >> The one thing I'm missing in the FLIP is how we would setup the
> >> secrets for the nightly runs (for the S3 tests, potential tests with
> >> external services etc). My guess is we need to provide the secret to
> >> ASF Infra and then we would be able to refer to them in a pipeline?
> >>
> >> Best regards,
> >>
> >> Martijn
> >>
> >> On Tue, Nov 21, 2023 at 3:05 PM Matthias Pohl
> >>  wrote:
> >> >
> >> > I realized that I mixed up FLIP IDs. FLIP-395 is already reserved [1]. I
> >> > switched to FLIP-396 [2] for the sake of consistency. 8)
> >> >
> >> > [1] https://lists.apache.org/thread/wjd3nbvg6nt93lb0sd52f0lzls6559tv
> >> > [2]
> >> >
> >> https://cwiki.apache.org/confluence/display/FLINK/FLIP-396%3A+Migration+to+GitHub+Actions
> >> >
> >> > On Tue, Nov 21, 2023 at 2:58 PM Matthias Pohl 
> >> > wrote:
> >> >
> >> > > Hi everyone,
> >> > >
> >> > > The Flink community discussed migrating from Azure CI to GitHub
> >> Actions
> >> > > quite some time ago [1]. The efforts around that stalled due to
> >> limitations
> >> > > around self-hosted runner support from Apache Infra’s side. There
> >> were some
> >> > > recent developments on that topic. Apache Infra is experimenting with
> >> > > ephemeral runners now which might enable us to move ahead with GitHub
> >> > > Actions.
> >> > >
> >> > > The goal is to join the trial phase for ephemeral runners and
> >> experiment
> >> > > with our CI workflows in terms of stability and performance. At the
> >> end we
> >> > > can decide whether we want to abandon Azure CI and move to GitHub
> >> Actions
> >> > > or stick to the former one.
> >> > >
> >> > > Nico Weidner and Chesnay laid the groundwork on this topic in the
> >> past. I
> >> > > picked up the work they did and continued experimenting with it in my
> >> own
> >> > > fork XComp/flink [2] the past few weeks. The workflows are in a state
> >> where
> >> > > I think that we start moving the relevant code into Flink’s
> >> repository.
> >> > > Example runs for the basic workflow [3] and the extended (nightly)
> >> workflow
> >> > > [4] are provided.
> >> > >
> >> > > This will bring a few more changes to the Flink contributors. That is
> >> why
> >> > > I wanted to bring this discussion to the mailing list first. I did a
> >> write
> >> > > up on (hopefully) all related topics in FLIP-395 [5].
> >> > >
> >> > > I’m looking forward to your feedback.
> >> > >
> >> > > Matthias
> >> > >
> >> > > [1] https://lists.apache.org/thread/vcyx2nx0mhklqwm827vgykv8pc54gg3k
> >> > >
> >> > > [2] https://github.com/XComp/flink/actions
> >> > >
> >> > > [3] https://github.com/XComp/flink/actions/runs/6926309782
> >> > >
> >> > > [4] https://github.com/XComp/flink/actions/runs/6927443941
> >> > >
> >> > > [5]
> >> > >
> >> https://cwiki.apache.org/confluence/display/FLINK/FLIP-395%3A+Migration+to+GitHub+Actions
> >> > >
> >> > >
> >> > > --
> >> > >
> >> > >
> >> > > *Matthias Pohl*
> >> > > Opensource Software Engineer, *Aiven*
> >> > > matthias.p...@aiven.io|  +49 170 9869525
> >> > > aiven.io 

Re: [VOTE] Apache Flink Kubernetes Operator Release 1.7.0, release candidate #1

2023-11-20 Thread Maximilian Michels
+1 (binding)

1. Downloaded the archives, checksums, and signatures
2. Verified the signatures and checksums
3. Extract and inspect the source code for binaries
4. Compiled and tested the source code via mvn verify
5. Verified license files / headers
6. Deployed helm chart to test cluster
7. Build and ran dynamic autoscaling example image
8. Tested autoscaling without rescaling API

Hit a non-fatal error collecting metrics in the stabilization phase
(this is a new feature), not a release blocker though [1].

-Max

[1] Caused by: org.apache.flink.runtime.rest.util.RestClientException:
[org.apache.flink.runtime.rest.handler.RestHandlerException: Cannot
connect to ResourceManager right now. Please try to refresh.
at 
org.apache.flink.runtime.rest.handler.resourcemanager.AbstractResourceManagerHandler.lambda$getResourceManagerGateway$0(AbstractResourceManagerHandler.java:91)
 at
java.base/java.util.Optional.orElseThrow(Unknown Source)
at 
org.apache.flink.runtime.rest.handler.resourcemanager.AbstractResourceManagerHandler.getResourceManagerGateway(AbstractResourceManagerHandler.java:89)
...

On Mon, Nov 20, 2023 at 5:48 PM Márton Balassi  wrote:
>
> +1 (binding)
>
> - Verified Helm repo works as expected, points to correct image tag, build,
> version
> - Verified basic examples + checked operator logs everything looks as
> expected
> - Verified hashes, signatures and source release contains no binaries
> - Ran built-in tests, built jars + docker image from source successfully
> - Upgraded the operator and the CRD from 1.6.1 to 1.7.0
>
> Best,
> Marton
>
> On Mon, Nov 20, 2023 at 2:03 PM Gyula Fóra  wrote:
>
> > +1 (binding)
> >
> > Verified:
> >  - Release files, maven repo contents, checksums, signature
> >  - Verified and installed from Helm chart
> >  - Ran basic stateful example and verified
> >- Upgrade flow
> >- No errors in logs
> >- Autoscaler (turn on/off, verify configmap cleared correctly)
> >- In-place scaling with 1.18 and adaptive scheduler
> >  - Built from source with Java 11 & 17
> >  - Checked release notes
> >
> > Cheers,
> > Gyula
> >
> > On Fri, Nov 17, 2023 at 1:59 PM Rui Fan <1996fan...@gmail.com> wrote:
> >
> > > +1(non-binding)
> > >
> > > - Downloaded artifacts from dist
> > > - Verified SHA512 checksums
> > > - Verified GPG signatures
> > > - Build the source with java-11 and java-17
> > > - Verified the license header
> > > - Verified that chart and appVersion matches the target release
> > > - RC repo works as Helm rep(helm repo add flink-operator-repo-1.7.0-rc1
> > >
> > >
> > https://dist.apache.org/repos/dist/dev/flink/flink-kubernetes-operator-1.7.0-rc1/
> > > )
> > > - Verified Helm chart can be installed  (helm install
> > > flink-kubernetes-operator
> > > flink-operator-repo-1.7.0-rc1/flink-kubernetes-operator --set
> > > webhook.create=false)
> > > - Submitted the autoscaling demo, the autoscaler works well with rescale
> > > api (kubectl apply -f autoscaling.yaml)
> > > - Download Autoscaler standalone: wget
> > >
> > >
> > https://repository.apache.org/content/repositories/orgapacheflink-1672/org/apache/flink/flink-autoscaler-standalone/1.7.0/flink-autoscaler-standalone-1.7.0.jar
> > > - Ran Autoscaler standalone locally, it works well with rescale api
> > >
> > > Best,
> > > Rui
> > >
> > > On Fri, Nov 17, 2023 at 2:45 AM Mate Czagany  wrote:
> > >
> > > > +1 (non-binding)
> > > >
> > > > - Checked signatures, checksums
> > > > - No binaries found in the source release
> > > > - Verified all source files contain the license header
> > > > - All pom files point to the correct version
> > > > - Verified Helm chart version and appVersion
> > > > - Verified Docker image tag
> > > > - Ran flink-autoscaler-standalone JAR downloaded from the maven
> > > repository
> > > > - Tested autoscaler upscales correctly on load with Flink 1.18
> > rescaling
> > > > API
> > > >
> > > > Thanks,
> > > > Mate
> > > >
> > > > Gyula Fóra  ezt írta (időpont: 2023. nov. 15., Sze,
> > > > 16:37):
> > > >
> > > > > Hi Everyone,
> > > > >
> > > > > Please review and vote on the release candidate #1 for the version
> > > 1.7.0
> > > > of
> > > > > Apache Flink Kubernetes Operator,
> > > > > as follows:
> > > > > [ ] +1, Approve the release
> > > > > [ ] -1, Do not approve the release (please provide specific comments)
> > > > >
> > > > > **Release Overview**
> > > > >
> > > > > As an overview, the release consists of the following:
> > > > > a) Kubernetes Operator canonical source distribution (including the
> > > > > Dockerfile), to be deployed to the release repository at
> > > dist.apache.org
> > > > > b) Kubernetes Operator Helm Chart to be deployed to the release
> > > > repository
> > > > > at dist.apache.org
> > > > > c) Maven artifacts to be deployed to the Maven Central Repository
> > > > > d) Docker image to be pushed to dockerhub
> > > > >
> > > > > **Staging Areas to Review**
> > > > >
> > > > > The staging areas containing the above mentioned artifact

Re: [DISCUSS] FLIP-394: Add Metrics for Connector Agnostic Autoscaling

2023-11-17 Thread Maximilian Michels
Hi Mason,

Thank you for the proposal. This is a highly requested feature to make
the source scaling of Flink Autoscaling generic across all sources.
The current implementation handles every source individually, and if
we don't find any backlog metrics, we default to using busy time only.
At this point Kafka is the only supported source. We collect the
backlog size (pending metrics), as well as the number of available
splits / partitions.

For Kafka, we always read from all splits, but I like how for the
generic interface we take note of both assigned and unassigned splits.
This allows for more flexible integration with other sources where we
might have additional splits we read from at a later point in time.
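As a rough illustration of how such metrics could feed into a scaling decision (the function names and the sizing formula are hypothetical simplifications, not the autoscaler's actual algorithm): the backlog determines how much parallelism is needed, while the total split count caps what is useful.

```python
import math

def max_source_parallelism(assigned_splits, unassigned_splits):
    """A source cannot usefully run with more subtasks than total splits."""
    return assigned_splits + unassigned_splits

def target_source_parallelism(backlog, rate_per_subtask, catchup_seconds,
                              assigned_splits, unassigned_splits):
    """Parallelism needed to drain the backlog within catchup_seconds,
    capped by the total number of splits."""
    needed = math.ceil(backlog / (rate_per_subtask * catchup_seconds))
    return min(max(needed, 1),
               max_source_parallelism(assigned_splits, unassigned_splits))

# 1M pending records, 1000 rec/s per subtask, 5 min catch-up, 8 partitions:
print(target_source_parallelism(1_000_000, 1000, 300, 8, 0))  # 4
```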

Considering Rui's point, I agree it makes sense to outline the
integration with existing sources. Other than that, +1 from my side
for the proposal.

Thanks,
Max

On Fri, Nov 17, 2023 at 4:06 AM Rui Fan <1996fan...@gmail.com> wrote:
>
> Hi Mason,
>
> Thank you for driving this proposal!
>
> Currently, Autoscaler only supports the maximum source parallelism
> of KafkaSource. Introducing the generic metric to support it is good
> to me, +1 for this proposal.
>
> I have a question:
> You added the metric in the flink repo, and Autoscaler will fetch this
> metric. But I didn't see any connector to register this metric. Currently,
> only IteratorSourceEnumerator setUnassignedSplitsGauge,
> and KafkaSource didn't register it. IIUC, if we don't do it, autoscaler
> still cannot fetch this metric, right?
>
> If yes, I suggest this FLIP includes registering metric part, otherwise
> these metrics still cannot work.
>
> Please correct me if I misunderstood anything, thanks~
>
> Best,
> Rui
>
> On Fri, Nov 17, 2023 at 6:53 AM Mason Chen  wrote:
>
> > Hi all,
> >
> > I would like to start a discussion on FLIP-394: Add Metrics for Connector
> > Agnostic Autoscaling [1].
> >
> > This FLIP recommends adding two metrics to make autoscaling work for
> > bounded split source implementations like IcebergSource. These metrics are
> > required by the Flink Kubernetes Operator autoscaler algorithm [2] to
> > retrieve information for the backlog and the maximum source parallelism.
> > The changes would affect the `@PublicEvolving` `SplitEnumeratorMetricGroup`
> > API of the source connector framework.
> >
> > Best,
> > Mason
> >
> > [1]
> >
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-394%3A+Add+Metrics+for+Connector+Agnostic+Autoscaling
> > [2]
> >
> > https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/autoscaler/#limitations
> >


[jira] [Created] (FLINK-33572) Minimize ConfigMap API operations for autoscaler state

2023-11-16 Thread Maximilian Michels (Jira)
Maximilian Michels created FLINK-33572:
--

 Summary: Minimize ConfigMap API operations for autoscaler state
 Key: FLINK-33572
 URL: https://issues.apache.org/jira/browse/FLINK-33572
 Project: Flink
  Issue Type: Improvement
  Components: Autoscaler, Kubernetes Operator
Affects Versions: kubernetes-operator-1.6.0
Reporter: Maximilian Michels
Assignee: Maximilian Michels
 Fix For: kubernetes-operator-1.7.0


The flush operation, newly introduced after refactoring the autoscaler 
interfaces, optimizes the number of write operations to the underlying state 
store. A couple of further optimizations:

1. Any writes should be deferred until flush is called.
2. The flush routine should detect whether a write is needed and skip writing 
if there are no changes.
3. Clearing state should only require one write operation.
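The deferred-write pattern behind these points can be sketched roughly as follows (an illustrative model with hypothetical names, not the operator's actual state store):

```python
class ConfigMapStateStore:
    """Sketch of a state store that defers and deduplicates writes:
    mutations only touch a local cache; flush() issues at most one API
    write, and only if something actually changed since the last flush."""

    def __init__(self):
        self._cache = {}
        self._dirty = False
        self.api_writes = 0  # stand-in counter for Kubernetes API calls

    def put(self, key, value):
        if self._cache.get(key) != value:
            self._cache[key] = value
            self._dirty = True  # deferred: no API call yet

    def clear(self):
        if self._cache:
            self._cache = {}
            self._dirty = True  # clearing costs one write, at flush time

    def flush(self):
        if not self._dirty:
            return  # nothing changed: skip the write entirely
        self.api_writes += 1
        self._dirty = False

store = ConfigMapStateStore()
store.put("scalingHistory", "v1")
store.put("scalingHistory", "v1")  # no-op: cache already holds this value
store.flush()
store.flush()                      # no changes since last flush: no write
print(store.api_writes)  # 1
```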





[jira] [Created] (FLINK-33522) Savepoint upgrade mode fails despite the savepoint succeeding

2023-11-10 Thread Maximilian Michels (Jira)
Maximilian Michels created FLINK-33522:
--

 Summary: Savepoint upgrade mode fails despite the savepoint 
succeeding
 Key: FLINK-33522
 URL: https://issues.apache.org/jira/browse/FLINK-33522
 Project: Flink
  Issue Type: Bug
  Components: Kubernetes Operator
Affects Versions: kubernetes-operator-1.6.1, kubernetes-operator-1.6.0
Reporter: Maximilian Michels
Assignee: Maximilian Michels
 Fix For: kubernetes-operator-1.7.0


Under certain circumstances, savepoint creation can succeed but the job fails 
afterwards. One example is when there are messages being distributed by the 
source coordinator to finished tasks. This is possibly a Flink bug although 
it's not clear how to solve this issue.

After the savepoint succeeded Flink fails the job like this:
{noformat}
Source (1/2) 
(cd4d56ddb71c0e763cc400bcfe2fd8ac_4081cf0163fcce7fe6af0cf07ad2d43c_0_0) 
switched from RUNNING to FAILED on host-taskmanager-1-1 @ ip(dataPort=36519). 
{noformat}
{noformat}
An OperatorEvent from an OperatorCoordinator to a task was lost. Triggering 
task failover to ensure consistency. Event: 'AddSplitEvents[[[B@722a23fa]]', 
targetTask: Source (1/2) - execution #0
Caused by:
org.apache.flink.runtime.operators.coordination.TaskNotRunningException: Task 
is not running, but in state FINISHED
   at 
org.apache.flink.runtime.taskmanager.Task.deliverOperatorEvent(Task.java:1502)
   at org.apache.flink.runtime.taskexecutor.TaskExecutor.sendOperatorEventToTask
{noformat}

Inside the operator this is processed as:

{noformat}
java.util.concurrent.CompletionException: 
org.apache.flink.runtime.scheduler.stopwithsavepoint.StopWithSavepointStoppingException:
 A savepoint has been created at: s3://..., but the corresponding job 
1b1a3061194c62ded6e2fe823b61b2ea failed during stopping. The savepoint is 
consistent, but might have uncommitted transactions. If you want to commit the 
transaction please restart a job from this savepoint. 

  
java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:395) 
  
java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2022) 
  
org.apache.flink.kubernetes.operator.service.AbstractFlinkService.cancelJob(AbstractFlinkService.java:319)
 
  
org.apache.flink.kubernetes.operator.service.NativeFlinkService.cancelJob(NativeFlinkService.java:121)
 
  
org.apache.flink.kubernetes.operator.reconciler.deployment.ApplicationReconciler.cancelJob(ApplicationReconciler.java:223)
 
  
org.apache.flink.kubernetes.operator.reconciler.deployment.AbstractJobReconciler.reconcileSpecChange(AbstractJobReconciler.java:122)
 
 
org.apache.flink.kubernetes.operator.reconciler.deployment.AbstractFlinkResourceReconciler.reconcile(AbstractFlinkResourceReconciler.java:163)
  
org.apache.flink.kubernetes.operator.controller.FlinkDeploymentController.reconcile(FlinkDeploymentController.java:136)
 
  
org.apache.flink.kubernetes.operator.controller.FlinkDeploymentController.reconcile(FlinkDeploymentController.java:56)
 
  
io.javaoperatorsdk.operator.processing.Controller$1.execute(Controller.java:138)
 
  
io.javaoperatorsdk.operator.processing.Controller$1.execute(Controller.java:96) 
  
org.apache.flink.kubernetes.operator.metrics.OperatorJosdkMetrics.timeControllerExecution(OperatorJosdkMetrics.java:80)
 
  
io.javaoperatorsdk.operator.processing.Controller.reconcile(Controller.java:95) 
  
io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.reconcileExecution(ReconciliationDispatcher.java:139)
 
  
io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleReconcile(ReconciliationDispatcher.java:119)
 
  
io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleDispatch(ReconciliationDispatcher.java:89)
 
  
io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleExecution(ReconciliationDispatcher.java:62)
 
  
io.javaoperatorsdk.operator.processing.event.EventProcessor$ReconcilerExecutor.run(EventProcessor.java:414)
 
  
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) 
  
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) 
  java.lang.Thread.run(Thread.java:829) 
{noformat}

Subsequently we get the following because HA metadata is not available anymore. 
It has been cleared up after the terminal job failure:

{noformat}
org.apache.flink.kubernetes.operator.exception.RecoveryFailureException","message":"HA
 metadata not available to restore from last state. It is possible that the job 
has finished or terminally failed, or the configmaps have been deleted. 
{noformat}

The deployment needs to be manually restored from a savepoint.

Re: [DISCUSS] Kubernetes Operator 1.7.0 release planning

2023-11-01 Thread Maximilian Michels
+1 for targeting the release as soon as possible. Given the effort
that Rui has undergone to decouple the autoscaling implementation, it
makes sense to also include an alternative implementation with the
release. In the long run, I wonder whether the standalone
implementation should even be part of the Kubernetes operator
repository. It can be hosted in a different repository and simply
consume the flink-autoscaler jar. But the same applies to the
flink-autoscaler module. For this release, we can keep everything
together.

I have a minor issue [1] I would like to include in the release.

-Max

[1] https://issues.apache.org/jira/browse/FLINK-33429

On Tue, Oct 31, 2023 at 11:14 AM Rui Fan <1996fan...@gmail.com> wrote:
>
> Thanks Gyula for driving this release!
>
> I'd like to check with you and community, could we
> postpone the code freeze by a week?
>
> I'm developing the FLINK-33099[1], and the prod code is done.
> I need some time to develop the tests. I hope this feature is included in
> 1.7.0 for two main reasons:
>
> 1. We have completed the decoupling of the autoscaler and
> kubernetes-operator in 1.7.0. During the decoupling period, we modified
> a large number of autoscaler-related interfaces. The standalone autoscaler
> is an autoscaler process that can run independently. It can help us confirm
> whether the new interface is reasonable.
> 2. 1.18.0 was recently released, standalone autoscaler allows more users to
> play autoscaler and in-place rescale.
>
> I have created a draft PR[2] for FLINK-33099, it just includes prod code.
> I have run it manually, it works well. And I will try my best to finish all
> unit tests before Friday, but must finish all unit tests before next Monday
> at the latest.
>
> WDYT?
>
> I'm deeply sorry for the request to postpone the release.
>
> [1] https://issues.apache.org/jira/browse/FLINK-33099
> [2] https://github.com/apache/flink-kubernetes-operator/pull/698
>
> Best,
> Rui
>
> On Tue, Oct 31, 2023 at 4:10 PM Samrat Deb  wrote:
>
> > Thank you Gyula
> >
> > (+1 non-binding) in support of you taking on the role of release manager.
> >
> > > I think this is reasonable as I am not aware of any big features / bug
> > fixes being worked on right now. Given the size of the changes related to
> > the autoscaler module refactor we should try to focus the remaining time on
> > testing.
> >
> > I completely agree with you. Since the changes are quite extensive, it's
> > crucial to allocate more time for thorough testing and verification.
> >
> > Regarding working with you for the release, I might not have the necessary
> > privileges for that.
> >
> > However, I'd be more than willing to assist with testing the changes,
> > validating various features, and checking for any potential regressions in
> > the flink-kubernetes-operator.
> > Just let me know how I can support the testing efforts.
> >
> > Bests,
> > Samrat
> >
> >
> > On Tue, 31 Oct 2023 at 12:59 AM, Gyula Fóra  wrote:
> >
> > > Hi all!
> > >
> > > I would like to kick off the release planning for the operator 1.7.0
> > > release. We have added quite a lot of new functionality over the last few
> > > weeks and I think the operator is in a good state to kick this off.
> > >
> > > Based on the original release schedule we had Nov 1 as the proposed
> > feature
> > > freeze date and Nov 7 as the date for the release cut / rc1.
> > >
> > > I think this is reasonable as I am not aware of any big features / bug
> > > fixes being worked on right now. Given the size of the changes related to
> > > the autoscaler module refactor we should try to focus the remaining time
> > on
> > > testing.
> > >
> > > I am happy to volunteer as a release manager but I am of course open to
> > > working together with someone on this.
> > >
> > > What do you think?
> > >
> > > Cheers,
> > > Gyula
> > >
> >


[jira] [Created] (FLINK-33429) Metric collection during stabilization phase may error due to missing metrics

2023-11-01 Thread Maximilian Michels (Jira)
Maximilian Michels created FLINK-33429:
--

 Summary: Metric collection during stabilization phase may error 
due to missing metrics
 Key: FLINK-33429
 URL: https://issues.apache.org/jira/browse/FLINK-33429
 Project: Flink
  Issue Type: Bug
  Components: Autoscaler
Affects Versions: kubernetes-operator-1.7.0
Reporter: Maximilian Michels
Assignee: Maximilian Michels
 Fix For: kubernetes-operator-1.7.0


The new code for the 1.7.0 release introduces metric collection during the 
stabilization phase to allow sampling the observed true processing rate. 
Metrics might not be fully initialized during that phase, as evidenced by 
errors during metric collection. The following error is thrown: 

{noformat}
java.lang.RuntimeException: Could not find required metric 
NUM_RECORDS_OUT_PER_SEC for 667f5d5aa757fb217b92c06f0f5d2bf2 
{noformat}

To prevent these errors from shadowing actual errors, we should detect and 
ignore this recoverable exception.
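The intended handling can be sketched as follows (the exception and function names are hypothetical, not the actual operator code): during stabilization, a missing metric is treated as recoverable and the sample is simply skipped.

```python
class MetricNotFoundException(Exception):
    """Raised when a required metric is not yet reported (hypothetical name)."""

def collect_metrics(fetch, stabilizing, log):
    """During the stabilization phase, metrics may not be initialized yet;
    treat a missing metric as recoverable and skip the sample instead of
    surfacing an error that could shadow real failures."""
    try:
        return fetch()
    except MetricNotFoundException as e:
        if stabilizing:
            log.append(f"Skipping sample during stabilization: {e}")
            return None  # try again on the next collection cycle
        raise  # outside the stabilization phase this is a real error

log = []
def fetch():
    raise MetricNotFoundException("NUM_RECORDS_OUT_PER_SEC")

print(collect_metrics(fetch, stabilizing=True, log=log))  # None, error logged
```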





Re: off for a week

2023-10-26 Thread Maximilian Michels
Have a great time off, Etienne!

On Thu, Oct 26, 2023 at 3:38 PM Etienne Chauchot  wrote:
>
> Hi,
>
> FYI, I'll be off and unresponsive for a week starting tomorrow evening.
> For ongoing work, please ping me before tomorrow evening or within a week
>
> Best
>
> Etienne


Re: [VOTE] Apache Flink Kubernetes Operator Release 1.6.1, release candidate #1

2023-10-26 Thread Maximilian Michels
+1 (binding)

1. Downloaded the archives, checksums, and signatures
2. Verified the signatures and checksums ( gpg --recv-keys
B2D64016B940A7E0B9B72E0D7D0528B28037D8BC )
3. Extract and inspect the source code for binaries
4. Compiled and tested the source code via mvn verify
5. Verified license files / headers
6. Deployed helm chart to test cluster
7. Ran example job
8. Tested autoscaling without rescaling API

@Rui Can you add your key to
https://dist.apache.org/repos/dist/release/flink/KEYS ?

-Max

On Thu, Oct 26, 2023 at 1:53 PM Márton Balassi  wrote:
>
> Thank you, team. @David Radley: Not having Rui's key signed is not ideal,
> but is acceptable for the release.
>
> +1 (binding)
>
> - Verified Helm repo works as expected, points to correct image tag, build,
> version
> - Verified basic examples + checked operator logs everything looks as
> expected
> - Verified hashes, signatures and source release contains no binaries
> - Ran built-in tests, built jars + docker image from source successfully
>
> Best,
> Marton
>
> On Thu, Oct 26, 2023 at 12:24 PM David Radley 
> wrote:
>
> > Hi,
> > I downloaded the artifacts.
> >
> >   *   I did an install of the operator and ran the basic sample
> >   *   I checked the checksums
> >   *   Checked the GPG signatures
> >   *   Ran the UI
> >   *   Ran a Twistlock scan
> >   *   I installed 1.6 then did a helm upgrade
> >   *   I have not managed to do the source build and subsequent install yet.
> > I wanted to check these 2 things are what you would expect:
> >
> >   1.  I followed link
> > https://github.com/apache/flink-kubernetes-operator/pkgs/container/flink-kubernetes-operator/139454270?tag=51eeae1
> > And notice that it does not have a description . Is this correct?
> >
> >   1.  I get this in the gpg verification . Is this ok?
> >
> >
> > gpg --verify flink-kubernetes-operator-1.6.1-src.tgz.asc
> >
> > gpg: assuming signed data in 'flink-kubernetes-operator-1.6.1-src.tgz'
> >
> > gpg: Signature made Fri 20 Oct 2023 04:07:48 PDT
> >
> > gpg:using RSA key B2D64016B940A7E0B9B72E0D7D0528B28037D8BC
> >
> > gpg: Good signature from "Rui Fan fan...@apache.org > fan...@apache.org>" [unknown]
> >
> > gpg: WARNING: This key is not certified with a trusted signature!
> >
> > gpg:  There is no indication that the signature belongs to the
> > owner.
> >
> > Primary key fingerprint: B2D6 4016 B940 A7E0 B9B7  2E0D 7D05 28B2 8037 D8BC
> >
> >
> >
> >
> > Hi Everyone,
> >
> > Please review and vote on the release candidate #1 for the version 1.6.1 of
> > Apache Flink Kubernetes Operator,
> > as follows:
> > [ ] +1, Approve the release
> > [ ] -1, Do not approve the release (please provide specific comments)
> >
> > **Release Overview**
> >
> > As an overview, the release consists of the following:
> > a) Kubernetes Operator canonical source distribution (including the
> > Dockerfile), to be deployed to the release repository at dist.apache.org
> > b) Kubernetes Operator Helm Chart to be deployed to the release repository
> > at dist.apache.org
> > c) Maven artifacts to be deployed to the Maven Central Repository
> > d) Docker image to be pushed to dockerhub
> >
> > **Staging Areas to Review**
> >
> > The staging areas containing the above mentioned artifacts are as follows,
> > for your review:
> > * All artifacts for a,b) can be found in the corresponding dev repository
> > at dist.apache.org [1]
> > * All artifacts for c) can be found at the Apache Nexus Repository [2]
> > * The docker image for d) is staged on github [3]
> >
> > All artifacts are signed with the
> > key B2D64016B940A7E0B9B72E0D7D0528B28037D8BC [4]
> >
> > Other links for your review:
> > * source code tag "release-1.6.1-rc1" [5]
> > * PR to update the website Downloads page to
> > include Kubernetes Operator links [6]
> > * PR to update the doc version of flink-kubernetes-operator [7]
> >
> > **Vote Duration**
> >
> > The voting time will run for at least 72 hours.
> > It is adopted by majority approval, with at least 3 PMC affirmative votes.
> >
> > **Note on Verification**
> >
> > You can follow the basic verification guide here[8].
> > Note that you don't need to verify everything yourself, but please make
> > note of what you have tested together with your +1/-1 vote.
> >
> > [1]
> >
> > https://dist.apache.org/repos/dist/dev/flink/flink-kubernetes-operator-1.6.1-rc1/
> > [2]
> > https://repository.apache.org/content/repositories/orgapacheflink-1663/
> > [3]
> >
> > https://github.com/apache/flink-kubernetes-operator/pkgs/container/flink-kubernetes-operator/139454270?tag=51eeae1
> > [4] https://dist.apache.org/repos/dist/release/flink/KEYS
> > [5]
> > https://github.com/apache/flink-kubernetes-operator/tree/release-1.6.1-rc1
> > [6] https://github.com/apache/flink-web/pull/690
> > [7] https://github.com/apache/flink-kubernetes-operator/pull/687
> > [8]
> >
> > https://cwiki.apache.org/confluence/display/FLINK/Verifying+a+Flink+Kubernetes+Operator+Release
> >
> > Best,
> > Rui
> >
>

Re: [DISCUSS] FLIP-364: Improve the restart-strategy

2023-10-19 Thread Maximilian Michels
Hey Rui,

+1 for making exponential backoff the default. I agree with Konstantin
that retrying forever is a good default for exponential backoff
because oftentimes the issue will resolve eventually. The purpose of
exponential backoff is precisely to continue to retry without causing
too much load. However, I'm not against adding an optional max number
of retries.

-Max

On Thu, Oct 19, 2023 at 11:35 AM Konstantin Knauf  wrote:
>
> Hi Rui,
>
> Thank you for this proposal and working on this. I also agree that
> exponential back off makes sense as a new default in general. I think
> restarting indefinitely (no max attempts) makes sense by default, though,
> but of course allowing users to change is valuable.
>
> So, overall +1.
>
> Cheers,
>
> Konstantin
>
> Am Di., 17. Okt. 2023 um 07:11 Uhr schrieb Rui Fan <1996fan...@gmail.com>:
>
> > Hi all,
> >
> > I would like to start a discussion on FLIP-364: Improve the
> > restart-strategy[1]
> >
> > As we know, the restart-strategy is critical for flink jobs, it mainly
> > has two functions:
> > 1. When an exception occurs in the flink job, quickly restart the job
> > so that the job can return to the running state.
> > 2. When a job cannot be recovered after frequent restarts within
> > a certain period of time, Flink will not retry but will fail the job.
> >
> > The current restart-strategy support for function 2 has some issues:
> > 1. The exponential-delay strategy doesn't have a max-attempts mechanism,
> > which means that Flink will restart indefinitely even if it fails frequently.
> > 2. For multi-region streaming jobs and all batch jobs, the failure of
> > each region will increase the total number of job failures by +1,
> > even if these failures occur at the same time. If the number of
> > failures increases too quickly, it will be difficult to set a reasonable
> > number of retries.
> > If the maximum number of failures is set too low, the job can easily
> > reach the retry limit, causing the job to fail. If set too high, some jobs
> > will never fail.
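The second issue above — simultaneous per-region failures each adding +1 to the failure count — can be illustrated with a toy model. Counting every region failure individually exhausts a retry budget much faster than merging failures that occur within the same short window (the merge-window approach here is purely illustrative, not the FLIP's proposal):

```python
def count_failures(failure_times_s, merge_window_s=0.0):
    """Count failures, merging any failure that occurs within
    merge_window_s of the previously counted one into that attempt."""
    count = 0
    last_counted = None
    for t in sorted(failure_times_s):
        if last_counted is None or t - last_counted > merge_window_s:
            count += 1
            last_counted = t
    return count

# 10 regions of one job all fail at (almost) the same instant:
times = [100.0 + i * 0.01 for i in range(10)]
count_failures(times)                    # -> 10: budget shrinks by 10 at once
count_failures(times, merge_window_s=1)  # -> 1: counted as a single failure
```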
> >
> > In addition, when the above two problems are solved, we can also
> > discuss whether exponential-delay can replace fixed-delay as the
> > default restart-strategy. In theory, exponential-delay is smarter and
> > friendlier than fixed-delay.
> >
> > I also thank Zhu Zhu for his suggestions on the option name in
> > FLINK-32895[2] in advance.
> >
> > Looking forward to and welcome everyone's feedback and suggestions, thank
> > you.
> >
> > [1] https://cwiki.apache.org/confluence/x/uJqzDw
> > [2] https://issues.apache.org/jira/browse/FLINK-32895
> >
> > Best,
> > Rui
> >
>
>
> --
> https://twitter.com/snntrable
> https://github.com/knaufk


Re: [VOTE] FLIP-361: Improve GC Metrics

2023-09-14 Thread Maximilian Michels
+1 (binding)

On Thu, Sep 14, 2023 at 4:26 AM Venkatakrishnan Sowrirajan
 wrote:
>
> +1 (non-binding)
>
> On Wed, Sep 13, 2023, 6:55 PM Matt Wang  wrote:
>
> > +1 (non-binding)
> >
> >
> > Thanks for driving this FLIP
> >
> >
> >
> >
> > --
> >
> > Best,
> > Matt Wang
> >
> >
> >  Replied Message 
> > | From | Xintong Song |
> > | Date | 09/14/2023 09:54 |
> > | To |  |
> > | Subject | Re: [VOTE] FLIP-361: Improve GC Metrics |
> > +1 (binding)
> >
> > Best,
> >
> > Xintong
> >
> >
> >
> > On Thu, Sep 14, 2023 at 2:40 AM Samrat Deb  wrote:
> >
> > +1 ( non binding)
> >
> > These improved GC metrics will be a great addition.
> >
> > Bests,
> > Samrat
> >
> > On Wed, 13 Sep 2023 at 7:58 PM, ConradJam  wrote:
> >
> > +1 (non-binding)
> > gc metrics help with autoscale tuning features
> >
> > Chen Zhanghao wrote on Wed, Sep 13, 2023, at 22:16:
> >
> > +1 (non-binding). Looking forward to it
> >
> > Best,
> > Zhanghao Chen
> > 
> > From: Gyula Fóra 
> > Sent: September 13, 2023, 21:16
> > To: dev 
> > Subject: [VOTE] FLIP-361: Improve GC Metrics
> >
> > Hi All!
> >
> > Thanks for all the feedback on FLIP-361: Improve GC Metrics [1][2]
> >
> > I'd like to start a vote for it. The vote will be open for at least 72
> > hours unless there is an objection or insufficient votes.
> >
> > Cheers,
> > Gyula
> >
> > [1]
> >
> >
> >
> >
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-361%3A+Improve+GC+Metrics
> > [2]
> > https://lists.apache.org/thread/qqqv54vyr4gbp63wm2d12q78m8h95xb2
> >
> >
> >
> > --
> > Best
> >
> > ConradJam
> >
> >
> >


Re: [VOTE] FLIP-334: Decoupling autoscaler and kubernetes and support the Standalone Autoscaler

2023-09-13 Thread Maximilian Michels
+1 (binding)

On Wed, Sep 13, 2023 at 12:28 PM Gyula Fóra  wrote:
>
> +1 (binding)
>
> Gyula
>
> On Wed, 13 Sep 2023 at 09:33, Matt Wang  wrote:
>
> > Thank you for driving this FLIP,
> >
> > +1 (non-binding)
> >
> >
> > --
> >
> > Best,
> > Matt Wang
> >
> >
> >  Replied Message 
> > | From | ConradJam |
> > | Date | 09/13/2023 15:28 |
> > | To |  |
> > | Subject | Re: [VOTE] FLIP-334: Decoupling autoscaler and kubernetes and
> > support the Standalone Autoscaler |
> > best idea
> > +1 (non-binding)
> >
> > Ahmed Hamdy wrote on Wed, Sep 13, 2023, at 15:23:
> >
> > Hi Rui,
> > I have gone through the thread.
> > +1 (non-binding)
> >
> > Best Regards
> > Ahmed Hamdy
> >
> >
> > On Wed, 13 Sept 2023 at 03:53, Rui Fan <1996fan...@gmail.com> wrote:
> >
> > Hi all,
> >
> > Thanks for all the feedback about the FLIP-334:
> > Decoupling autoscaler and kubernetes and
> > support the Standalone Autoscaler[1].
> > This FLIP was discussed in [2].
> >
> > I'd like to start a vote for it. The vote will be open for at least 72
> > hours (until Sep 16th 11:00 UTC+8) unless there is an objection or
> > insufficient votes.
> >
> > [1]
> >
> >
> >
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-334+%3A+Decoupling+autoscaler+and+kubernetes+and+support+the+Standalone+Autoscaler
> > [2] https://lists.apache.org/thread/kmm03gls1vw4x6vk1ypr9ny9q9522495
> >
> > Best,
> > Rui
> >
> >
> >
> >
> > --
> > Best
> >
> > ConradJam
> >


Re: [DISCUSS] FLIP-334 : Decoupling autoscaler and kubernetes

2023-09-06 Thread Maximilian Michels
Hey Rui, hey Samrat,

I want to ensure this is not just an exercise but has actual benefits
for the community. In the past, I've seen that the effort stops half
way through, the refactoring gets done with some regressions, but
actual alternative implementations based on the new design never
follow.

We need to go through these phases for the FLIP to be meaningful:

1. Decouple autoscaler from current autoscaler (generalization)
2. Ensure 100% functionality and test coverage of Kubernetes implementation
3. Interface with another backend (e.g. YARN or standalone)

If we don't follow through with this plan, I'm not sure we are better
off than with the current implementation. Apologies if I'm being a bit
strict here but the autoscaling code has become a critical
infrastructure component. We need to carefully weigh the pros and cons
here to avoid risks for our users, some of them using this code in
production and relying on it on a day to day basis.

That said, we are open to following through with the FLIP and we can
definitely help review code changes and build on the new design.

-Max


On Wed, Sep 6, 2023 at 11:26 AM Rui Fan <1996fan...@gmail.com> wrote:
>
> Hi Max,
>
> As the FLIP mentioned, we have the plan to add the
> alternative implementation.
>
> First of all, we will develop a generic autoscaler. This generic
> autoscaler will not have knowledge of specific jobs, and users
> will have the flexibility to pass the JobAutoScalerContext
> when utilizing the generic autoscaler. Communication with
> Flink jobs can be achieved through the RestClusterClient.
>
>- The generic ScalingRealizer is based on the rescale API (FLIP-291).
>- The generic EventHandler is based on the logger.
>- The generic StateStore is based on the Heap. This means that the state
>information is stored in memory and can be lost if the autoscaler restarts.
>
>
> > Secondly, for the YARN implementation, as Samrat mentioned,
> > there is currently no flink-yarn-operator, and we cannot
> > easily obtain the job list. We are not yet sure how to manage
> > YARN's Flink jobs. In order to keep the FLIP from becoming too large,
> > after confirming with Gyula and Samrat, it was decided
> > that the current FLIP will not implement the automated
> > yarn-autoscaler; that will be a separate FLIP in the future.
>
>
> After this part is finished, Flink users or other Flink platforms can easily
> use the autoscaler: they just pass the Context, and the autoscaler
> can find the Flink job using the RestClient.
>
> The first part will be done in this FLIP. And we can discuss
> whether the second part should be done in this FLIP as well.
>
> Best,
> Rui
>
> On Wed, Sep 6, 2023 at 4:34 AM Samrat Deb  wrote:
>
> > Hi Max,
> >
> > > are we planning to add an alternative implementation
> > against the new interfaces?
> >
> > Yes, we are simultaneously working on the YARN implementation using the
> > interface. During the initial interface design, we encountered some
> > anomalies while implementing it in YARN.
> >
> > Once the interfaces are finalized, we will proceed to raise a pull request
> > (PR) for YARN as well.
> >
> > Our initial approach was to create a decoupled interface as part of
> > FLIP-334 and then implement it for YARN in the subsequent phase.
> > However, if you recommend combining both phases, we can certainly consider
> > that option.
> >
> > We look forward to hearing your thoughts on whether to have YARN
> > implementation as part of FLIP-334 or seperate one ?
> >
> > Bests
> > Samrat
> >
> >
> >
> > On Tue, Sep 5, 2023 at 8:41 PM Maximilian Michels  wrote:
> >
> > > Thanks Rui for the update!
> > >
> > > Alongside with the refactoring to decouple autoscaler logic from the
> > > deployment logic, are we planning to add an alternative implementation
> > > against the new interfaces? I think the best way to get the interfaces
> > > right, is to have an alternative implementation in addition to
> > > Kubernetes. YARN or a standalone mode implementation were already
> > > mentioned. Ultimately, this is the reason we are doing the
> > > refactoring. Without a new implementation, it becomes harder to
> > > justify the refactoring work.
> > >
> > > Cheers,
> > > Max
> > >
> > > On Tue, Sep 5, 2023 at 9:48 AM Rui Fan  wrote:
> > > >
> > > > After discussing this FLIP-334[1] offline with Gyula and Max,
> > > > I updated the FLIP based on the latest conclusion.
> > > >
> > > > Big thanks to Gyula and Max for their professional advice!

Re: [DISSCUSS] Kubernetes Operator Flink Version Support Policy

2023-09-05 Thread Maximilian Michels
+1 Sounds good! Four releases give a decent amount of time to migrate
to the next Flink version.

On Tue, Sep 5, 2023 at 5:33 PM Őrhidi Mátyás  wrote:
>
> +1
>
> On Tue, Sep 5, 2023 at 8:03 AM Thomas Weise  wrote:
>
> > +1, thanks for the proposal
> >
> > On Tue, Sep 5, 2023 at 8:13 AM Gyula Fóra  wrote:
> >
> > > Hi All!
> > >
> > > @Maximilian Michels  has raised the question of Flink
> > > version support in the operator before the last release. I would like to
> > > open this discussion publicly so we can finalize this before the next
> > > release.
> > >
> > > Background:
> > > Currently the Flink Operator supports all Flink versions since Flink
> > 1.13.
> > > While this is great for the users, it introduces a lot of backward
> > > compatibility related code in the operator logic and also adds
> > considerable
> > > time to the CI. We should strike a reasonable balance here that allows us
> > > to move forward and eliminate some of this tech debt.
> > >
> > > In the current model it is also impossible to support all features for
> > all
> > > Flink versions which leads to some confusion over time.
> > >
> > > Proposal:
> > > Since it's a key feature of the kubernetes operator to support several
> > > versions at the same time, I propose to support the last 4 stable Flink
> > > minor versions. Currently this would mean to support Flink 1.14-1.17 (and
> > > drop 1.13 support). When Flink 1.18 is released we would drop 1.14
> > support
> > > and so on. Given the Flink release cadence this means about 2 year
> > support
> > > for each Flink version.
> > >
> > > What do you think?
> > >
> > > Cheers,
> > > Gyula
> > >
> >


Re: [DISCUSS] FLIP-334 : Decoupling autoscaler and kubernetes

2023-09-05 Thread Maximilian Michels
Thanks Rui for the update!

Alongside with the refactoring to decouple autoscaler logic from the
deployment logic, are we planning to add an alternative implementation
against the new interfaces? I think the best way to get the interfaces
right, is to have an alternative implementation in addition to
Kubernetes. YARN or a standalone mode implementation were already
mentioned. Ultimately, this is the reason we are doing the
refactoring. Without a new implementation, it becomes harder to
justify the refactoring work.

Cheers,
Max

On Tue, Sep 5, 2023 at 9:48 AM Rui Fan  wrote:
>
> After discussing this FLIP-334[1] offline with Gyula and Max,
> I updated the FLIP based on the latest conclusion.
>
> Big thanks to Gyula and Max for their professional advice!
>
> > Does the interface function of handlerRecommendedParallelism
> > in AutoScalerEventHandler conflict with
> > handlerScalingFailure/handlerScalingReport (one of them
> > handles the event of scale failure, and the other handles
> > the event of scale success).
> Hi Matt,
>
> You can take a look at the FLIP, I think the issue has been fixed.
> Currently, we introduced the ScalingRealizer and
> AutoScalerEventHandler interface.
>
> The ScalingRealizer handles scaling action.
>
> The AutoScalerEventHandler interface handles loggable events.
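The separation described above can be sketched as two narrow interfaces that a deployment backend (Kubernetes, YARN, standalone) implements. The method names and signatures below are illustrative only, not the FLIP's actual interfaces:

```python
from abc import ABC, abstractmethod

class ScalingRealizer(ABC):
    """Applies a scaling decision to a job, e.g. via the Kubernetes
    operator, or via the rescale REST API for a generic autoscaler."""
    @abstractmethod
    def realize(self, job_context, parallelism_overrides: dict): ...

class AutoScalerEventHandler(ABC):
    """Handles loggable events emitted by the autoscaler."""
    @abstractmethod
    def handle_event(self, job_context, message: str): ...

class LoggingEventHandler(AutoScalerEventHandler):
    """Generic implementation that simply records/logs events."""
    def __init__(self):
        self.events = []

    def handle_event(self, job_context, message):
        # A real implementation would write to a logger or k8s Events.
        self.events.append((job_context, message))

handler = LoggingEventHandler()
handler.handle_event("job-1", "Scaling vertex v1 from 2 to 4")
```

The core autoscaler logic depends only on the two abstract types, so swapping the backend means supplying different implementations without touching the scaling algorithm.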
>
>
> Looking forward to your feedback, thanks!
>
> [1] https://cwiki.apache.org/confluence/x/x4qzDw
>
> Best,
> Rui
>
> On Thu, Aug 24, 2023 at 10:55 AM Matt Wang  wrote:
>>
>> Sorry for the late reply, I still have a small question here:
>> Does the interface function of handlerRecommendedParallelism
>> in AutoScalerEventHandler conflict with
>> handlerScalingFailure/handlerScalingReport (one of them
>> handles the event of scale failure, and the other handles
>> the event of scale success).
>>
>>
>>
>> --
>>
>> Best,
>> Matt Wang
>>
>>
>>  Replied Message 
>> | From | Rui Fan<1996fan...@gmail.com> |
>> | Date | 08/21/2023 17:41 |
>> | To |  |
>> | Cc | Maximilian Michels ,
>> Gyula Fóra ,
>> Matt Wang |
>> | Subject | Re: [DISCUSS] FLIP-334 : Decoupling autoscaler and kubernetes |
>> Hi Max, Gyula and Matt,
>>
>> Do you have any other comments?
>>
>> The flink-kubernetes-operator 1.6 has been released recently,
>> it's a good time to kick off this FLIP.
>>
>> Please let me know if you have any questions or concerns,
>> looking forward to your feedback, thanks!
>>
>> Best,
>> Rui
>>
>> On Wed, Aug 9, 2023 at 11:55 AM Rui Fan <1996fan...@gmail.com> wrote:
>>
>> Hi Matt Wang,
>>
>> Thanks for your discussion here.
>>
>> it is recommended to unify the descriptions of AutoScalerHandler
>> and AutoScalerEventHandler in the FLIP
>>
>> Good catch, I have updated all AutoScalerHandler to
>> AutoScalerEventHandler.
>>
>> Can it support the use of zookeeper (zookeeper is a relatively
>> common use of flink HA)?
>>
>> In my opinion, it's a good suggestion. However, I prefer that we
>> implement other state stores in a separate Flink JIRA, and that
>> this FLIP focus on the decoupling and on implementing the
>> necessary state store. Does that make sense?
>>
>> Regarding each piece of scaling information, can it be persisted in
>> a shared file system through the filesystem? I think supporting
>> viewing autoscaling info in the UI will be a valuable requirement
>> in the future, and this can provide some of the foundations in advance.
>>
>> This is a good suggestion as well. It's useful for users to check
>> the scaling information. I propose to add a CompositeEventHandler,
>> it can include multiple EventHandlers.
>>
>> However, as with the last question, I prefer we implement other
>> event handlers in a separate Flink JIRA. What do you think?
>>
>> A solution mentioned in FLIP is to initialize the
>> AutoScalerEventHandler object every time an event is
>> processed.
>>
>> No, the FLIP mentioned `The AutoScalerEventHandler  object is shared for
>> all flink jobs`,
>> So the AutoScalerEventHandler is only initialized once.
>>
>> And we call the AutoScalerEventHandler#handlerXXX
>> every time an event is processed.
>>
>> Best,
>> Rui
>>
>> On Tue, Aug 8, 2023 at 9:40 PM Matt Wang  wrote:
>>
>> Hi Rui
>>
>> Thanks for driving the FLIP.
>>
>> I agree with the point fo this FLIP. This FLIP first provides a
>> general function of Autoscaler in Flink repo, and there is no
>

Re: [DISCUSS] FLIP-361: Improve GC Metrics

2023-09-05 Thread Maximilian Michels
Hi Gyula,

+1 The proposed changes make sense and are in line with what is
available for other metrics, e.g. number of records processed.

-Max

On Tue, Sep 5, 2023 at 2:43 PM Gyula Fóra  wrote:
>
> Hi Devs,
>
> I would like to start a discussion on FLIP-361: Improve GC Metrics [1].
>
> The current Flink GC metrics [2] are not very useful for monitoring
> purposes as they require post processing logic that is also dependent on
> the current runtime environment.
>
> Problems:
>  - Total time is not very relevant for long running applications, only the
> rate of change (msPerSec)
>  - In most cases it's best to simply aggregate the time/count across the
> different GabrageCollectors, however the specific collectors are dependent
> on the current Java runtime
>
> We propose to improve the current situation by:
>  - Exposing rate metrics per GarbageCollector
>  - Exposing aggregated Total time/count/rate metrics
>
> These new metrics are all derived from the existing ones with minimal
> overhead.
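Deriving the proposed rate metrics from the existing cumulative counters amounts to differencing two samples and aggregating across collectors. A rough sketch of the arithmetic (collector names and the sampling interval here are illustrative, not the FLIP's final metric names):

```python
def gc_rates(prev, curr, interval_s):
    """Compute per-collector and total GC msPerSec from two samples
    of cumulative GC time, e.g. {"G1 Young": 1200, "G1 Old": 300} (ms)."""
    per_collector = {
        name: (curr[name] - prev.get(name, 0)) / interval_s
        for name in curr
    }
    total = sum(per_collector.values())
    return per_collector, total

prev = {"G1 Young": 1000, "G1 Old": 300}
curr = {"G1 Young": 1600, "G1 Old": 360}
rates, total = gc_rates(prev, curr, interval_s=60)
# rates == {"G1 Young": 10.0, "G1 Old": 1.0}; total == 11.0 msPerSec
```

Because the aggregation happens after differencing, dashboards no longer need to know which GarbageCollectors the current Java runtime actually uses.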
>
> Looking forward to your feedback.
>
> Cheers,
> Gyula
>
> [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-361%3A+Improve+GC+Metrics
> [2]
> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/metrics/#garbagecollection


[jira] [Created] (FLINK-32991) Some metrics from autoscaler never get registered

2023-08-29 Thread Maximilian Michels (Jira)
Maximilian Michels created FLINK-32991:
--

 Summary: Some metrics from autoscaler never get registered
 Key: FLINK-32991
 URL: https://issues.apache.org/jira/browse/FLINK-32991
 Project: Flink
  Issue Type: Bug
  Components: Autoscaler, Deployment / Kubernetes
Affects Versions: kubernetes-operator-1.6.0
Reporter: Maximilian Michels
Assignee: Maximilian Michels
 Fix For: kubernetes-operator-1.7.0


Not all metrics appear in the latest 1.6 release. This is because we report 
metrics as soon as they are available and the registration code assumes that 
they will all be available at once. In practice, some are only available after 
sampling data multiple times. For example, TARGET_DATA_RATE is only available 
after the source metrics have been aggregated and the lag has been computed.
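The fix the ticket implies — registering each metric the moment it first becomes available instead of snapshotting what exists at startup — can be contrasted with the buggy pattern in a small sketch (the metric names are hypothetical):

```python
def eager_register(available_at_start):
    """Buggy pattern: snapshot what is available once, at startup."""
    return set(available_at_start)

def lazy_register(registered, now_available):
    """Fixed pattern: add any metric the moment it first appears."""
    registered |= set(now_available)
    return registered

# At startup only LOAD exists; TARGET_DATA_RATE only appears a few
# collection cycles later, once source metrics have been aggregated.
eager = eager_register(["LOAD"])
lazy = lazy_register(set(), ["LOAD"])
lazy = lazy_register(lazy, ["LOAD", "TARGET_DATA_RATE"])

# eager == {"LOAD"}: TARGET_DATA_RATE is never registered
# lazy == {"LOAD", "TARGET_DATA_RATE"}
```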



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-32992) Recommended parallelism metric is a duplicate of Parallelism metric

2023-08-29 Thread Maximilian Michels (Jira)
Maximilian Michels created FLINK-32992:
--

 Summary: Recommended parallelism metric is a duplicate of 
Parallelism metric
 Key: FLINK-32992
 URL: https://issues.apache.org/jira/browse/FLINK-32992
 Project: Flink
  Issue Type: Bug
  Components: Autoscaler, Deployment / Kubernetes
Affects Versions: kubernetes-operator-1.6.0
Reporter: Maximilian Michels
Assignee: Maximilian Michels
 Fix For: kubernetes-operator-1.7.0


The two metrics are the same. Recommended parallelism seems to have been added 
as a way to report real-time parallelism updates before we changed all metrics 
to be reported in real time.





[jira] [Created] (FLINK-32960) Logic to log vertex exclusion only once does not work correctly

2023-08-25 Thread Maximilian Michels (Jira)
Maximilian Michels created FLINK-32960:
--

 Summary: Logic to log vertex exclusion only once does not work 
correctly
 Key: FLINK-32960
 URL: https://issues.apache.org/jira/browse/FLINK-32960
 Project: Flink
  Issue Type: Bug
  Components: Autoscaler, Kubernetes Operator
Affects Versions: kubernetes-operator-1.6.0
Reporter: Maximilian Michels
Assignee: Maximilian Michels
 Fix For: kubernetes-operator-1.7.0


As part of daeb4b6559f0d26b1b0f23be5e8230f895b0a03e we wanted to log vertex 
exclusion only once. This logic does not work because the vertices without busy 
time are excluded in memory for every run. So we print "No busyTimeMsPerSecond 
metric available for {}. No scaling will be performed for this vertex." for 
every run.
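The described bug reduces to a toy model: a "log once" guard only works if the set of already-logged vertices survives across runs. A sketch contrasting the broken per-run state with the intended persistent state (simplified, hypothetical names):

```python
def runs_with_per_run_state(runs, vertex):
    """Broken: the exclusion set is rebuilt each run, so the
    warning fires on every run."""
    warnings = 0
    for _ in range(runs):
        excluded = set()          # recreated from scratch -> never remembers
        if vertex not in excluded:
            warnings += 1
            excluded.add(vertex)
    return warnings

def runs_with_persistent_state(runs, vertex):
    """Intended: remember across runs which vertices were already logged."""
    warnings = 0
    excluded = set()              # survives across runs
    for _ in range(runs):
        if vertex not in excluded:
            warnings += 1
            excluded.add(vertex)
    return warnings

runs_with_per_run_state(5, "source-1")     # -> 5 warnings
runs_with_persistent_state(5, "source-1")  # -> 1 warning
```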





[jira] [Created] (FLINK-32959) Operator standalone mode throws NoClassDefFoundError

2023-08-25 Thread Maximilian Michels (Jira)
Maximilian Michels created FLINK-32959:
--

 Summary: Operator standalone mode throws NoClassDefFoundError
 Key: FLINK-32959
 URL: https://issues.apache.org/jira/browse/FLINK-32959
 Project: Flink
  Issue Type: Bug
  Components: Kubernetes Operator
Affects Versions: kubernetes-operator-1.6.0
Reporter: Maximilian Michels
 Fix For: kubernetes-operator-1.7.0


The standalone mode throws the following error:
{noformat}
Exception in thread "pool-7-thread-2" 
org.apache.flink.shaded.guava30.com.google.common.util.concurrent.ExecutionError:
 java.lang.NoClassDefFoundError: 
org/apache/flink/kubernetes/operator/standalone/StandaloneKubernetesConfigOptionsInternal
at 
org.apache.flink.shaded.guava30.com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2049)
at 
org.apache.flink.shaded.guava30.com.google.common.cache.LocalCache.get(LocalCache.java:3962)
at 
org.apache.flink.shaded.guava30.com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3985)
at 
org.apache.flink.shaded.guava30.com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4946)
at 
org.apache.flink.kubernetes.operator.config.FlinkConfigManager.getConfig(FlinkConfigManager.java:298)
at 
org.apache.flink.kubernetes.operator.config.FlinkConfigManager.getDeployConfig(FlinkConfigManager.java:219)
at 
org.apache.flink.kubernetes.operator.controller.FlinkDeploymentContext.getDeployConfig(FlinkDeploymentContext.java:48)
at 
org.apache.flink.kubernetes.operator.reconciler.deployment.AbstractFlinkResourceReconciler.reconcile(AbstractFlinkResourceReconciler.java:118)
at 
org.apache.flink.kubernetes.operator.controller.FlinkDeploymentController.reconcile(FlinkDeploymentController.java:136)
at 
org.apache.flink.kubernetes.operator.controller.FlinkDeploymentController.reconcile(FlinkDeploymentController.java:56)
at 
io.javaoperatorsdk.operator.processing.Controller$1.execute(Controller.java:138)
at 
io.javaoperatorsdk.operator.processing.Controller$1.execute(Controller.java:96)
at 
org.apache.flink.kubernetes.operator.metrics.OperatorJosdkMetrics.timeControllerExecution(OperatorJosdkMetrics.java:80)
at 
io.javaoperatorsdk.operator.processing.Controller.reconcile(Controller.java:95)
at 
io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.reconcileExecution(ReconciliationDispatcher.java:139)
at 
io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleReconcile(ReconciliationDispatcher.java:119)
at 
io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleDispatch(ReconciliationDispatcher.java:89)
at 
io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleExecution(ReconciliationDispatcher.java:62)
at 
io.javaoperatorsdk.operator.processing.event.EventProcessor$ReconcilerExecutor.run(EventProcessor.java:414)
at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.NoClassDefFoundError: 
org/apache/flink/kubernetes/operator/standalone/StandaloneKubernetesConfigOptionsInternal
at 
org.apache.flink.kubernetes.operator.config.FlinkConfigBuilder.applyTaskManagerSpec(FlinkConfigBuilder.java:279)
at 
org.apache.flink.kubernetes.operator.config.FlinkConfigBuilder.buildFrom(FlinkConfigBuilder.java:405)
at 
org.apache.flink.kubernetes.operator.config.FlinkConfigManager.generateConfig(FlinkConfigManager.java:305)
at 
org.apache.flink.kubernetes.operator.config.FlinkConfigManager$1.load(FlinkConfigManager.java:106)
at 
org.apache.flink.kubernetes.operator.config.FlinkConfigManager$1.load(FlinkConfigManager.java:103)
at 
org.apache.flink.shaded.guava30.com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3529)
at 
org.apache.flink.shaded.guava30.com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2278)
at 
org.apache.flink.shaded.guava30.com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2155)
at 
org.apache.flink.shaded.guava30.com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2045)
{noformat}





[jira] [Created] (FLINK-32903) Add a load simulation example pipeline for canary testing

2023-08-21 Thread Maximilian Michels (Jira)
Maximilian Michels created FLINK-32903:
--

 Summary: Add a load simulation example pipeline for canary testing
 Key: FLINK-32903
 URL: https://issues.apache.org/jira/browse/FLINK-32903
 Project: Flink
  Issue Type: New Feature
  Components: Autoscaler, Kubernetes Operator
Reporter: Maximilian Michels
Assignee: Maximilian Michels
 Fix For: kubernetes-operator-1.7.0


The currently provided example pipeline generates an unbounded amount of load, 
yielding only scale-up decisions. It would be desirable to have a pipeline 
which scales up and down in a periodic fashion. Users do not always have enough 
data available to do that. Hence, the load needs to be simulated.
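A load pattern that triggers both scale-ups and scale-downs can be produced by modulating the generator's target rate periodically, e.g. with a sine wave. A sketch of the idea (all parameter values are illustrative, not the example pipeline's actual settings):

```python
import math

def target_rate(t_s, base=1000.0, amplitude=800.0, period_s=3600.0):
    """Records/sec the load generator should emit at time t_s:
    oscillates between base-amplitude and base+amplitude once per period."""
    return base + amplitude * math.sin(2 * math.pi * t_s / period_s)

# Over one hour the simulated load rises to about 1800 rec/s and falls
# to about 200 rec/s, so the autoscaler exercises both directions.
peak = target_rate(900)     # quarter period: highest load
trough = target_rate(2700)  # three-quarter period: lowest load
```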





Re: [VOTE] Apache Flink Kubernetes Operator Release 1.6.0, release candidate #2

2023-08-14 Thread Maximilian Michels
+1 (binding)

1. Downloaded the archives, checksums, and signatures
2. Verified the signatures and checksums
3. Extract and inspect the source code for binaries
4. Compiled and tested the source code via mvn verify
5. Verified license files / headers
6. Deployed helm chart to test cluster
7. Ran example job
8. Tested autoscaling without rescaling API

-Max

On Mon, Aug 14, 2023 at 3:44 PM Márton Balassi  wrote:
>
> Thank you, team.
>
> +1 (binding)
>
> - Verified Helm repo works as expected, points to correct image tag, build,
> version
> - Verified basic examples + checked operator logs everything looks as
> expected
> - Verified hashes, signatures and source release contains no binaries
> - Ran built-in tests, built jars + docker image from source successfully
>
> Best,
> Marton
>
> On Mon, Aug 14, 2023 at 1:24 PM Rui Fan <1996fan...@gmail.com> wrote:
>
> > Thanks Gyula for the release!
> >
> > +1 (non-binding)
> >
> > - Compiled and tested the source code via mvn verify
> > - Verified the signatures
> > - Downloaded the image
> > - Deployed helm chart to test cluster
> > - Ran example job
> >
> > Best,
> > Rui
> >
> > On Mon, Aug 14, 2023 at 3:58 PM Gyula Fóra  wrote:
> >
> > > +1 (binding)
> > >
> > > Verified:
> > >  - Hashes, signatures, source files contain no binaries
> > >  - Maven repo contents look good
> > >  - Verified helm chart, image, deployed stateful and autoscaling
> > examples.
> > > Operator logs look good
> > >
> > > Cheers,
> > > Gyula
> > >
> > > On Thu, Aug 10, 2023 at 3:03 PM Gyula Fóra  wrote:
> > >
> > > > Hi Everyone,
> > > >
> > > > Please review and vote on the release candidate #2 for the
> > > > version 1.6.0 of Apache Flink Kubernetes Operator,
> > > > as follows:
> > > > [ ] +1, Approve the release
> > > > [ ] -1, Do not approve the release (please provide specific comments)
> > > >
> > > > **Release Overview**
> > > >
> > > > As an overview, the release consists of the following:
> > > > a) Kubernetes Operator canonical source distribution (including the
> > > > Dockerfile), to be deployed to the release repository at
> > dist.apache.org
> > > > b) Kubernetes Operator Helm Chart to be deployed to the release
> > > repository
> > > > at dist.apache.org
> > > > c) Maven artifacts to be deployed to the Maven Central Repository
> > > > d) Docker image to be pushed to dockerhub
> > > >
> > > > **Staging Areas to Review**
> > > >
> > > > The staging areas containing the above mentioned artifacts are as
> > > follows,
> > > > for your review:
> > > > * All artifacts for a,b) can be found in the corresponding dev
> > repository
> > > > at dist.apache.org [1]
> > > > * All artifacts for c) can be found at the Apache Nexus Repository [2]
> > > > * The docker image for d) is staged on github [3]
> > > >
> > > > All artifacts are signed with the key 21F06303B87DAFF1 [4]
> > > >
> > > > Other links for your review:
> > > > * JIRA release notes [5]
> > > > * source code tag "release-1.6.0-rc2" [6]
> > > > * PR to update the website Downloads page to
> > > > include Kubernetes Operator links [7]
> > > >
> > > > **Vote Duration**
> > > >
> > > > The voting time will run for at least 72 hours.
> > > > It is adopted by majority approval, with at least 3 PMC affirmative
> > > votes.
> > > >
> > > >
> > > > **Note on Verification**
> > > >
> > > > You can follow the basic verification guide here[8].
> > > > Note that you don't need to verify everything yourself, but please make
> > > > note of what you have tested together with your +1/-1 vote.
> > > >
> > > > Cheers!
> > > > Gyula Fora
> > > >
> > > > [1]
> > > >
> > >
> > https://dist.apache.org/repos/dist/dev/flink/flink-kubernetes-operator-1.6.0-rc2/
> > > > [2]
> > > >
> > https://repository.apache.org/content/repositories/orgapacheflink-1649/
> > > > [3] ghcr.io/apache/flink-kubernetes-operator:ebb1fed
> > > > [4] https://dist.apache.org/repos/dist/release/flink/KEYS
> > > > [5]
> > > >
> > >
> > https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12353230
> > > > [6]
> > > >
> > >
> > https://github.com/apache/flink-kubernetes-operator/tree/release-1.6.0-rc2
> > > > [7] https://github.com/apache/flink-web/pull/666
> > > > [8]
> > > >
> > >
> > https://cwiki.apache.org/confluence/display/FLINK/Verifying+a+Flink+Kubernetes+Operator+Release
> > > >
> > >
> >


Re: [VOTE] FLIP-322 Cooldown period for adaptive scheduler. Second vote.

2023-08-09 Thread Maximilian Michels
+1 (binding)

-Max

On Tue, Aug 8, 2023 at 10:56 AM Etienne Chauchot  wrote:
>
> Hi all,
>
> As part of Flink bylaws, binding votes for FLIP changes are active
> committer votes.
>
> Up to now, we have only 2 binding votes. Can one of the committers/PMC
> members vote on this FLIP ?
>
> Thanks
>
> Etienne
>
>
> Le 08/08/2023 à 10:19, Etienne Chauchot a écrit :
> >
> > Hi Joseph,
> >
> > Thanks for the detailed review!
> >
> > Best
> >
> > Etienne
> >
> > Le 14/07/2023 à 11:57, Prabhu Joseph a écrit :
> >> *+1 (non-binding)*
> >>
> >> Thanks for working on this. We have seen good improvement during the cool
> >> down period with this feature.
> >> Below are details on the test results from one of our clusters:
> >>
> >> On a scale-out operation, 8 new nodes were added one by one with a gap of
> >> ~30 seconds. There were 8 restarts within 4 minutes with the default
> >> behaviour,
> >> whereas only one with this feature (cooldown period of 4 minutes).
> >>
> >> The number of records processed by the job during the restart window is
> >> higher with this feature (2909764) than with the default behaviour
> >> (1323960), because multiple restarts mean the job spends most of its time
> >> recovering, and any work the tasks completed after the last successfully
> >> completed checkpoint is lost.
> >>
> >> Metric              | Default Adaptive Scheduler         | With 4-min Cooldown
> >> NumRecordsProcessed | 1323960                            | 2909764
> >> Job Parallelism     | 13->20->27->34->41->48->55->62->69 | 13->69
> >> NumRestarts         | 8                                  | 1
> >>
> >> Remarks:
> >> 1. NumRecordsProcessed indicates the difference the cooldown period
> >>    brings: when the job restarts repeatedly, the tasks spend most of
> >>    their time recovering, and the progress made since the last
> >>    checkpoint is lost on each restart.
> >> 2. There is only one restart with the cooldown period, which happened
> >>    when the 8th node was added back.
> >>
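The restart counts reported above can be reproduced with a small debounce-style simulation. This is a toy model for illustration only, not the actual adaptive-scheduler logic from FLIP-322: the first rescale event opens a cooldown window, further events inside the window are batched, and a single restart executes when the window closes.

```python
def count_restarts(event_times_s, cooldown_s):
    """Count restarts for a series of rescale events under a cooldown.

    Toy model: the first event opens a window of cooldown_s seconds,
    events inside the window are batched, and one restart fires when
    the window closes.
    """
    restarts = 0
    window_end = None
    for t in sorted(event_times_s):
        if window_end is not None and t >= window_end:
            restarts += 1                 # flush the previous window's batch
            window_end = None
        if window_end is None:
            window_end = t + cooldown_s   # open a new batching window
    if window_end is not None:
        restarts += 1                     # flush the final batch
    return restarts

# 8 nodes added one by one, roughly 30 seconds apart
events = [i * 30 for i in range(8)]
print(count_restarts(events, cooldown_s=0))    # no cooldown -> 8 restarts
print(count_restarts(events, cooldown_s=240))  # 4-minute cooldown -> 1 restart
```

With no cooldown, every event triggers its own restart; with a 4-minute window, all 8 events coalesce into one restart, matching the numbers reported in the test.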
> >> On Wed, Jul 12, 2023 at 8:03 PM Etienne Chauchot
> >> wrote:
> >>
> >>> Hi all,
> >>>
> >>> I'm going on vacation tonight for 3 weeks.
> >>>
> >>> Even though the vote is not finished, as the implementation is rather
> >>> quick and the design discussion had settled, I went ahead and implemented
> >>> FLIP-322 [1] so that people can take a look while I'm off.
> >>>
> >>> [1]https://github.com/apache/flink/pull/22985
> >>>
> >>> Best
> >>>
> >>> Etienne
> >>>
> >>> Le 12/07/2023 à 09:56, Etienne Chauchot a écrit :
>  Hi all,
> 
>  Would you mind casting your vote in this second vote thread (opened
>  after new discussions) so that the subject can move forward?
> 
>  @David, @Chesnay, @Robert: you took part in the discussions, could you
>  please send your vote?
> 
>  Thank you very much
> 
>  Best
> 
>  Etienne
> 
>  Le 06/07/2023 à 13:02, Etienne Chauchot a écrit :
> > Hi all,
> >
> > Thanks for your feedback about the FLIP-322: Cooldown period for
> > adaptive scheduler [1].
> >
> > This FLIP was discussed in [2].
> >
> > I'd like to start a vote for it. The vote will be open for at least 72
> > hours (until July 9th 15:00 GMT) unless there is an objection or
> > insufficient votes.
> >
> > [1]
> >
> >>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-322+Cooldown+period+for+adaptive+scheduler
> > [2]https://lists.apache.org/thread/qvgxzhbp9rhlsqrybxdy51h05zwxfns6
> >
> > Best,
> >
> > Etienne


[jira] [Created] (FLINK-32813) Split autoscaler ConfigMap

2023-08-08 Thread Maximilian Michels (Jira)
Maximilian Michels created FLINK-32813:
--

 Summary: Split autoscaler ConfigMap
 Key: FLINK-32813
 URL: https://issues.apache.org/jira/browse/FLINK-32813
 Project: Flink
  Issue Type: Improvement
  Components: Autoscaler, Kubernetes Operator
Reporter: Maximilian Michels


The autoscaler configmap stores the metrics of the current observation window, 
the past scaling decisions, and the parallelism overrides. We are already using 
compression and an eviction algorithm to avoid running over the 1MB limit.

We should think about breaking the ConfigMap apart and storing each of the 
three in its own ConfigMap.
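The compression-plus-eviction scheme the ticket mentions can be sketched as follows. This is an illustrative stand-in (the operator's actual implementation is Java and its serialization format differs): scaling history is serialized, gzipped, and the oldest entries are dropped until the payload fits under the ConfigMap size cap.

```python
import gzip
import json

K8S_OBJECT_LIMIT = 1024 * 1024  # ConfigMaps are capped at roughly 1 MiB


def compressed_size(entries):
    """Size of the history once JSON-serialized and gzipped."""
    return len(gzip.compress(json.dumps(entries).encode("utf-8")))


def evict_until_fits(history, limit=K8S_OBJECT_LIMIT):
    """Drop the oldest scaling decisions until the compressed payload fits."""
    history = list(history)  # don't mutate the caller's list
    while history and compressed_size(history) > limit:
        history.pop(0)       # evict the oldest entry first
    return history
```

Splitting metrics, decisions, and overrides into three ConfigMaps gives each category its own ~1 MiB budget, so one fast-growing category (typically the metric window) no longer evicts the others.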



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [ANNOUNCE] New Apache Flink PMC Member - Matthias Pohl

2023-08-08 Thread Maximilian Michels
Congrats, well done, and welcome to the PMC Matthias!

-Max

On Tue, Aug 8, 2023 at 8:36 AM yh z  wrote:
>
> Congratulations, Matthias!
>
> Best,
> Yunhong Zheng (Swuferhong)
>
> Ryan Skraba  于2023年8月7日周一 21:39写道:
>
> > Congratulations Matthias -- very well-deserved, the community is lucky to
> > have you <3
> >
> > All my best, Ryan
> >
> > On Mon, Aug 7, 2023 at 3:04 PM Lincoln Lee  wrote:
> >
> > > Congratulations!
> > >
> > > Best,
> > > Lincoln Lee
> > >
> > >
> > > Feifan Wang  于2023年8月7日周一 20:13写道:
> > >
> > > > Congrats Matthias!
> > > >
> > > >
> > > >
> > > > ——
> > > > Name: Feifan Wang
> > > > Email: zoltar9...@163.com
> > > >
> > > >
> > > >  Replied Message 
> > > > | From | Matthias Pohl |
> > > > | Date | 08/7/2023 16:16 |
> > > > | To |  |
> > > > | Subject | Re: [ANNOUNCE] New Apache Flink PMC Member - Matthias Pohl
> > |
> > > > Thanks everyone. :)
> > > >
> > > > On Mon, Aug 7, 2023 at 3:18 AM Andriy Redko  wrote:
> > > >
> > > > Congrats Matthias, well deserved!!
> > > >
> > > > DC> Congrats Matthias!
> > > >
> > > > DC> Very well deserved, thank you for your continuous, consistent
> > > > contributions.
> > > > DC> Welcome.
> > > >
> > > > DC> Thanks,
> > > > DC> Danny
> > > >
> > > > DC> On Fri, Aug 4, 2023 at 9:30 AM Feng Jin 
> > > wrote:
> > > >
> > > > Congratulations, Matthias!
> > > >
> > > > Best regards
> > > >
> > > > Feng
> > > >
> > > > On Fri, Aug 4, 2023 at 4:29 PM weijie guo 
> > > > wrote:
> > > >
> > > > Congratulations, Matthias!
> > > >
> > > > Best regards,
> > > >
> > > > Weijie
> > > >
> > > >
> > > > Wencong Liu  于2023年8月4日周五 15:50写道:
> > > >
> > > > Congratulations, Matthias!
> > > >
> > > > Best,
> > > > Wencong Liu
> > > >
> > > >
> > > >
> > > > At 2023-08-04 11:18:00, "Xintong Song" 
> > > > wrote:
> > > > Hi everyone,
> > > >
> > > > On behalf of the PMC, I'm very happy to announce that Matthias Pohl
> > > > has
> > > > joined the Flink PMC!
> > > >
> > > > Matthias has been consistently contributing to the project since
> > > > Sep
> > > > 2020,
> > > > and became a committer in Dec 2021. He mainly works in Flink's
> > > > distributed
> > > > coordination and high availability areas. He has worked on many
> > > > FLIPs
> > > > including FLIP195/270/285. He helped a lot with the release
> > > > management,
> > > > being one of the Flink 1.17 release managers and also very active
> > > > in
> > > > Flink
> > > > 1.18 / 2.0 efforts. He also contributed a lot to improving the
> > > > build
> > > > stability.
> > > >
> > > > Please join me in congratulating Matthias!
> > > >
> > > > Best,
> > > >
> > > > Xintong (on behalf of the Apache Flink PMC)
> > > >
> > >
> >


Re: [VOTE] Apache Flink Kubernetes Operator Release 1.6.0, release candidate #1

2023-08-07 Thread Maximilian Michels
Unfortunately, I've found an issue which might need to be fixed for
the release: https://issues.apache.org/jira/browse/FLINK-32774

-Max

On Wed, Aug 2, 2023 at 12:16 PM Gyula Fóra  wrote:
>
> +1 (binding)
>
> - Verified checksums and signatures, binaries, licenses
> - Tested release helm chart, docker image
> - Verified doc build, links
> - Ran basic stateful example, upgrade, savepoint. Checked logs, no errors
>
> Gyula
>
> On Mon, Jul 31, 2023 at 2:24 PM Maximilian Michels  wrote:
>
> > +1 (binding)
> >
> > 1. Downloaded the source and helm archives, checksums, and signatures
> > 2. Verified the signatures and checksums
> > 3. Extracted and inspected the source code for binaries
> > 4. Compiled and tested the source code via mvn verify
> > 5. Verified license files / headers
> > 6. Deployed helm chart to test cluster
> > 7. Ran example job
> >
> > -Max
> >
> > On Sun, Jul 30, 2023 at 5:49 PM Mate Czagany  wrote:
> > >
> > > +1 (non-binding)
> > >
> > > - Verified checksums and signatures
> > > - Found no binaries in source
> > > - Helm chart points to correct docker image
> > > - Installed via remote helm repo
> > > - Reactive example up- and down-scaled with- and without reactive mode
> > > - Autoscale with Kafka source
> > > - HA stateful deployment, with savepoint and restart
> > > - 1.18 in-place autoscale with adaptive scheduler
> > >
> > > Best regards,
> > > Mate
> > >
> > > Rui Fan <1996fan...@gmail.com> ezt írta (időpont: 2023. júl. 30., V,
> > 6:20):
> > >
> > > > +1 (non-binding)
> > > >
> > > > - Compiled and tested the source code via mvn verify
> > > > - Verified the signatures
> > > > - Downloaded the image : docker pull
> > > > ghcr.io/apache/flink-kubernetes-operator:e7045a6
> > > > - Deployed helm chart to test cluster
> > > > - Ran example job
> > > >
> > > > Best,
> > > > Rui Fan
> > > >
> > > > On Thu, Jul 27, 2023 at 10:53 PM Gyula Fóra  wrote:
> > > >
> > > > > Hi Everyone,
> > > > >
> > > > > Please review and vote on the release candidate #1 for the version
> > 1.6.0
> > > > of
> > > > > Apache Flink Kubernetes Operator,
> > > > > as follows:
> > > > > [ ] +1, Approve the release
> > > > > [ ] -1, Do not approve the release (please provide specific comments)
> > > > >
> > > > > **Release Overview**
> > > > >
> > > > > As an overview, the release consists of the following:
> > > > > a) Kubernetes Operator canonical source distribution (including the
> > > > > Dockerfile), to be deployed to the release repository at
> > dist.apache.org
> > > > > b) Kubernetes Operator Helm Chart to be deployed to the release
> > > > repository
> > > > > at dist.apache.org
> > > > > c) Maven artifacts to be deployed to the Maven Central Repository
> > > > > d) Docker image to be pushed to dockerhub
> > > > >
> > > > > **Staging Areas to Review**
> > > > >
> > > > > The staging areas containing the above mentioned artifacts are as
> > > > follows,
> > > > > for your review:
> > > > > * All artifacts for a,b) can be found in the corresponding dev
> > repository
> > > > > at dist.apache.org [1]
> > > > > * All artifacts for c) can be found at the Apache Nexus Repository
> > [2]
> > > > > * The docker image for d) is staged on github [3]
> > > > >
> > > > > All artifacts are signed with the key 21F06303B87DAFF1 [4]
> > > > >
> > > > > Other links for your review:
> > > > > * JIRA release notes [5]
> > > > > * source code tag "release-1.6.0-rc1" [6]
> > > > > * PR to update the website Downloads page to
> > > > > include Kubernetes Operator links [7]
> > > > >
> > > > > **Vote Duration**
> > > > >
> > > > > The voting time will run for at least 72 hours.
> > > > > It is adopted by majority approval, with at least 3 PMC affirmative
> > > > votes.
> > > > >
> > > > > **Note on Verification**
> > > > >
> > > > > You can follow the basic verification guide here[8].
> > > > > Note that you don't need to verify everything yourself, but please
> > make
> > > > > note of what you have tested together with your +- vote.
> > > > >
> > > > > Cheers!
> > > > > Gyula Fora
> > > > >
> > > > > [1]
> > > > >
> > > > >
> > > >
> > https://dist.apache.org/repos/dist/dev/flink/flink-kubernetes-operator-1.6.0-rc1/
> > > > > [2]
> > > > >
> > https://repository.apache.org/content/repositories/orgapacheflink-1646/
> > > > > [3] ghcr.io/apache/flink-kubernetes-operator:e7045a6
> > > > > [4] https://dist.apache.org/repos/dist/release/flink/KEYS
> > > > > [5]
> > > > >
> > > > >
> > > >
> > https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12353230
> > > > > [6]
> > > > >
> > > >
> > https://github.com/apache/flink-kubernetes-operator/tree/release-1.6.0-rc1
> > > > > [7] https://github.com/apache/flink-web/pull/666
> > > > > [8]
> > > > >
> > > > >
> > > >
> > https://cwiki.apache.org/confluence/display/FLINK/Verifying+a+Flink+Kubernetes+Operator+Release
> > > > >
> > > >
> >


[jira] [Created] (FLINK-32774) Reconciliation for autoscaling overrides gets stuck after cancel-with-savepoint

2023-08-07 Thread Maximilian Michels (Jira)
Maximilian Michels created FLINK-32774:
--

 Summary: Reconciliation for autoscaling overrides gets stuck after 
cancel-with-savepoint
 Key: FLINK-32774
 URL: https://issues.apache.org/jira/browse/FLINK-32774
 Project: Flink
  Issue Type: Bug
  Components: Autoscaler, Kubernetes Operator
Affects Versions: kubernetes-operator-1.6.0
Reporter: Maximilian Michels
Assignee: Maximilian Michels


Since https://issues.apache.org/jira/browse/FLINK-32589 the operator no longer 
relies on the Flink configuration to store the parallelism overrides. Instead, 
it stores them internally in the autoscaler ConfigMap. Upon scaling without the 
rescaling API, the spec is changed on the fly during reconciliation and the 
parallelism overrides are added.

Unfortunately, this leads to the cluster getting stuck with the job in 
FINISHED state after taking a savepoint for upgrade. The operator assumes that 
the new cluster was deployed successfully and goes into DEPLOYED state again.

Log flow (from oldest to newest):
 # Rescheduling new reconciliation immediately to execute scaling operation.
 # Upgrading/Restarting running job, suspending first...
 # Job is in running state, ready for upgrade with SAVEPOINT
 # Suspending existing deployment.
 # Suspending job with savepoint.
 # Job successfully suspended with savepoint
 # The resource is being upgraded
 # Pending upgrade is already deployed, updating status.
 # Observing JobManager deployment. Previous status: DEPLOYING
 # JobManager deployment port is ready, waiting for the Flink REST API...
 # DEPLOYED The resource is deployed/submitted to Kubernetes, but it’s not yet 
considered to be stable and might be rolled back in the future

It appears the issue might be in (8): 
[https://github.com/apache/flink-kubernetes-operator/blob/c09671c5c51277c266b8c45d493317d3be1324c0/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/observer/deployment/AbstractFlinkDeploymentObserver.java#L260]
 because the generation id hasn't been changed by the mere parallelism override 
change.
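The suspected failure mode can be modeled in a few lines. This is a simplified sketch, not the operator's code: Kubernetes bumps `metadata.generation` only when a spec is updated through the API server, so an in-memory override applied during reconciliation leaves the generation untouched and the observer's generation comparison yields a false positive.

```python
class FlinkDeployment:
    """Toy model: metadata.generation advances only on API-server updates."""

    def __init__(self):
        self.generation = 1
        self.spec = {"overrides": {}}

    def update_spec_via_api(self, spec):
        self.spec = spec
        self.generation += 1  # the API server increments the generation


def apply_overrides_in_memory(dep, overrides):
    # What happens for scalings without the rescaling API: the spec is
    # changed on the fly, so the stored generation does NOT advance.
    dep.spec = dict(dep.spec, overrides=overrides)


def looks_already_deployed(dep, last_reconciled_generation):
    # Simplified stand-in for the observer check referenced above: an
    # unchanged generation is taken to mean the pending upgrade is deployed.
    return dep.generation == last_reconciled_generation


dep = FlinkDeployment()
last_seen = dep.generation
apply_overrides_in_memory(dep, {"vertex-1": 4})
print(looks_already_deployed(dep, last_seen))  # True -> the false positive
```

After a genuine API-server spec update the generation advances and the check correctly reports that the upgrade still needs to be observed.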



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] FLIP-334 : Decoupling autoscaler and kubernetes

2023-08-01 Thread Maximilian Michels
Hi Rui,

Thanks for the proposal. I think it makes a lot of sense to decouple
the autoscaler from Kubernetes-related dependencies. A couple of notes
when I read the proposal:

1. You propose AutoScalerEventHandler, AutoScalerStateStore,
AutoScalerStateStoreFactory, and AutoScalerEventHandler.
AutoscalerStateStore is a generic key/value database (methods:
"get"/"put"/"delete"). I would propose to refine this interface and
make it less general purpose, e.g. add a method for persisting scaling
decisions as well as any metrics gathered for the current metric
window. For simplicity, I'd even go so far to remove the state store
entirely, but rather handle state in the AutoScalerEventHandler which
will receive all related scaling and metric collection events, and can
keep track of any state.

2. You propose to make the current autoscaler module
Kubernetes-agnostic by moving the Kubernetes parts into the main
operator module. I think that makes sense since the Kubernetes
implementation will continue to be tightly coupled with Kubernetes.
The goal of the separate module was to make the autoscaler logic
pluggable, but this will continue to be possible with the new
"flink-autoscaler" module which contains the autoscaling logic and
interfaces. In the long run, the autoscaling logic can move to a
separate repository, although this will complicate the release
process, so I would defer this unless there is strong interest.

3. The proposal mentions some removal of tests. It is critical for us
that all test coverage of the current implementation remains active.
It is ok if some of the test coverage only covers the Kubernetes
implementation. We can eventually move more tests without Kubernetes
significance into the implementation-agnostic autoscaler tests.

-Max
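The interface refinement suggested in point 1 could look roughly like the sketch below. All names here are illustrative, not the FLIP's final API: the point is replacing generic get/put/delete with purpose-specific methods for scaling decisions and metric windows, plus a trivial in-memory backend for tests or non-Kubernetes deployments.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class AutoScalerStateStore(ABC):
    """Purpose-specific state store (method names are illustrative)."""

    @abstractmethod
    def store_scaling_decision(self, job_key: str, decision: Dict) -> None: ...

    @abstractmethod
    def get_scaling_history(self, job_key: str) -> List[Dict]: ...

    @abstractmethod
    def store_metric_window(self, job_key: str, metrics: List[Dict]) -> None: ...

    @abstractmethod
    def get_metric_window(self, job_key: str) -> List[Dict]: ...


class InMemoryStateStore(AutoScalerStateStore):
    """Trivial backend, useful for tests or non-Kubernetes environments."""

    def __init__(self):
        self._decisions: Dict[str, List[Dict]] = {}
        self._metrics: Dict[str, List[Dict]] = {}

    def store_scaling_decision(self, job_key, decision):
        self._decisions.setdefault(job_key, []).append(decision)

    def get_scaling_history(self, job_key):
        return self._decisions.get(job_key, [])

    def store_metric_window(self, job_key, metrics):
        self._metrics[job_key] = metrics

    def get_metric_window(self, job_key):
        return self._metrics.get(job_key, [])
```

A Kubernetes backend would implement the same interface on top of ConfigMaps, keeping the autoscaler module itself free of Kubernetes dependencies.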

On Tue, Aug 1, 2023 at 9:46 AM Rui Fan  wrote:
>
> Hi all,
>
> I and Samrat(cc'ed) created the FLIP-334[1] to decoupling the autoscaler
> and kubernetes.
>
> Currently, the flink-autoscaler is tightly integrated with Kubernetes.
> There are compelling reasons to extend the use of flink-autoscaler to
> more types of Flink jobs:
> 1. With the recent merge of the Externalized Declarative Resource
> Management (FLIP-291[2]), in-place scaling is now supported
> across all types of Flink jobs. This development has made scaling Flink on
> YARN a straightforward process.
> 2. Several discussions[3] within the Flink user community, as observed on
> the mailing list, have emphasized the necessity of flink-autoscaler
> supporting Flink on YARN.
>
> Please refer to the FLIP[1] document for more details about the proposed
> design and implementation. We welcome any feedback and opinions on
> this proposal.
>
> [1] https://cwiki.apache.org/confluence/x/x4qzDw
> [2]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-291%3A+Externalized+Declarative+Resource+Management
> [3] https://lists.apache.org/thread/pr0r8hq8kqpzk3q1zrzkl3rp1lz24v7v


Re: [VOTE] Apache Flink Kubernetes Operator Release 1.6.0, release candidate #1

2023-07-31 Thread Maximilian Michels
+1 (binding)

1. Downloaded the source and helm archives, checksums, and signatures
2. Verified the signatures and checksums
3. Extracted and inspected the source code for binaries
4. Compiled and tested the source code via mvn verify
5. Verified license files / headers
6. Deployed helm chart to test cluster
7. Ran example job

-Max

On Sun, Jul 30, 2023 at 5:49 PM Mate Czagany  wrote:
>
> +1 (non-binding)
>
> - Verified checksums and signatures
> - Found no binaries in source
> - Helm chart points to correct docker image
> - Installed via remote helm repo
> - Reactive example up- and down-scaled with- and without reactive mode
> - Autoscale with Kafka source
> - HA stateful deployment, with savepoint and restart
> - 1.18 in-place autoscale with adaptive scheduler
>
> Best regards,
> Mate
>
> Rui Fan <1996fan...@gmail.com> ezt írta (időpont: 2023. júl. 30., V, 6:20):
>
> > +1 (non-binding)
> >
> > - Compiled and tested the source code via mvn verify
> > - Verified the signatures
> > - Downloaded the image : docker pull
> > ghcr.io/apache/flink-kubernetes-operator:e7045a6
> > - Deployed helm chart to test cluster
> > - Ran example job
> >
> > Best,
> > Rui Fan
> >
> > On Thu, Jul 27, 2023 at 10:53 PM Gyula Fóra  wrote:
> >
> > > Hi Everyone,
> > >
> > > Please review and vote on the release candidate #1 for the version 1.6.0
> > of
> > > Apache Flink Kubernetes Operator,
> > > as follows:
> > > [ ] +1, Approve the release
> > > [ ] -1, Do not approve the release (please provide specific comments)
> > >
> > > **Release Overview**
> > >
> > > As an overview, the release consists of the following:
> > > a) Kubernetes Operator canonical source distribution (including the
> > > Dockerfile), to be deployed to the release repository at dist.apache.org
> > > b) Kubernetes Operator Helm Chart to be deployed to the release
> > repository
> > > at dist.apache.org
> > > c) Maven artifacts to be deployed to the Maven Central Repository
> > > d) Docker image to be pushed to dockerhub
> > >
> > > **Staging Areas to Review**
> > >
> > > The staging areas containing the above mentioned artifacts are as
> > follows,
> > > for your review:
> > > * All artifacts for a,b) can be found in the corresponding dev repository
> > > at dist.apache.org [1]
> > > * All artifacts for c) can be found at the Apache Nexus Repository [2]
> > > * The docker image for d) is staged on github [3]
> > >
> > > All artifacts are signed with the key 21F06303B87DAFF1 [4]
> > >
> > > Other links for your review:
> > > * JIRA release notes [5]
> > > * source code tag "release-1.6.0-rc1" [6]
> > > * PR to update the website Downloads page to
> > > include Kubernetes Operator links [7]
> > >
> > > **Vote Duration**
> > >
> > > The voting time will run for at least 72 hours.
> > > It is adopted by majority approval, with at least 3 PMC affirmative
> > votes.
> > >
> > > **Note on Verification**
> > >
> > > You can follow the basic verification guide here[8].
> > > Note that you don't need to verify everything yourself, but please make
> > > note of what you have tested together with your +- vote.
> > >
> > > Cheers!
> > > Gyula Fora
> > >
> > > [1]
> > >
> > >
> > https://dist.apache.org/repos/dist/dev/flink/flink-kubernetes-operator-1.6.0-rc1/
> > > [2]
> > > https://repository.apache.org/content/repositories/orgapacheflink-1646/
> > > [3] ghcr.io/apache/flink-kubernetes-operator:e7045a6
> > > [4] https://dist.apache.org/repos/dist/release/flink/KEYS
> > > [5]
> > >
> > >
> > https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12353230
> > > [6]
> > >
> > https://github.com/apache/flink-kubernetes-operator/tree/release-1.6.0-rc1
> > > [7] https://github.com/apache/flink-web/pull/666
> > > [8]
> > >
> > >
> > https://cwiki.apache.org/confluence/display/FLINK/Verifying+a+Flink+Kubernetes+Operator+Release
> > >
> >


Re: [DISCUSS] Flink Kubernetes Operator cleanup procedure

2023-07-18 Thread Maximilian Michels
Hi Daren,

The behavior is consistent with the regular FlinkDeployment where the
cleanup will also cancel any running jobs. Are you intending to
recover jobs from another session cluster?

-Max

On Mon, Jul 17, 2023 at 4:48 PM Wong, Daren
 wrote:
>
> Hi devs,
>
> I would like to enquire about the cleanup procedure upon FlinkSessionJob 
> deletion. Currently, FlinkSessionJobController would trigger a cleanup in the 
> SessionJobReconciler which in turn cancels the Flink Job.
>
> Link to code: 
> https://github.com/apache/flink-kubernetes-operator/blob/371a2e6bbb8008c8ffccfff8fc338fb39bda19e2/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/reconciler/sessionjob/SessionJobReconciler.java#L110
>
> This makes sense to me, as we want to ensure the Flink Job is ended gracefully 
> when the FlinkSessionJob associated with it is deleted. Otherwise, the Flink 
> Job will be “leaked” in the Flink cluster without a FlinkSessionJob 
> associated with it for the Kubernetes Operator to control.
>
> That being said, I was wondering if we should consider scenarios where 
> users may not want FlinkSessionJob deletion to create a side effect such as 
> cancelJob? For example, the user may just want to delete the whole 
> namespace. One way of achieving this could be to put the job in SUSPENDED 
> state instead of cancelling it.
>
> I am opening this discussion thread to get feedback and input on whether this 
> alternative cleanup procedure is worth considering, and whether anyone else sees 
> any potential use cases/benefits/downsides with it.
>
> Thank you very much.
>
> Regards,
> Daren


Re: FLIP-401: Remove brackets around keys returned by MetricGroup#getAllVariables

2023-07-18 Thread Maximilian Michels
Hi Chesnay,

+1 Sounds good to me!

-Max

On Tue, Jul 18, 2023 at 10:59 AM Chesnay Schepler  wrote:
>
> MetricGroup#getAllVariables returns all variables associated with the
> metric, for example:
>
> |<host> = abcde|
> |<subtask_index> = 0|
>
> The keys are surrounded by brackets for no particular reason.
>
> In virtually every use-case for this method the user is stripping the
> brackets from keys, as done in:
>
>   * our datadog reporter:
> 
> https://github.com/apache/flink/blob/9c3c8afbd9325b5df8291bd831da2d9f8785b30a/flink-metrics/flink-metrics-datadog/src/main/java/org/apache/flink/metrics/datadog/DatadogHttpReporter.java#L244
> 
> 
>   * our prometheus reporter (implicitly via a character filter):
> 
> https://github.com/apache/flink/blob/9c3c8afbd9325b5df8291bd831da2d9f8785b30a/flink-metrics/flink-metrics-prometheus/src/main/java/org/apache/flink/metrics/prometheus/AbstractPrometheusReporter.java#L236
> 
> 
>   * our JMX reporter:
> 
> https://github.com/apache/flink/blob/9c3c8afbd9325b5df8291bd831da2d9f8785b30a/flink-metrics/flink-metrics-jmx/src/main/java/org/apache/flink/metrics/jmx/JMXReporter.java#L223
> 
> 
>
> I propose to change the method spec and implementation to remove the
> brackets around keys.
>
> For migration purposes it may make sense to add a new method with the
> new behavior (|getVariables()|) and deprecate the old method.
>
>
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=263425202
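The stripping that every reporter currently repeats can be sketched in a few lines. This mirrors what the datadog/JMX reporters effectively do today (and what the FLIP proposes to fold into the method itself); the variable names `<host>` and `<job_id>` are ordinary Flink metric variables used as examples.

```python
def strip_brackets(key: str) -> str:
    """Remove the surrounding angle brackets from a metric variable key."""
    if key.startswith("<") and key.endswith(">"):
        return key[1:-1]
    return key


# Example variables as returned by MetricGroup#getAllVariables today
variables = {"<host>": "localhost", "<job_id>": "abc123"}
cleaned = {strip_brackets(k): v for k, v in variables.items()}
print(cleaned)  # {'host': 'localhost', 'job_id': 'abc123'}
```

With the proposed change, every reporter's copy of this boilerplate disappears, at the cost of a one-time migration via the new method.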


Re: [DISCUSS][2.0] FLIP-340: Remove rescale REST endpoint

2023-07-18 Thread Maximilian Michels
+1

On Tue, Jul 18, 2023 at 12:29 PM Gyula Fóra  wrote:
>
> +1
>
> On Tue, 18 Jul 2023 at 12:12, Xintong Song  wrote:
>
> > +1
> >
> > Best,
> >
> > Xintong
> >
> >
> >
> > On Tue, Jul 18, 2023 at 4:25 PM Chesnay Schepler 
> > wrote:
> >
> > > The endpoint hasn't been working for years and was only kept to inform
> > > users about it. Let's finally remove it.
> > >
> > >
> > >
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-340%3A+Remove+rescale+REST+endpoint
> > >
> > >
> >


Re: [DISCUSS] FLIP 333 - Redesign Apache Flink website

2023-07-17 Thread Maximilian Michels
+1

On Mon, Jul 17, 2023 at 10:45 AM Chesnay Schepler  wrote:
>
> +1
>
> On 16/07/2023 08:10, Mohan, Deepthi wrote:
> > @Chesnay
> >
> > Thank you for your feedback.
> >
> > An important takeaway from the previous discussion [1] and your feedback 
> > was to keep the design and text/diagram changes separate as each change for 
> > text and diagrams likely require deeper discussion. Therefore, as a first 
> > step I am proposing only UX changes with minimal text changes for the pages 
> > mentioned in the FLIP.
> >
> > The feedback we received from customers cover both aesthetics and 
> > functional aspects of the website. Note that most feedback is focused only 
> > on the main Flink website [2].
> >
> > 1) New customers who are considering Flink have said about the website 
> > “there is a lot going on”, “looks too complicated”, “I am not sure *why* I 
> > should use this" and similar feedback. The proposed redesign in this FLIP 
> > helps partially address this category of feedback, but we may need to make 
> > the use cases and value proposition “pop” more than we have currently 
> > proposed in the redesign. I’d like to get the community’s thoughts on this.
> >
> > 2) On the look and feel of the website, I’ve already shared feedback prior 
> > that I am repeating here: “like a wiki page thrown together by developers.” 
> > Customers also point out other related Apache project websites: [3] and [4] 
> > as having “modern” user design. The proposed redesign in this FLIP will 
> > help address this feedback. Modernizing the look and feel of the website 
> > will appeal to customers who are used to what they encounter on other 
> > contemporary websites.
> >
> > 3) New and existing Flink developers have said “I am not sure what the 
> > diagram is supposed to depict” - referencing the main diagram on [2] and 
> > have said that the website lacks useful graphics and colors. Apart from 
> > removing the diagram on the main page [2], the current FLIP does not propose 
> > major changes to diagrams in the rest of the website; we can discuss them 
> > separately as they become available. I’d like to keep the FLIP focused only 
> > on the website redesign.
> >
> > Ultimately, to Chesnay’s point in the earlier discussion in [1], I do not 
> > want to boil the ocean with all the changes at once. In this FLIP, my 
> > proposal is to first work on the UX design as that gives us a good starting 
> > point. We can use it as a framework to make iterative changes and 
> > enhancements to diagrams and the actual website content incrementally.
> >
> > I’ve added a few more screenshots of additional pages to the FLIP that will 
> > give you a clearer picture of the proposed changes for the main page, What 
> > is Flink [Architecture, Applications, and Operations] pages.
> >
> > And finally, I am not proposing any tooling changes.
> >
> > [1] https://lists.apache.org/thread/c3pt00cf77lrtgt242p26lgp9l2z5yc8
> > [2]https://flink.apache.org/
> > [3] https://spark.apache.org/
> > [4] https://kafka.apache.org/
> >
> > On 7/13/23, 6:25 AM, "Chesnay Schepler"  > > wrote:
> >
> >
> >
> > On 13/07/2023 08:07, Mohan, Deepthi wrote:
> >> However, even these developers when explicitly asked in our conversations 
> >> often comment that the website could do with a redesign
> >
> > Can you go into more detail as to their specific concerns? Are there
> > functional problems with the page, or is this just a matter of "I don't
> > like the way it looks"?
> >
> >
> > What had they trouble with? Which information was
> > missing/unnecessary/too hard to find?
> >
> >
> > The FLIP states that "/we want to modernize the website so that new and
> > existing users can easily find information to understand what Flink is,
> > the primary use cases where Flink is useful, and clearly understand its
> > value proposition/."
> >
> >
> >  From the mock-ups I don't /really/ see how these stated goals are
> > achieved. It mostly looks like a fresh coat of paint, with a compressed
> > nav bar (which does reduce how much information and links we throw at
> > people at once (which isn't necessarily bad)).
> >
> >
> > Can you go into more detail w.r.t. to the proposed
> > text/presentation/diagram changes?
> >
> >
> > I assume you are not proposing any tooling changes?
> >
> >
> >
> >
> >
>


[jira] [Created] (FLINK-32589) Carry over parallelism overrides to prevent users from clearing them on updates

2023-07-13 Thread Maximilian Michels (Jira)
Maximilian Michels created FLINK-32589:
--

 Summary: Carry over parallelism overrides to prevent users from 
clearing them on updates
 Key: FLINK-32589
 URL: https://issues.apache.org/jira/browse/FLINK-32589
 Project: Flink
  Issue Type: Improvement
  Components: Autoscaler, Kubernetes Operator
Reporter: Maximilian Michels


The autoscaler currently sets the parallelism overrides via the Flink config 
{{pipeline.jobvertex-parallelism-overrides}}. Whenever the user posts spec 
updates, special care needs to be taken to carry over existing 
overrides. Otherwise the job will reset to the default parallelism 
configuration. Users shouldn't have to deal with this. Instead, whenever a new 
spec is posted which does not contain the overrides, the operator should 
automatically apply the last-used overrides (if autoscaling is enabled).
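The intended merge behavior can be sketched as below. This is a sketch of the ticket's intent, not the operator's actual Java code; the spec/`flinkConfiguration` shape and the function name are illustrative, while the config key is the one named in the ticket.

```python
OVERRIDES_KEY = "pipeline.jobvertex-parallelism-overrides"


def merge_spec(new_spec, last_overrides, autoscaler_enabled):
    """Carry over the last-applied parallelism overrides when an incoming
    spec update omits them, so the update doesn't reset job parallelism."""
    merged = dict(new_spec)
    conf = dict(merged.get("flinkConfiguration", {}))
    if autoscaler_enabled and OVERRIDES_KEY not in conf and last_overrides:
        conf[OVERRIDES_KEY] = last_overrides
        merged["flinkConfiguration"] = conf
    return merged


last = {"vertex-1": "4", "vertex-2": "8"}
user_spec = {"image": "flink:1.17",
             "flinkConfiguration": {"state.backend": "rocksdb"}}
merged = merge_spec(user_spec, last, autoscaler_enabled=True)
print(merged["flinkConfiguration"][OVERRIDES_KEY])  # {'vertex-1': '4', 'vertex-2': '8'}
```

If the user explicitly sets the key themselves, their value wins; with autoscaling disabled, the spec passes through unchanged.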



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-32170) Continue metric collection on intermittent job restarts

2023-05-23 Thread Maximilian Michels (Jira)
Maximilian Michels created FLINK-32170:
--

 Summary: Continue metric collection on intermittent job restarts
 Key: FLINK-32170
 URL: https://issues.apache.org/jira/browse/FLINK-32170
 Project: Flink
  Issue Type: Improvement
  Components: Autoscaler, Kubernetes Operator
Reporter: Maximilian Michels


If the underlying infrastructure is not stable, e.g. Kubernetes pod eviction, 
the jobs will sometimes restart. This will restart the metric collection 
process for the autoscaler and discard any existing metrics. If the 
interruption time is short, e.g. less than one minute, we could consider 
resuming metric collection after the job goes back into RUNNING state.
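The proposed keep-or-discard decision can be sketched as a pure function. The one-minute cutoff comes from the ticket text; everything else (names, window representation) is illustrative, not the actual autoscaler implementation.

```python
from datetime import datetime, timedelta

RESUME_THRESHOLD = timedelta(minutes=1)  # cutoff suggested in the ticket


def metric_window_after_restart(window, stopped_at, running_at,
                                threshold=RESUME_THRESHOLD):
    """Keep a partially collected metric window across a short job restart,
    discard it after a longer outage."""
    if running_at - stopped_at <= threshold:
        return window  # short blip: resume collection where we left off
    return []          # long outage: the window is considered stale


window = [{"t": 1, "load": 0.7}, {"t": 2, "load": 0.9}]
t0 = datetime(2023, 5, 23, 12, 0, 0)
print(metric_window_after_restart(window, t0, t0 + timedelta(seconds=40)))  # kept
print(metric_window_after_restart(window, t0, t0 + timedelta(minutes=5)))   # []
```

Resuming after short interruptions avoids repeatedly throwing away an almost-complete observation window on unstable infrastructure.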



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [ANNOUNCE] Apache Flink Kubernetes Operator 1.5.0 released

2023-05-23 Thread Maximilian Michels
Niceee. Thanks for managing the release, Gyula!

-Max

On Wed, May 17, 2023 at 8:25 PM Márton Balassi  wrote:
>
> Thanks, awesome! :-)
>
> On Wed, May 17, 2023 at 2:24 PM Gyula Fóra  wrote:
>>
>> The Apache Flink community is very happy to announce the release of Apache 
>> Flink Kubernetes Operator 1.5.0.
>>
>> The Flink Kubernetes Operator allows users to manage their Apache Flink 
>> applications and their lifecycle through native k8s tooling like kubectl.
>>
>> Release highlights:
>>  - Autoscaler improvements
>>  - Operator stability, observability improvements
>>
>> Release blogpost:
>> https://flink.apache.org/2023/05/17/apache-flink-kubernetes-operator-1.5.0-release-announcement/
>>
>> The release is available for download at: 
>> https://flink.apache.org/downloads.html
>>
>> Maven artifacts for Flink Kubernetes Operator can be found at: 
>> https://search.maven.org/artifact/org.apache.flink/flink-kubernetes-operator
>>
>> Official Docker image for Flink Kubernetes Operator applications can be 
>> found at: https://hub.docker.com/r/apache/flink-kubernetes-operator
>>
>> The full release notes are available in Jira:
>> https://issues.apache.org/jira/projects/FLINK/versions/12352931
>>
>> We would like to thank all contributors of the Apache Flink community who 
>> made this release possible!
>>
>> Regards,
>> Gyula Fora


[jira] [Created] (FLINK-32120) Add autoscaler config option to disable parallelism key group alignment

2023-05-17 Thread Maximilian Michels (Jira)
Maximilian Michels created FLINK-32120:
--

 Summary: Add autoscaler config option to disable parallelism key 
group alignment
 Key: FLINK-32120
 URL: https://issues.apache.org/jira/browse/FLINK-32120
 Project: Flink
  Issue Type: Improvement
  Components: Autoscaler, Kubernetes Operator
Reporter: Maximilian Michels
Assignee: Maximilian Michels
 Fix For: kubernetes-operator-1.6.0, kubernetes-operator-1.5.1


After choosing the target parallelism for a vertex, we choose a higher 
parallelism if that parallelism spreads the key groups evenly across subtasks. 
The number of key groups is derived from the max parallelism.

The amount of actual skew we would introduce if we did not do the alignment 
would usually be pretty low. In fact, the data itself can have an uneven load 
distribution across the keys (hot keys). In this case, the key group alignment 
is not effective.

For experiments, we should allow disabling the key group alignment via a 
configuration option.
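
A toy sketch of the alignment logic with the proposed opt-out flag. The config key and names below are hypothetical; the actual option name would be chosen during implementation:

```java
import java.util.Map;

public class AlignmentConfig {
    // Hypothetical option key, for illustration only.
    static final String KEY = "autoscaler.key-group-alignment.enabled";

    static int chooseParallelism(int target, int maxParallelism, Map<String, String> conf) {
        boolean align = Boolean.parseBoolean(conf.getOrDefault(KEY, "true"));
        if (!align) {
            return target; // alignment disabled for experiments
        }
        // Search upwards for a parallelism that evenly divides the key groups
        // (the number of key groups equals the max parallelism).
        for (int p = target; p <= maxParallelism; p++) {
            if (maxParallelism % p == 0) {
                return p;
            }
        }
        return target;
    }
}
```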





[jira] [Created] (FLINK-32119) Disable key group alignment for determining source parallelism

2023-05-17 Thread Maximilian Michels (Jira)
Maximilian Michels created FLINK-32119:
--

 Summary: Disable key group alignment for determining source 
parallelism
 Key: FLINK-32119
 URL: https://issues.apache.org/jira/browse/FLINK-32119
 Project: Flink
  Issue Type: Bug
  Components: Autoscaler, Kubernetes Operator
Reporter: Maximilian Michels
Assignee: Maximilian Michels
 Fix For: kubernetes-operator-1.6.0, kubernetes-operator-1.5.1


After choosing the target parallelism for a vertex, we choose a higher 
parallelism if that parallelism spreads the key groups evenly across subtasks.

For one, sources don't have keyed state, so this adjustment does not make sense.

For another, we internally limit the max parallelism of sources to the number 
of partitions discovered. This makes the adjustment even less meaningful.





Re: [VOTE] Apache Flink Kubernetes Operator Release 1.5.0, release candidate #2

2023-05-16 Thread Maximilian Michels
+1 (binding)

1. Downloaded the archives, checksums, and signatures
2. Verified the signatures and checksums
3. Extracted and inspected the source code for binaries
4. Verified license files / headers
5. Compiled and tested the source code via mvn verify
6. Deployed helm chart to test cluster
7. Ran example job

-Max

On Tue, May 16, 2023 at 3:10 AM Jim Busche  wrote:
>
> +1 (Non-binding)
> I tested the following:
>
> - helm repo install from flink-kubernetes-operator-1.5.0-helm.tgz (See note 1 
> below)
> - podman Dockerfile build from source, looked good. (See note 2 below)
> - twistlock vulnerability scans of proposed 
> ghcr.io/apache/flink-kubernetes-operator:be07be7 looks good, except for known 
> Snake item.
> - UI, basic sample, basic session jobs look good. Logs look as expected.
> - Checksums looked good
> - Tested OLM build/install on OpenShift 4.10.54 and OpenShift 4.12.7
>
> Note 1: To install on OpenShift, I had to add an extra flink-operator 
> clusterrole resource.  See 
> https://github.com/apache/flink-kubernetes-operator/pull/600 and issue 
> https://issues.apache.org/jira/browse/FLINK-32103
>
> Note 2: For some reason, I can't use podman on Red Hat 8 to build Flink, but 
> the Podman from Red Hat 9.0 worked fine.
>
>
> Thanks, Jim


[jira] [Created] (FLINK-32102) Aggregate multiple pendingRecords metrics per source if present

2023-05-15 Thread Maximilian Michels (Jira)
Maximilian Michels created FLINK-32102:
--

 Summary: Aggregate multiple pendingRecords metrics per source if 
present
 Key: FLINK-32102
 URL: https://issues.apache.org/jira/browse/FLINK-32102
 Project: Flink
  Issue Type: Improvement
  Components: Autoscaler, Kubernetes Operator
Affects Versions: kubernetes-operator-1.4.0
Reporter: Maximilian Michels
Assignee: Maximilian Michels
 Fix For: kubernetes-operator-1.6.0


Some sources expose multiple {{.pendingRecords}} metrics. If that is the case, 
we must sum these metrics to yield the correct total pending records 
count.
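
A minimal sketch of the aggregation, assuming the source's metrics arrive as a flat name-to-value map; the metric names used here are illustrative, not actual Flink metric identifiers:

```java
import java.util.Map;

public class PendingRecords {
    /** Sum all metrics whose name ends with ".pendingRecords" for one source vertex. */
    static long totalPending(Map<String, Long> metrics) {
        return metrics.entrySet().stream()
                .filter(e -> e.getKey().endsWith(".pendingRecords"))
                .mapToLong(Map.Entry::getValue)
                .sum();
    }
}
```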





[jira] [Created] (FLINK-32100) Max parallelism is incorrectly calculated with multiple topics

2023-05-15 Thread Maximilian Michels (Jira)
Maximilian Michels created FLINK-32100:
--

 Summary: Max parallelism is incorrectly calculated with multiple 
topics
 Key: FLINK-32100
 URL: https://issues.apache.org/jira/browse/FLINK-32100
 Project: Flink
  Issue Type: Bug
  Components: Autoscaler, Kubernetes Operator
Affects Versions: kubernetes-operator-1.4.0
Reporter: Maximilian Michels
Assignee: Maximilian Michels
 Fix For: kubernetes-operator-1.5.0


We currently take the maximum number of partitions we can find. However, the 
correct way to calculate the max parallelism is to sum the number of partitions 
across all topics.
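
The bug and the fix side by side, as a minimal sketch (hypothetical helper names, not the autoscaler's actual code):

```java
public class SourceMaxParallelism {
    /** Buggy variant: the largest partition count of any single topic. */
    static int maxOfTopics(int[] partitionsPerTopic) {
        int max = 0;
        for (int p : partitionsPerTopic) {
            max = Math.max(max, p);
        }
        return max;
    }

    /** Fixed variant: the total partition count across all topics. */
    static int sumOfTopics(int[] partitionsPerTopic) {
        int sum = 0;
        for (int p : partitionsPerTopic) {
            sum += p;
        }
        return sum;
    }
}
```

For a source reading two topics with 4 and 6 partitions, the buggy variant yields 6 while the source can actually make use of 10 parallel readers.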





[jira] [Created] (FLINK-32061) Resource metric groups are not cleaned up on removal

2023-05-11 Thread Maximilian Michels (Jira)
Maximilian Michels created FLINK-32061:
--

 Summary: Resource metric groups are not cleaned up on removal
 Key: FLINK-32061
 URL: https://issues.apache.org/jira/browse/FLINK-32061
 Project: Flink
  Issue Type: Bug
  Components: Autoscaler, Kubernetes Operator
Reporter: Maximilian Michels
Assignee: Maximilian Michels
 Fix For: kubernetes-operator-1.5.0


Not cleaning up these metric groups leaks memory.





[jira] [Created] (FLINK-32005) Add a per-deployment error metric to signal about potential issues

2023-05-04 Thread Maximilian Michels (Jira)
Maximilian Michels created FLINK-32005:
--

 Summary: Add a per-deployment error metric to signal about 
potential issues
 Key: FLINK-32005
 URL: https://issues.apache.org/jira/browse/FLINK-32005
 Project: Flink
  Issue Type: Improvement
  Components: Autoscaler, Kubernetes Operator
Reporter: Maximilian Michels
Assignee: Maximilian Michels
 Fix For: kubernetes-operator-1.5.0


If any of the autoscaled deployments produces errors, they are only visible in 
the logs or in the k8s events. Additionally, it would be good to have metrics 
to detect potential issues.
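
A sketch of what a per-deployment error counter could look like; the class is a stand-in for wiring into the operator's metric registry, which is not shown here:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

public class DeploymentErrorMetrics {
    private final Map<String, LongAdder> errorCounters = new ConcurrentHashMap<>();

    /** Increment the per-deployment error counter; a metric reporter would expose it. */
    void recordError(String deployment) {
        errorCounters.computeIfAbsent(deployment, d -> new LongAdder()).increment();
    }

    long errorCount(String deployment) {
        LongAdder counter = errorCounters.get(deployment);
        return counter == null ? 0 : counter.sum();
    }
}
```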





[jira] [Created] (FLINK-32002) Revisit autoscaler defaults for next release

2023-05-04 Thread Maximilian Michels (Jira)
Maximilian Michels created FLINK-32002:
--

 Summary: Revisit autoscaler defaults for next release
 Key: FLINK-32002
 URL: https://issues.apache.org/jira/browse/FLINK-32002
 Project: Flink
  Issue Type: Improvement
  Components: Autoscaler, Kubernetes Operator
Reporter: Maximilian Michels
Assignee: Maximilian Michels
 Fix For: kubernetes-operator-1.5.0


We haven't put much thought into the defaults. We should revisit and adjust the 
defaults to fit most use cases without being overly aggressive.





Re: [DISCUSS] Planning Flink 2.0

2023-04-26 Thread Maximilian Michels
Thanks for starting the discussion, Jark and Xingtong!

Flink 2.0 is long overdue. In the past, the expectations for such a
release were unreasonably high. I think everybody had a different
understanding of what exactly the criteria were. This led to releasing
18 minor releases for the current major version.

What I'm most excited about for Flink 2.0 is removal of baggage that
Flink has accumulated over the years:

- Removal of Scala, deprecated interfaces, unmaintained libraries and
APIs (DataSet)
- Consolidation of configuration
- Merging of multiple scheduler implementations
- Ability to freely combine batch / streaming tasks in the runtime

When I look at 
https://docs.google.com/document/d/1_PMGl5RuDQGlV99_gL3y7OiRsF0DgCk91Coua6hFXhE/edit
, I'm a bit skeptical we will even be able to reach all these goals. I
think we have to prioritize and try to establish a deadline. Otherwise
we will end up never releasing 2.0.

+1 on Flink 2.0 by May 2024 (not a hard deadline but I think having a
deadline helps).

-Max


On Wed, Apr 26, 2023 at 10:08 AM Chesnay Schepler  wrote:
>
>  > /Instead of defining compatibility guarantees as "this API won't
> change in all 1.x/2.x series", what if we define it as "this API won't
> change in the next 2/3 years"./
>
> I can see some benefits to this approach (all APIs having a fixed
> minimum lifetime) but it's just gonna be difficult to communicate.
> Technically this implies that every minor release may contain breaking
> changes, which is exactly what users don't want.
>
> What problems to do you see in creating major releases every N years?
>
>  > /IIUC, the milestone releases are a breakdown of the 2.0 release,
> while we are free to introduce breaking changes between them. And you
> suggest using longer-living feature branches to keep the master branch
> in a releasable state (in terms of milestone releases). Am I
> understanding it correctly?/
>
> I think you got the general idea. There are a lot of details to be
> ironed out though (e.g., do we release connectors for each milestone?...).
>
> Conflicts in the long-lived branches are certainly a concern, but I
> think those will be inevitable. Right now I'm not _too_ worried about
> them, at least based on my personal wish-list.
> Maybe the milestones could even help with that, as we could preemptively
> decide on an order for certain changes that have a high chance of
> conflicting with each other?
> I guess we could do that anyway.
> Maybe we should explicitly evaluate how invasive a change is (in
> relation to other breaking changes!) and manage things accordingly
>
>
> Other thoughts:
>
> We need to figure out what this release means for connectors
> compatibility-wise. The current rules for which versions a connector
> must support don't cover major releases at all.
> (This depends a bit on the scope of 2.0; if we add binary compatibility
> to Public APIs and promote a few Evolving ones then compatibility across
> minor releases becomes trivial)
>
> What process are you thinking of for deciding what breaking changes to
> make? The obvious choice would be FLIPs, but I'm worried that this will
> overload the mailing list / wiki for lots of tiny changes.
>
> Provided that we agree on doing 2.0, when would we cut the 2.0 branch?
> Would we wait a few months for people to prepare/agree on changes so we
> reduce the time we need to merge things into 2 branches?
>
> On 26/04/2023 05:51, Xintong Song wrote:
> > Thanks all for the positive feedback.
> >
> > @Martijn
> >
> > If we want to have that roadmap, should we consolidate this into a
> >> dedicated Confluence page over storing it in a Google doc?
> >>
> > Having a dedicated wiki page is definitely a good way for the roadmap
> > discussion. I haven't created one yet because it's still a proposal to have
> > such roadmap discussion. If the community agrees with our proposal, the
> > release manager team can decide how they want to drive and track the
> > roadmap discussion.
> >
> > @Chesnay
> >
> > We should discuss how regularly we will ship major releases from now on.
> >> Let's avoid again making breaking changes because we "gotta do it now
> >> because 3.0 isn't happening anytime soon". (e.g., every 2 years or
> >> something)
> >
> > I'm not entirely sure about shipping major releases regularly. But I do
> > agree that we may want to avoid the situation that "breaking changes can
> > only happen now, or no idea when". Instead of defining compatibility
> > guarantees as "this API won't change in all 1.x/2.x series", what if we
> > define it as "this API won't change in the next 2/3 years". That should
> > allow us to incrementally iterate the APIs.
> >
> > E.g., in 2.a, all APIs annotated as `@Stable` will be guaranteed compatible
> > until 2 years after 2.a is shipped, and in 2.b if the API is still
> > annotated `@Stable` it extends the compatibility guarantee to 2 years after
> > 2.b is shipped. To remove an API, we would need to mark it as `@Deprecated`
> > and w

Re: [DISCUSS] Preventing Mockito usage for the new code with Checkstyle

2023-04-26 Thread Maximilian Michels
If we ban Mockito imports, I can still write tests using the full
qualifiers, right?

For example:

org.mockito.Mockito.when(somethingThatShouldHappen).thenReturn(somethingThatNeverActuallyHappens)

Just kidding, +1 on the proposal.

-Max

On Wed, Apr 26, 2023 at 9:02 AM Panagiotis Garefalakis
 wrote:
>
> Thanks for bringing this up!  +1 for the proposal
>
> @Jing Ge -- we don't necessarily need to completely migrate to Junit5 (even
> though it would be ideal).
> We could introduce the checkstyle rule and add suppressions for the
> existing problematic paths (as we do today for other rules e.g.,
> AvoidStarImport)
>
> Cheers,
> Panagiotis
>
> On Tue, Apr 25, 2023 at 11:48 PM Weihua Hu  wrote:
>
> > Thanks for driving this.
> >
> > +1 for Mockito and Junit4.
> >
> > A clarity checkstyle will be of great help to new developers.
> >
> > Best,
> > Weihua
> >
> >
> > On Wed, Apr 26, 2023 at 1:47 PM Jing Ge 
> > wrote:
> >
> > > This is a great idea, thanks for bringing this up. +1
> > >
> > > Also +1 for Junit4. If I am not mistaken, it could only be done after the
> > > Junit5 migration is done.
> > >
> > > @Chesnay thanks for the hint. Do we have any doc about it? If not, it
> > might
> > > deserve one. WDYT?
> > >
> > > Best regards,
> > > Jing
> > >
> > > On Wed, Apr 26, 2023 at 5:13 AM Lijie Wang 
> > > wrote:
> > >
> > > > Thanks for driving this. +1 for the proposal.
> > > >
> > > > Can we also prevent Junit4 usage in new code by this way?Because
> > > currently
> > > > we are aiming to migrate our codebase to JUnit 5.
> > > >
> > > > Best,
> > > > Lijie
> > > >
> > > > Piotr Nowojski  于2023年4月25日周二 23:02写道:
> > > >
> > > > > Ok, thanks for the clarification.
> > > > >
> > > > > Piotrek
> > > > >
> > > > > wt., 25 kwi 2023 o 16:38 Chesnay Schepler 
> > > > napisał(a):
> > > > >
> > > > > > The checkstyle rule would just ban certain imports.
> > > > > > We'd add exclusions for all existing usages as we did when
> > > introducing
> > > > > > other rules.
> > > > > > So far we usually disabled checkstyle rules for a specific files.
> > > > > >
> > > > > > On 25/04/2023 16:34, Piotr Nowojski wrote:
> > > > > > > +1 to the idea.
> > > > > > >
> > > > > > > How would this checkstyle rule work? Are you suggesting to start
> > > > with a
> > > > > > > number of exclusions? On what level will those exclusions be? Per
> > > > file?
> > > > > > Per
> > > > > > > line?
> > > > > > >
> > > > > > > Best,
> > > > > > > Piotrek
> > > > > > >
> > > > > > > wt., 25 kwi 2023 o 13:18 David Morávek 
> > > napisał(a):
> > > > > > >
> > > > > > >> Hi Everyone,
> > > > > > >>
> > > > > > >> A long time ago, the community decided not to use Mockito-based
> > > > tests
> > > > > > >> because those are hard to maintain. This is already baked in our
> > > > Code
> > > > > > Style
> > > > > > >> and Quality Guide [1].
> > > > > > >>
> > > > > > >> Because we still have Mockito imported into the code base, it's
> > > very
> > > > > > easy
> > > > > > >> for newcomers to unconsciously introduce new tests violating the
> > > > code
> > > > > > style
> > > > > > >> because they're unaware of the decision.
> > > > > > >>
> > > > > > >> I propose to prevent Mockito usage with a Checkstyle rule for a
> > > new
> > > > > > code,
> > > > > > >> which would eventually allow us to eliminate it. This could also
> > > > > prevent
> > > > > > >> some wasted work and unnecessary feedback cycles during reviews.
> > > > > > >>
> > > > > > >> WDYT?
> > > > > > >>
> > > > > > >> [1]
> > > > > > >>
> > > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> > https://flink.apache.org/how-to-contribute/code-style-and-quality-common/#avoid-mockito---use-reusable-test-implementations
> > > > > > >>
> > > > > > >> Best,
> > > > > > >> D.
> > > > > > >>
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >


Re: [DISCUSS] FLINK-31873: Add setMaxParallelism to the DataStreamSink Class

2023-04-25 Thread Maximilian Michels
+1

On Tue, Apr 25, 2023 at 5:24 PM David Morávek  wrote:
>
> Hi Eric,
>
> this sounds reasonable, there are definitely cases where you need to limit
> sink parallelism for example not to overload the storage or limit the
> number of output files
>
> +1
>
> Best,
> D.
>
> On Sun, Apr 23, 2023 at 1:09 PM Weihua Hu  wrote:
>
> > Hi, Eric
> >
> > Thanks for bringing this discussion.
> > I think it's reasonable to add ''setMaxParallelism" for DataStreamSink.
> >
> > +1
> >
> > Best,
> > Weihua
> >
> >
> > On Sat, Apr 22, 2023 at 3:20 AM eric xiao  wrote:
> >
> > > Hi there devs,
> > >
> > > I would like to start a discussion thread for FLINK-31873[1].
> > >
> > > We are in the processing of enabling Flink reactive mode as the default
> > > scheduling mode. While reading configuration docs [2] (I believe it was
> > > also mentioned during one of the training sessions during Flink Forward
> > > 2022), one can/should replace all setParallelism calls with
> > > setMaxParallelism when migrating to reactive mode.
> > >
> > > This currently isn't possible on a sink in a Flink pipeline as we do not
> > > expose a setMaxParallelism on the DataStreamSink class [3]. The
> > underlying
> > > Transformation class does have both a setMaxParallelism and
> > setParallelism
> > > function defined [4], but only setParallelism is offered in the
> > > DataStreamSink class.
> > >
> > > I believe adding setMaxParallelism would be beneficial for not just flink
> > > reactive mode, both modes of running of a flink pipeline (non reactive
> > > mode, flink auto scaling).
> > >
> > > Best,
> > >
> > > Eric Xiao
> > >
> > > [1] https://issues.apache.org/jira/browse/FLINK-31873
> > > [2]
> > >
> > >
> > https://nightlies.apache.org/flink/flink-docs-release-1.17/docs/deployment/elastic_scaling/#configuration
> > > [3]
> > >
> > >
> > https://github.com/apache/flink/blob/master/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/datastream/DataStreamSink.java
> > > [4]
> > >
> > >
> > https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/api/dag/Transformation.java#L248-L285
> > >
> >
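
To illustrate the proposal, here is a toy sketch of the delegation the new method would perform. The `Transformation` and `DataStreamSink` classes below are minimal stand-ins, not Flink's real classes; the point is only that `setMaxParallelism` would forward to the underlying transformation, mirroring how `setParallelism` already works:

```java
public class MaxParallelismSketch {
    /** Minimal stand-in for org.apache.flink.api.dag.Transformation. */
    static class Transformation {
        private int maxParallelism = -1;

        void setMaxParallelism(int maxParallelism) {
            this.maxParallelism = maxParallelism;
        }

        int getMaxParallelism() {
            return maxParallelism;
        }
    }

    /** Minimal stand-in for DataStreamSink, with the proposed method added. */
    static class DataStreamSink {
        final Transformation transformation = new Transformation();

        /** Proposed: delegate to the underlying transformation. */
        DataStreamSink setMaxParallelism(int maxParallelism) {
            transformation.setMaxParallelism(maxParallelism);
            return this;
        }
    }
}
```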


[jira] [Created] (FLINK-31921) Create a mini cluster based metric collection test

2023-04-24 Thread Maximilian Michels (Jira)
Maximilian Michels created FLINK-31921:
--

 Summary: Create a mini cluster based metric collection test
 Key: FLINK-31921
 URL: https://issues.apache.org/jira/browse/FLINK-31921
 Project: Flink
  Issue Type: Improvement
  Components: Autoscaler, Kubernetes Operator
Reporter: Maximilian Michels


We would benefit from an e2e test for metric collection which verifies 
assumptions we have about Flink metrics.





[jira] [Created] (FLINK-31920) Flaky tests

2023-04-24 Thread Maximilian Michels (Jira)
Maximilian Michels created FLINK-31920:
--

 Summary: Flaky tests 
 Key: FLINK-31920
 URL: https://issues.apache.org/jira/browse/FLINK-31920
 Project: Flink
  Issue Type: Bug
  Components: Kubernetes Operator
Reporter: Maximilian Michels


[ERROR] Errors: 
[ERROR]   FlinkOperatorTest.testConfigurationPassedToJOSDK:63 » NullPointer
[ERROR]   FlinkOperatorTest.testLeaderElectionConfig:108 » NullPointer
[ERROR]   HealthProbeTest.testHealthProbeEndpoint:64 » NullPointer
[INFO] 
[ERROR] Tests run: 323, Failures: 0, Errors: 3, Skipped: 0





Re: [VOTE] Release flink-connector-kafka 3.0.0 for Flink 1.17, release candidate #2

2023-04-21 Thread Maximilian Michels
Thanks for driving the release Gordon!

-Max

On Fri, Apr 21, 2023 at 12:33 AM Tzu-Li (Gordon) Tai
 wrote:
>
> We have unanimously approved this release.
>
> There are 6 approving votes, 3 of which are binding:
>
> * Alexander Sorokoumov
> * Martijn Visser (binding)
> * Tzu-Li (Gordon) Tai (binding)
> * Danny Cranmer (binding)
> * Ahmed Hamdy
> * Mason Chen
>
> Thanks so much everyone for testing and voting! I will now finalize the
> release.
>
> Thanks,
> Gordon
>
> On Thu, Apr 20, 2023 at 3:04 PM Mason Chen  wrote:
>
> > +1 (non-binding)
> >
> > * Verified hashes and signatures
> > * Verified no binaries
> > * Verified LICENSE and NOTICE files, pointing to 2023 as well
> > * Verified poms point to 3.0.0
> > * Reviewed web PR
> > * Built from source
> > * Verified git tag
> >
> > Best,
> > Mason
> >
> > On Thu, Apr 20, 2023 at 10:04 AM Ahmed Hamdy  wrote:
> >
> > > +1 (non-binding)
> > >
> > > - Release notes look good.
> > > - verified signatures and checksums are correct.
> > > - Verified no binaries in source archive.
> > > - Built from source
> > > - Approved Web PR (no comments if we are supporting 1.17+)
> > > Best Regards
> > > Ahmed
> > >
> > > On Thu, 20 Apr 2023 at 17:08, Danny Cranmer 
> > > wrote:
> > >
> > > > +1 (binding)
> > > >
> > > > - +1 on skipping 1.16
> > > > - Release notes look ok
> > > > - Verified signature/hashes of source archive
> > > > - Verified there are no binaries in the source archive
> > > > - Built from source
> > > > - Contents of Maven repo look good
> > > > - Verified NOTICE files
> > > > - Tag exists in Github
> > > > - Reviewed web PR (looks good apart from the open comment from Martijn)
> > > >
> > > >
> > > > On Tue, Apr 18, 2023 at 6:38 PM Tzu-Li (Gordon) Tai <
> > tzuli...@apache.org
> > > >
> > > > wrote:
> > > >
> > > > > +1 (binding)
> > > > >
> > > > > - Checked hashes and signatures
> > > > > - Built from source mvn clean install -Pcheck-convergence
> > > > > -Dflink.version=1.17.0
> > > > > - Eyeballed NOTICE license files
> > > > > - Started a Flink 1.17.0 cluster + Kafka 3.2.3 cluster, submitted a
> > SQL
> > > > > statement using the Kafka connector under exactly-once mode.
> > > > Checkpointing
> > > > > and restoring works, with or without throughput on the Kafka topic.
> > > > >
> > > > > Thanks,
> > > > > Gordon
> > > > >
> > > > > On Fri, Apr 14, 2023 at 2:13 AM Martijn Visser <
> > > martijnvis...@apache.org
> > > > >
> > > > > wrote:
> > > > >
> > > > > > +1 (binding)
> > > > > >
> > > > > > - Validated hashes
> > > > > > - Verified signature
> > > > > > - Verified that no binaries exist in the source archive
> > > > > > - Build the source with Maven via mvn clean install
> > > -Pcheck-convergence
> > > > > > -Dflink.version=1.17.0
> > > > > > - Verified licenses
> > > > > > - Verified web PR
> > > > > > - Started a cluster and the Flink SQL client, successfully read and
> > > > wrote
> > > > > > with the Kafka connector to Confluent Cloud with AVRO and Schema
> > > > Registry
> > > > > > enabled
> > > > > >
> > > > > > On Fri, Apr 14, 2023 at 12:24 AM Alexander Sorokoumov
> > > > > >  wrote:
> > > > > >
> > > > > > > +1 (nb).
> > > > > > >
> > > > > > > Checked:
> > > > > > >
> > > > > > >- checksums are correct
> > > > > > >- source code builds (JDK 8+11)
> > > > > > >- release notes are correct
> > > > > > >
> > > > > > >
> > > > > > > Best,
> > > > > > > Alex
> > > > > > >
> > > > > > >
> > > > > > > On Wed, Apr 12, 2023 at 5:07 PM Tzu-Li (Gordon) Tai <
> > > > > tzuli...@apache.org
> > > > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > A few important remarks about this release candidate:
> > > > > > > >
> > > > > > > > - As mentioned in the previous voting thread of RC1 [1], we've
> > > > > decided
> > > > > > to
> > > > > > > > skip releasing a version of the externalized Flink Kafka
> > > Connector
> > > > > > > matching
> > > > > > > > with Flink 1.16.x since the original vote thread stalled, and
> > > > > meanwhile
> > > > > > > > we've already completed externalizing all Kafka connector code
> > as
> > > > of
> > > > > > > Flink
> > > > > > > > 1.17.0.
> > > > > > > >
> > > > > > > > - As such, this RC is basically identical to the Kafka
> > connector
> > > > code
> > > > > > > > bundled with the Flink 1.17.0 release, PLUS a few critical
> > fixes
> > > > for
> > > > > > > > exactly-once violations, namely FLINK-31305, FLINK-31363, and
> > > > > > FLINK-31620
> > > > > > > > (please see release notes [2]).
> > > > > > > >
> > > > > > > > - As part of preparing this RC, I've also deleted the original
> > > v3.0
> > > > > > > branch
> > > > > > > > and re-named the v4.0 branch to replace it instead.
> > Effectively,
> > > > this
> > > > > > > > resets the versioning numbers for the externalized Flink Kafka
> > > > > > Connector
> > > > > > > > code repository, so that this first release of the repo starts
> > > from
> > > > > > > v3.0.0.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Gordon
> > > > > > > 

[jira] [Created] (FLINK-31867) Enforce a minimum number of observations within a metric window

2023-04-20 Thread Maximilian Michels (Jira)
Maximilian Michels created FLINK-31867:
--

 Summary: Enforce a minimum number of observations within a metric 
window
 Key: FLINK-31867
 URL: https://issues.apache.org/jira/browse/FLINK-31867
 Project: Flink
  Issue Type: Improvement
  Components: Autoscaler, Kubernetes Operator
Reporter: Maximilian Michels


The metric window is currently only time-based. We should also require a 
minimum number of observations so that scaling decisions are not based on too 
few data points.
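
The combined condition can be sketched as follows (hypothetical names; the real check would live inside the autoscaler's evaluation logic):

```java
public class MetricWindow {
    /** Scale only when the time window is full AND enough observations were collected. */
    static boolean readyToScale(long windowElapsedMs, long windowSizeMs,
                                int observations, int minObservations) {
        return windowElapsedMs >= windowSizeMs && observations >= minObservations;
    }
}
```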





[jira] [Created] (FLINK-31866) Autoscaler metric trimming reduces the number of metric observations

2023-04-20 Thread Maximilian Michels (Jira)
Maximilian Michels created FLINK-31866:
--

 Summary: Autoscaler metric trimming reduces the number of metric 
observations
 Key: FLINK-31866
 URL: https://issues.apache.org/jira/browse/FLINK-31866
 Project: Flink
  Issue Type: Bug
Reporter: Maximilian Michels
Assignee: Maximilian Michels


The autoscaler uses a ConfigMap to store past metric observations which is used 
to re-initialize the autoscaler state in case of failures or upgrades.

Whenever trimming of the ConfigMap occurs, we need to make sure we also update 
the timestamp for the start of the metric collection, so any removed 
observations can be compensated for by collecting new ones. If we don't do 
this, the metric window will effectively shrink as observations are removed.

This can lead to triggering scaling decisions when the operator gets redeployed 
due to the removed items.
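
A simplified sketch of the fix: when old observations are dropped, the collection start is moved forward so the window has to be refilled before decisions are made. The data structure and field names are hypothetical, not the operator's actual state classes:

```java
import java.time.Instant;
import java.util.TreeMap;

public class TrimmedHistory {
    final TreeMap<Instant, Double> observations = new TreeMap<>();
    Instant windowStart; // start of metric collection

    /** Trim the oldest entries and advance the window start accordingly,
     *  so the effective metric window does not silently shrink. */
    void trimTo(int maxEntries) {
        while (observations.size() > maxEntries) {
            observations.pollFirstEntry();
        }
        if (!observations.isEmpty() && windowStart != null
                && windowStart.isBefore(observations.firstKey())) {
            windowStart = observations.firstKey();
        }
    }
}
```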





[jira] [Created] (FLINK-31684) Autoscaler metrics are only visible after metric window is full

2023-03-31 Thread Maximilian Michels (Jira)
Maximilian Michels created FLINK-31684:
--

 Summary: Autoscaler metrics are only visible after metric window 
is full
 Key: FLINK-31684
 URL: https://issues.apache.org/jira/browse/FLINK-31684
 Project: Flink
  Issue Type: Improvement
  Components: Autoscaler, Kubernetes Operator
Reporter: Maximilian Michels
Assignee: Maximilian Michels


The metrics get reported only after the metric window is full. This is not 
helpful for observability after rescaling. We need to make sure that metrics 
are reported even when the metric window is not yet full.





[jira] [Created] (FLINK-31502) Limit the number of concurrent scale operations to reduce cluster churn

2023-03-17 Thread Maximilian Michels (Jira)
Maximilian Michels created FLINK-31502:
--

 Summary: Limit the number of concurrent scale operations to reduce 
cluster churn
 Key: FLINK-31502
 URL: https://issues.apache.org/jira/browse/FLINK-31502
 Project: Flink
  Issue Type: Improvement
  Components: Autoscaler, Kubernetes Operator
Reporter: Maximilian Michels
Assignee: Maximilian Michels
 Fix For: kubernetes-operator-1.5.0


Until we move to using the upcoming Rescale API, which recycles pods, we need to 
be mindful of how many deployments we scale at the same time, because each of 
them gives up all of its pods and then requests the new number of required 
pods.

This can cause churn in the cluster and temporarily lead to "unallocatable" 
pods, which triggers the k8s cluster autoscaler to add more cluster nodes. That 
is often not desirable because the resources actually required once the scaling 
has settled are lower.
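
A minimal sketch of such a limit: a small in-process throttle that caps how many deployments may be rescaling concurrently. Names and structure are illustrative, not operator code:

```java
import java.util.HashSet;
import java.util.Set;

public class ScalingThrottle {
    private final int maxConcurrent;
    private final Set<String> inProgress = new HashSet<>();

    ScalingThrottle(int maxConcurrent) {
        this.maxConcurrent = maxConcurrent;
    }

    /** Returns true if the deployment may start (or continue) rescaling now. */
    synchronized boolean tryAcquire(String deployment) {
        if (inProgress.contains(deployment)) {
            return true; // already rescaling
        }
        if (inProgress.size() >= maxConcurrent) {
            return false; // defer until a slot frees up
        }
        inProgress.add(deployment);
        return true;
    }

    /** Call once the deployment's rescale operation has completed. */
    synchronized void release(String deployment) {
        inProgress.remove(deployment);
    }
}
```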





Re: [Vote] FLIP-298: Unifying the Implementation of SlotManager

2023-03-13 Thread Maximilian Michels
+1 (binding)

On Mon, Mar 13, 2023 at 11:33 AM Etienne Chauchot  wrote:
>
> +1 (not binding)
>
> Etienne
>
> Le 11/03/2023 à 07:37, Yangze Guo a écrit :
> > +1 (binding)
> >
> > Zhanghao Chen  于 2023年3月10日周五 下午5:07写道:
> >
> >> Thanks Weihua. +1 (non-binding)
> >>
> >> Best,
> >> Zhanghao Chen
> >> 
> >> From: Weihua Hu 
> >> Sent: Thursday, March 9, 2023 13:27
> >> To: dev 
> >> Subject: [Vote] FLIP-298: Unifying the Implementation of SlotManager
> >>
> >> Hi Everyone,
> >>
> >> I would like to start the vote on FLIP-298: Unifying the Implementation
> >> of SlotManager [1]. The FLIP was discussed in this thread [2].
> >>
> >> This FLIP aims to unify the implementation of SlotManager in
> >> order to reduce maintenance costs.
> >>
> >> The vote will last for at least 72 hours (03/14, 15:00 UTC+8)
> >> unless there is an objection or insufficient votes. Thank you all.
> >>
> >> [1]
> >>
> >> https://cwiki.apache.org/confluence/display/FLINK/FLIP-298%3A+Unifying+the+Implementation+of+SlotManager
> >> [2]https://lists.apache.org/thread/ocssfxglpc8z7cto3k8p44mrjxwr67r9
> >>
> >> Best,
> >> Weihua
> >>


Re: [DISCUSS] FLIP-298: Unifying the Implementation of SlotManager

2023-03-10 Thread Maximilian Michels
+1 on the proposal.

I'm wondering about the scale down behavior of the unified
implementation. Does the new unified implementation prioritize
releasing entire task managers in favor of evenly spreading out task
managers? Consider a scale down from parallelism 15 to parallelism 10
where each task manager has 5 slots. We could either spread out 10
slots among the 3 task managers, or fit the 10 slots in 2 task
managers and surrender the task manager (at least on k8s). From a cost
saving perspective, the latter would be preferable.
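
The arithmetic from the example above as a tiny helper (hypothetical name, for illustration only):

```java
public class SlotPlacement {
    /** Fewest task managers that can host the given slots (ceiling division);
     *  any task managers beyond this count could be surrendered. */
    static int minTaskManagers(int slots, int slotsPerTm) {
        return (slots + slotsPerTm - 1) / slotsPerTm;
    }
}
```

With 10 slots and 5 slots per task manager this yields 2, so one of the original 3 task managers could be released.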

-Max

On Thu, Mar 9, 2023 at 4:17 PM Matthias Pohl
 wrote:
>
> Thanks for your clarification. I have nothing else to add to the
> discussion. +1 from my side to proceed
>
> On Wed, Mar 8, 2023 at 4:16 AM Weihua Hu  wrote:
>
> > Thanks Yangze for your attention, this would be a great help.
> >
> > And thanks Matthias too.
> >
> > FLIP-156 [1] mentions some incompatibility between fine-grained resource
> > > management and reactive mode. I assume that this is independent of the
> > > SlotManager and replacing the DSM with the FGSM wouldn't affect reactive
> > > mode?
> >
> > Yes. This incompatibility is independent of SlotManager. That means the
> > AdaptiveScheduler will always ignore the resource requirement set by
> > slotSharingGroup and declare Unknown ResourceProfile to SlotManager.
> > So, using FGSM as default will not affect reactive mode.
> >
> > About the heterogeneous TaskManager: This is a feature that's also not
> > > supported in the DSM right now, is it? We should state that fact in the
> > > FLIP if we mentioned that we don't want to implement it for the FGSM.
> >
> > Yes, neither the DSM nor the FGSM supports requesting heterogeneous
> > TaskManagers right now. Heterogeneity would make the resource allocation
> > logic more complicated and could introduce resource deadlocks, e.g. if
> > request A
> > takes the bigger slot intended for request B, request B can no longer fit
> > into the smaller slot intended for A. We
> > need to think more before starting to support heterogeneous task
> > managers.
> > So, we don't want to implement heterogeneity in this FLIP.
> >
> > Best,
> > Weihua
> >
> >
> > On Wed, Mar 8, 2023 at 12:44 AM Matthias Pohl
> >  wrote:
> >
> > > Thanks for updating the FLIP and adding more context to it. Additionally,
> > > thanks to Xintong and Yangze for offering your expertise here as
> > > contributors to the initial FineGrainedSlotManager implementation.
> > >
> > > The remark on cutting out functionality was only based on some
> > superficial
> > > initial code reading. I cannot come up with a better code structure
> > myself.
> > > Therefore, I'm fine with not refactoring the code as part of this FLIP.
> > >
> > > The strategies that were proposed around making sure that the refactoring
> > > is properly backed by tests sound reasonable. My initial concern was
> > based
> > > on the fact that we might have unit test scenarios for the DSM that are
> > not
> > > covered in the unit tests of the FGSM. In that case, swapping the DSM
> > with
> > > the FGSM might not be good enough. Going over the DSM tests to make sure
> > > that we're not accidentally deleting test scenarios sounds good to me.
> > > Thanks, Weihua.
> > >
> > > FLIP-156 [1] mentions some incompatibility between fine-grained resource
> > > management and reactive mode. I assume that this is independent of the
> > > SlotManager and replacing the DSM with the FGSM wouldn't affect reactive
> > > mode?
> > >
> > > About the heterogeneous TaskManager: This is a feature that's also not
> > > supported in the DSM right now, is it? We should state that fact in the
> > > FLIP if we mentioned that we don't want to implement it for the FGSM.
> > >
> > > Best,
> > > Matthias
> > >
> > > [1]
> > >
> > >
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-156%3A+Runtime+Interfaces+for+Fine-Grained+Resource+Requirements
> > >
> > > On Tue, Mar 7, 2023 at 8:58 AM Yangze Guo  wrote:
> > >
> > > > Hi Weihua,
> > > >
> > > > Thanks for driving this. As Xintong mentioned, this was a technical
> > > > debt from FLIP-56.
> > > >
> > > > The latest version of FLIP sounds good, +1 from my side. As a
> > > > contributor to this component, I'm willing to assist with the review
> > > > process. Feel free to reach me if you need help.
> > > >
> > > > Best,
> > > > Yangze Guo
> > > >
> > > > On Tue, Mar 7, 2023 at 1:47 PM Weihua Hu 
> > wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > @David @Matthias
> > > > > A few days have passed since your last feedback. I would like to
> > know
> > > if
> > > > > there are any other concerns about this FLIP.
> > > > >
> > > > >
> > > > > Best,
> > > > > Weihua
> > > > >
> > > > >
> > > > > On Mon, Mar 6, 2023 at 7:53 PM Weihua Hu 
> > > wrote:
> > > > >
> > > > > >
> > > > > > Thanks Shammon,
> > > > > >
> > > > > > I've updated FLIP to add this redundant Task Manager limitation.
> > > > > >
> > > > > >
> > > > > > Best,
> > > > > > Weihua
> > > > > >
> > > > > >
> > > > > > On Mon, Mar 6, 2023 at 5:07 PM 

[jira] [Created] (FLINK-31400) Add autoscaler integration for Iceberg source

2023-03-10 Thread Maximilian Michels (Jira)
Maximilian Michels created FLINK-31400:
--

 Summary: Add autoscaler integration for Iceberg source
 Key: FLINK-31400
 URL: https://issues.apache.org/jira/browse/FLINK-31400
 Project: Flink
  Issue Type: New Feature
  Components: Autoscaler, Kubernetes Operator
Reporter: Maximilian Michels
 Fix For: kubernetes-operator-1.5.0


A very critical part in the scaling algorithm is setting the source parallelism 
correctly such that the Flink pipeline can keep up with the ingestion rate. The 
autoscaler does that by looking at the {{pendingRecords}} Flink source metric. 
Even if that metric is not available, the source can still be sized according 
to the busyTimeMsPerSecond metric, but there will be no backlog information 
available. For Kafka, the autoscaler also determines the number of partitions 
to avoid scaling higher than the maximum number of partitions.

In order to support a wider range of use cases, we should investigate an 
integration with the Iceberg source. As far as I know, it does not expose the 
pendingRecords metric, nor does the autoscaler know about other constraints, 
e.g. the maximum number of open files.
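As a rough sketch of the sizing constraint described above (names and numbers are illustrative, not the autoscaler's actual API):

```python
import math

def target_source_parallelism(ingestion_rate, rate_per_subtask, max_partitions=None):
    # Size the source so it keeps up with the ingestion rate. For Kafka,
    # the autoscaler additionally caps parallelism at the partition count,
    # since extra subtasks would sit idle; an analogous cap for Iceberg
    # (e.g. maximum number of open files) is what this issue asks about.
    parallelism = math.ceil(ingestion_rate / rate_per_subtask)
    if max_partitions is not None:
        parallelism = min(parallelism, max_partitions)
    return max(parallelism, 1)

assert target_source_parallelism(1000, 100) == 10
assert target_source_parallelism(1000, 100, max_partitions=8) == 8
```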



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-31345) Trim autoscaler configMap to not exceed 1mb size limit

2023-03-06 Thread Maximilian Michels (Jira)
Maximilian Michels created FLINK-31345:
--

 Summary: Trim autoscaler configMap to not exceed 1mb size limit
 Key: FLINK-31345
 URL: https://issues.apache.org/jira/browse/FLINK-31345
 Project: Flink
  Issue Type: Bug
  Components: Autoscaler, Kubernetes Operator
Affects Versions: kubernetes-operator-1.4.0
Reporter: Maximilian Michels
 Fix For: kubernetes-operator-1.5.0


When the {{autoscaler-}} ConfigMap, which is used to persist 
scaling decisions and metrics, becomes too large, the following error is thrown 
consistently:

{noformat}
io.fabric8.kubernetes.client.KubernetesClientException: Operation: [replace]  
for kind: [ConfigMap]  with name: [deployment]  in namespace: [namespace]  
failed.
    at 
io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:159)
    at 
io.fabric8.kubernetes.client.dsl.internal.HasMetadataOperation.lambda$replace$0(HasMetadataOperation.java:169)
    at 
io.fabric8.kubernetes.client.dsl.internal.HasMetadataOperation.replace(HasMetadataOperation.java:172)
    at 
io.fabric8.kubernetes.client.dsl.internal.HasMetadataOperation.replace(HasMetadataOperation.java:113)
    at 
io.fabric8.kubernetes.client.dsl.internal.HasMetadataOperation.replace(HasMetadataOperation.java:41)
    at 
io.fabric8.kubernetes.client.extension.ResourceAdapter.replace(ResourceAdapter.java:252)
    at 
org.apache.flink.kubernetes.operator.autoscaler.AutoScalerInfo.replaceInKubernetes(AutoScalerInfo.java:167)
    at 
org.apache.flink.kubernetes.operator.autoscaler.JobAutoScalerImpl.scale(JobAutoScalerImpl.java:113)
    at 
org.apache.flink.kubernetes.operator.reconciler.deployment.AbstractFlinkResourceReconciler.reconcile(AbstractFlinkResourceReconciler.java:178)
    at 
org.apache.flink.kubernetes.operator.controller.FlinkDeploymentController.reconcile(FlinkDeploymentController.java:130)
    at 
org.apache.flink.kubernetes.operator.controller.FlinkDeploymentController.reconcile(FlinkDeploymentController.java:56)
    at 
io.javaoperatorsdk.operator.processing.Controller$1.execute(Controller.java:145)
    at 
io.javaoperatorsdk.operator.processing.Controller$1.execute(Controller.java:103)
    at 
org.apache.flink.kubernetes.operator.metrics.OperatorJosdkMetrics.timeControllerExecution(OperatorJosdkMetrics.java:80)
    at 
io.javaoperatorsdk.operator.processing.Controller.reconcile(Controller.java:102)
    at 
io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.reconcileExecution(ReconciliationDispatcher.java:139)
    at 
io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleReconcile(ReconciliationDispatcher.java:119)
    at 
io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleDispatch(ReconciliationDispatcher.java:89)
    at 
io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleExecution(ReconciliationDispatcher.java:62)
    at 
io.javaoperatorsdk.operator.processing.event.EventProcessor$ReconcilerExecutor.run(EventProcessor.java:406)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown 
Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown 
Source)
    at java.base/java.lang.Thread.run(Unknown Source)
Caused by: java.io.IOException: stream was reset: NO_ERROR
    at 
io.fabric8.kubernetes.client.dsl.internal.OperationSupport.waitForResult(OperationSupport.java:514)
    at 
io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleResponse(OperationSupport.java:551)
    at 
io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleUpdate(OperationSupport.java:347)
    at 
io.fabric8.kubernetes.client.dsl.internal.BaseOperation.handleUpdate(BaseOperation.java:680)
    at 
io.fabric8.kubernetes.client.dsl.internal.HasMetadataOperation.lambda$replace$0(HasMetadataOperation.java:167)
    ... 21 more
Caused by: okhttp3.internal.http2.StreamResetException: stream was reset: 
NO_ERROR
    at 
okhttp3.internal.http2.Http2Stream.checkOutNotClosed$okhttp(Http2Stream.kt:646)
    at 
okhttp3.internal.http2.Http2Stream$FramingSink.emitFrame(Http2Stream.kt:557)
    at okhttp3.internal.http2.Http2Stream$FramingSink.write(Http2Stream.kt:532)
    at okio.ForwardingSink.write(ForwardingSink.kt:29)
    at 
okhttp3.internal.connection.Exchange$RequestBodySink.write(Exchange.kt:218)
    at okio.RealBufferedSink.emitCompleteSegments(RealBufferedSink.kt:255)
    at okio.RealBufferedSink.write(RealBufferedSink.kt:185)
    at okhttp3.RequestBody$Companion$toRequestBody$2.writeTo(RequestBody.kt:152)
    at 
okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.kt:59)
    at 
okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
    at 
okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.kt:34)
    at 
okhttp3
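A minimal sketch of the trimming approach implied by the issue title (illustrative only; the constant and function names are assumptions, not the operator's actual code):

```python
import json

# Kubernetes limits objects such as ConfigMaps to roughly 1 MiB.
MAX_CONFIG_MAP_BYTES = 1024 * 1024

def trim_scaling_history(history, max_bytes=MAX_CONFIG_MAP_BYTES):
    # Drop the oldest entries until the serialized payload fits the limit,
    # so that replacing the ConfigMap no longer fails.
    while history and len(json.dumps(history).encode("utf-8")) > max_bytes:
        history.pop(0)
    return history
```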

[jira] [Created] (FLINK-31299) PendingRecords metric might not be available

2023-03-02 Thread Maximilian Michels (Jira)
Maximilian Michels created FLINK-31299:
--

 Summary: PendingRecords metric might not be available
 Key: FLINK-31299
 URL: https://issues.apache.org/jira/browse/FLINK-31299
 Project: Flink
  Issue Type: Bug
  Components: Autoscaler, Kubernetes Operator
Reporter: Maximilian Michels
Assignee: Maximilian Michels
 Fix For: kubernetes-operator-1.5.0


The Kafka pendingRecords metric is only initialized on receiving the first 
record. For empty topics or checkpointed topics without any incoming data, the 
metric won't appear.

We need to handle this case in the autoscaler and allow downscaling.
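A sketch of the fallback described above (illustrative; the metric lookup is simplified to a dict):

```python
def source_backlog(metrics):
    # pendingRecords is only registered once the first record arrives.
    # Treat a missing metric as an empty backlog so the autoscaler can
    # still evaluate the vertex and allow downscaling instead of stalling.
    return int(metrics.get("pendingRecords", 0))

assert source_backlog({}) == 0
assert source_backlog({"pendingRecords": 42}) == 42
```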





Re: [DISCUSS] FLIP-291: Externalized Declarative Resource Management

2023-02-28 Thread Maximilian Michels
I agree that it is useful to have a configurable lower bound. Thanks
for looking into it as part of a follow up!

No objections from my side to move forward with the vote.

-Max
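(The slot arithmetic from my earlier example in this thread, as a quick illustrative sketch — not scheduler code:)

```python
# With slot sharing disabled, a job needs one slot per parallel subtask.
# Declaring [min, max] per vertex lets the scheduler pre-allocate up to
# max while guaranteeing it never runs below min.
bounds = {"v1": (50, 70), "v2": (10, 20)}

min_slots = sum(lo for lo, hi in bounds.values())
max_slots = sum(hi for lo, hi in bounds.values())

assert min_slots == 60  # current deployment: v1=50, v2=10
assert max_slots == 90  # worst case after scale-up
```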

On Tue, Feb 28, 2023 at 1:36 PM David Morávek  wrote:
>
> > I suppose we could further remove the min because it would always be
> safer to scale down if resources are not available than not to run at
> all [1].
>
> Apart from what @Roman has already mentioned, there are still cases where
> we're certain that there is no point in running the jobs with resources
> lower than X; e.g., because the state is too large to be processed with
> parallelism of 1; this allows you not to waste resources if you're certain
> that the job would go into the restart loop / won't be able to checkpoint
>
> I believe that for most use cases, simply keeping the lower bound at 1 will
> be sufficient.
>
> > I saw that the minimum bound is currently not used in the code you posted
> above [2]. Is that still planned?
>
> Yes. We already allow setting the lower bound via API, but it's not
> considered by the scheduler. I'll address this limitation in a separate
> issue.
>
> > Note that originally we had assumed min == max but I think that would be
> a less safe scaling approach because we would get stuck waiting for
> resources when they are not available, e.g. k8s resource limits reached.
>
> 100% agreed; The above-mentioned knobs should allow you to balance the
> trade-off.
>
>
> Does that make sense?
>
> Best,
> D.
>
>
>
> On Tue, Feb 28, 2023 at 1:14 PM Roman Khachatryan  wrote:
>
> > Hi,
> >
> > Thanks for the update, I think distinguishing the rescaling behaviour and
> > the desired parallelism declaration is important.
> >
> > Having the ability to specify min parallelism might be useful in
> > environments with multiple jobs: Scheduler will then have an option to stop
> > the less suitable job.
> > In other setups, where the job should not be stopped at all, the user can
> > always set it to 0.
> >
> > Regards,
> > Roman
> >
> >
> > On Tue, Feb 28, 2023 at 12:58 PM Maximilian Michels 
> > wrote:
> >
> >> Hi David,
> >>
> >> Thanks for the update! We are considering using the new declarative resource
> >> API for autoscaling. Currently, we treat a scaling decision as a new
> >> deployment which means surrendering all resources to Kubernetes and
> >> subsequently reallocating them for the rescaled deployment. The
> >> declarative resource management API is a great step forward because it
> >> allows us to do faster and safer rescaling. Faster, because we can
> >> continue to run while resources are pre-allocated which minimizes
> >> downtime. Safer, because we can't get stuck when the desired resources
> >> are not available.
> >>
> >> An example with two vertices and their respective parallelisms:
> >>   v1: 50
> >>   v2: 10
> >> Let's assume slot sharing is disabled, so we need 60 task slots to run
> >> the vertices.
> >>
> >> If the autoscaler was to decide to scale up v1 and v2, it could do so
> >> in a safe way by using min/max configuration:
> >>   v1: [min: 50, max: 70]
> >>   v2: [min: 10, max: 20]
> >> This would then need 90 task slots to run at max capacity.
> >>
> >> I suppose we could further remove the min because it would always be
> >> safer to scale down if resources are not available than to not run at
> >> all [1]. In fact, I saw that the minimum bound is currently not used
> >> in the code you posted above [2]. Is that still planned?
> >>
> >> -Max
> >>
> >> PS: Note that originally we had assumed min == max but I think that
> >> would be a less safe scaling approach because we would get stuck
> >> waiting for resources when they are not available, e.g. k8s resource
> >> limits reached.
> >>
> >> [1] However, there might be costs involved with executing the
> >> rescaling, e.g. for using external storage like s3, especially without
> >> local recovery.
> >> [2]
> >> https://github.com/dmvk/flink/commit/5e7edcb77d8522c367bc6977f80173b14dc03ce9
> >>
> >> On Tue, Feb 28, 2023 at 9:33 AM David Morávek  wrote:
> >> >
> >> > Hi Everyone,
> >> >
> >> > We had some more talks about the pre-allocation of resources with @Max,
> >> and
> >> > here is the final state that we've converged to for now:
> >> >
> &g

Re: [DISCUSS] FLIP-291: Externalized Declarative Resource Management

2023-02-28 Thread Maximilian Michels
> > Good catch! I'm fixing it now, thanks!
> >
> > [1]
> > https://github.com/dmvk/flink/commit/5e7edcb77d8522c367bc6977f80173b14dc03ce9#diff-a4b690fb2c4975d25b05eb4161617af0d704a85ff7b1cad19d3c817c12f1e29cR1151
> >
> > Best,
> > D.
> >
> > On Tue, Feb 21, 2023 at 12:24 AM John Roesler  wrote:
> >
> >> Thanks for the FLIP, David!
> >>
> >> I just had one small question. IIUC, the REST API PUT request will go
> >> through the new DispatcherGateway method to be handled. Then, after
> >> validation, the dispatcher would call the new JobMasterGateway method to
> >> actually update the job.
> >>
> >> Which component will write the updated JobGraph? I just wanted to make
> >> sure it’s the JobMaster because if it were the dispatcher, there could be a
> >> race condition with the async JobMaster method.
> >>
> >> Thanks!
> >> -John
> >>
> >> On Mon, Feb 20, 2023, at 07:34, Matthias Pohl wrote:
> >> > Thanks for your clarifications, David. I don't have any additional major
> >> > points to add. One thing about the FLIP: The RPC layer API for updating
> >> the
> >> > JRR returns a future with a JRR? I don't see value in returning a JRR
> >> here
> >> > since it's an idempotent operation? Wouldn't it be enough to return
> >> > CompletableFuture here? Or am I missing something?
> >> >
> >> > Matthias
> >> >
> >> > On Mon, Feb 20, 2023 at 1:48 PM Maximilian Michels 
> >> wrote:
> >> >
> >> >> Thanks David! If we could get the pre-allocation working as part of
> >> >> the FLIP, that would be great.
> >> >>
> >> >> Concerning the downscale case, I agree this is a special case for the
> >> >> (single-job) application mode where we could re-allocate slots in a
> >> >> way that could leave entire task managers unoccupied which we would
> >> >> then be able to release. The goal essentially is to reduce slot
> >> >> fragmentation on scale down by packing the slots efficiently. The
> >> >> easiest way to add this optimization when running in application mode
> >> >> would be to drop as many task managers during the restart such that
> >> >> NUM_REQUIRED_SLOTS >= NUM_AVAILABLE_SLOTS stays true. We can look into
> >> >> this independently of the FLIP.
> >> >>
> >> >> Feel free to start the vote.
> >> >>
> >> >> -Max
> >> >>
> >> >> On Mon, Feb 20, 2023 at 9:10 AM David Morávek  wrote:
> >> >> >
> >> >> > Hi everyone,
> >> >> >
> >> >> > Thanks for the feedback! I've updated the FLIP to use idempotent PUT
> >> API
> >> >> instead of PATCH and to properly handle lower bound settings, to
> >> support
> >> >> the "pre-allocation" of the resources.
> >> >> >
> >> >> > @Max
> >> >> >
> >> >> > > How hard would it be to address this issue in the FLIP?
> >> >> >
> >> >> > I've included this in the FLIP. It might not be too hard to implement
> >> >> this in the end.
> >> >> >
> >> >> > > B) drop as many superfluous task managers as needed
> >> >> >
> >> >> > I've intentionally left this part out for now because this ultimately
> >> >> needs to be the responsibility of the Resource Manager. After all, in
> >> the
> >> >> Session Cluster scenario, the Scheduler doesn't have the bigger
> >> picture of
> >> >> other tasks of other jobs running on those TMs. This will most likely
> >> be a
> >> >> topic for another FLIP.
> >> >> >
> >> >> > WDYT? If there are no other questions or concerns, I'd like to start
> >> the
> >> >> vote on Wednesday.
> >> >> >
> >> >> > Best,
> >> >> > D.
> >> >> >
> >> >> > On Wed, Feb 15, 2023 at 3:34 PM Maximilian Michels 
> >> >> wrote:
> >> >> >>
> >> >> >> I missed that the FLIP states:
> >> >> >>
> >> >> >> > Currently, even though we’d expose the lower bound for clarity and
> >> >>
