Re: Flink Kubernetes HA

2024-01-15 Thread Yang Wang
Since v6.6.2, the fabric8 K8s client uses PATCH to replace get-and-update. That's why you also need to grant the PATCH permission to the K8s service account. This helps decrease the pressure on the K8s APIServer. You can find more information here[1]. [1].
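For readers hitting the same permission error, a minimal sketch of the RBAC rule this implies (the role name, namespace, and service account name "flink" are assumptions, not from the thread; the verb list follows the usual set Flink's Kubernetes HA services need on ConfigMaps, plus the "patch" verb discussed here):

    apiVersion: rbac.authorization.k8s.io/v1
    kind: Role
    metadata:
      name: flink-ha          # hypothetical name
      namespace: default      # hypothetical namespace
    rules:
      - apiGroups: [""]
        resources: ["configmaps"]
        # "patch" is the verb this thread is about; the others are the
        # permissions Flink's Kubernetes HA already required before.
        verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      name: flink-ha
      namespace: default
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: Role
      name: flink-ha
    subjects:
      - kind: ServiceAccount
        name: flink           # hypothetical service account
        namespace: default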

Re: Flink Kubernetes HA

2023-12-06 Thread Zhanghao Chen
…leader election is performed with a unified ConfigMap. Best, Zhanghao Chen From: Ethan T Yang Sent: Wednesday, December 6, 2023 5:40 To: user@flink.apache.org Subject: Flink Kubernetes HA Hi Flink users, After upgrading Flink (from 1.13.1 -> 1.18.

Re: Flink Kubernetes HA

2023-12-06 Thread Ethan T Yang
Never mind. The issue was fixed: the service account permission was missing the “patch” verb, which led to the RPC service not starting. > On Dec 5, 2023, at 1:40 PM, Ethan T Yang wrote: > > Hi Flink users, > After upgrading Flink (from 1.13.1 -> 1.18.0), I noticed an issue when > HA is

Flink Kubernetes HA

2023-12-05 Thread Ethan T Yang
Hi Flink users, After upgrading Flink (from 1.13.1 -> 1.18.0), I noticed an issue when HA is enabled (see exception below). I am using a k8s deployment, and I cleaned the previous ConfigMaps, like the leader files etc. I know Pekko is a recent thing. Can someone share a doc on how to use or
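As context for the Pekko remark: Flink 1.18 replaced Akka with Apache Pekko for its RPC layer, so the old akka.* options were renamed to pekko.*. A hedged example (the value shown is illustrative, not from the thread):

    # flink-conf.yaml
    # Pre-1.18 name:      akka.ask.timeout
    # 1.18 (Pekko) name:  pekko.ask.timeout
    pekko.ask.timeout: 10 s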

Re: Bulk Scheduler timeout when creating several jobs in flink kubernetes HA deployment

2021-08-26 Thread Gil De Grove
Hello Matthias, I'll extract the logs from the cluster and update that here. For the TMs, I'll try to find relevant logs; we had many of them deployed at that time, and all of the logs may not be that interesting to upload. Regards, Gil On Thu, Aug 26, 2021, 12:31 Matthias Pohl wrote: > Hi

Re: Bulk Scheduler timeout when creating several jobs in flink kubernetes HA deployment

2021-08-26 Thread Matthias Pohl
Hi Gil, could you provide the complete logs (TaskManager & JobManager) for us to investigate it? The error itself and the behavior you're describing sound like expected behavior if there are not enough slots available for all the submitted jobs to be handled in time. Have you tried increasing the
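The reply is cut off here, but the knobs it is presumably pointing at are the slots per TaskManager and the bulk slot allocation timeout. A sketch with illustrative values (check your Flink version's docs before copying):

    # flink-conf.yaml
    taskmanager.numberOfTaskSlots: 4   # more slots available per TaskManager
    slot.request.timeout: 600000       # slot allocation timeout, in milliseconds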

Bulk Scheduler timeout when creating several jobs in flink kubernetes HA deployment

2021-08-25 Thread Gil De Grove
Hello, We are struggling a bit with an error in our Kubernetes deployment. The deployment is composed of 2 Flink JobManagers and 58 TaskManagers. When deploying the jobs, everything goes fine at first, but after the deployment of several jobs (a mix of batch and streaming jobs using the SQL
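Side note on running two JobManagers: with Flink's native Kubernetes integration, standby JobManagers are requested via a config option. This is a sketch under the assumption of a native-mode deployment; in standalone mode you would instead set the replica count on the JobManager Deployment itself:

    # flink-conf.yaml (native Kubernetes mode)
    kubernetes.jobmanager.replicas: 2   # one active + one standby; only takes effect with HA enabled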

Re: Flink Kubernetes HA

2021-06-23 Thread Yang Wang
From the implementation of DefaultCompletedCheckpointStore, Flink will only retain the configured number of checkpoints. Maybe you could also check the content of the jobmanager-leader ConfigMap. It should have the same number of pointers for the completedCheckpoint. Best, Yang Ivan Yang
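The "configured number" Yang refers to is the retained-checkpoints setting; a minimal sketch (the value is illustrative):

    # flink-conf.yaml
    state.checkpoints.num-retained: 3   # DefaultCompletedCheckpointStore keeps at most this many completed checkpoints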

Re: Flink Kubernetes HA

2021-06-23 Thread Ivan Yang
Thanks for the reply. Yes, we are seeing all the completedCheckpoint files, and they keep growing. We will revisit our k8s setup, ConfigMaps, etc. > On Jun 23, 2021, at 2:09 AM, Yang Wang wrote: > > Hi Ivan, > > Regarding completedCheckpoint files that keep growing, do you mean too many > files

Flink Kubernetes HA

2021-06-22 Thread Ivan Yang
Hi Dear Flink users, We recently enabled the ZooKeeper-less HA in our Kubernetes Flink deployment. The setup has high-availability.storageDir: s3://some-bucket/recovery. We have a relatively short retention policy (7 days) on the s3 bucket, so the HA will fail if the
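For reference, a hedged sketch of a Kubernetes-HA configuration like the one described (bucket name and cluster id are placeholders, not the poster's real values; these are the 1.13-era keys, while newer releases select the backend via high-availability.type):

    # flink-conf.yaml (Flink 1.13-era keys)
    kubernetes.cluster-id: my-flink-cluster
    high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
    high-availability.storageDir: s3://some-bucket/recovery

Whatever the exact keys, the objects under storageDir are live HA metadata rather than disposable artifacts, so an S3 lifecycle rule that expires them will break recovery, which is the failure this thread describes.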