Hi Matthias, I was wondering whether there are any timeout or heartbeat configuration options available for the Kubernetes HA services (KubeHA).
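To make the question concrete, here is a sketch of the kind of settings I mean. The keys are the leader-election and retry options as I understand them from the Flink configuration documentation. I have not verified that all of them apply to 1.14.x, and the values are placeholders, not recommendations:

```yaml
# flink-conf.yaml fragment (illustrative values only; check the docs for your Flink version)

# How long a leader holds the lease before other contenders may take over.
high-availability.kubernetes.leader-election.lease-duration: 30 s

# How long the current leader keeps trying to renew before giving up leadership.
high-availability.kubernetes.leader-election.renew-deadline: 30 s

# Interval between attempts to acquire or renew leadership.
high-availability.kubernetes.leader-election.retry-period: 10 s

# Retries for transactional ConfigMap updates before the client gives up.
kubernetes.transactional-operation.max-retries: 10
```

Would raising values like these help the JM ride out short-lived errors from the Kube API server, or is something else involved?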
Thanks.

On Mon, 5 Feb 2024 at 8:58 PM, Matthias Pohl <matthias.p...@aiven.io.invalid> wrote:

> That's stated in the Jira issue. I didn't have the time to investigate it
> further.
>
> On Mon, Feb 5, 2024 at 1:55 PM Lavkesh Lahngir <lavk...@linux.com> wrote:
>
> > Hi Matthias,
> > Thanks for the suggestion. Do we know which part of code caused this issue
> > and how it was fixed?
> >
> > Thanks!
> >
> > On Mon, 5 Feb 2024 at 18:06, Matthias Pohl <matthias.p...@aiven.io.invalid>
> > wrote:
> >
> > > Hi Lavkesh,
> > > FLINK-33998 [1] sounds quite similar to what you describe.
> > >
> > > The solution was to upgrade to Flink version 1.14.6. I didn't have the
> > > capacity to look into the details considering that the mentioned Flink
> > > version 1.14 is not officially supported by the community anymore and a fix
> > > seems to have been provided with a newer version.
> > >
> > > Matthias
> > >
> > > [1] https://issues.apache.org/jira/browse/FLINK-33998
> > >
> > > On Mon, Feb 5, 2024 at 6:18 AM Lavkesh Lahngir <lavk...@linux.com> wrote:
> > >
> > > > Hii, Few more details:
> > > > We are running GKE version 1.27.7-gke.1121002.
> > > > and using flink version 1.14.3.
> > > >
> > > > Thanks!
> > > >
> > > > On Mon, 5 Feb 2024 at 12:05, Lavkesh Lahngir <lavk...@linux.com> wrote:
> > > >
> > > > > Hii All,
> > > > >
> > > > > We run a Flink operator on GKE, deploying one Flink job per job manager.
> > > > > We utilize
> > > > > org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
> > > > > for high availability. The JobManager employs config maps for checkpointing
> > > > > and leader election. If, at any point, the Kube API server returns an error
> > > > > (5xx or 4xx), the JM pod is restarted. This occurrence is sporadic,
> > > > > happening every 1-2 days for some jobs among the 400 running in the same
> > > > > cluster, each with its JobManager pod.
> > > > >
> > > > > What might be causing these errors from the Kube? One possibility is that
> > > > > when the JM writes the config map and attempts to retrieve it immediately
> > > > > after, it could result in a 404 error.
> > > > > Are there any configurations to increase heartbeat or timeouts that might
> > > > > be causing temporary disconnections from the Kube API server?
> > > > >
> > > > > Thank you!