Hi, a few more details:
We are running GKE version 1.27.7-gke.1121002
and Flink version 1.14.3.

Thanks!

On Mon, 5 Feb 2024 at 12:05, Lavkesh Lahngir <lavk...@linux.com> wrote:

> Hi all,
>
> We run a Flink operator on GKE, deploying one Flink job per JobManager. We
> use
> org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
> for high availability. The JobManager uses ConfigMaps for checkpoint
> metadata and leader election. If, at any point, the Kubernetes API server
> returns an error (5xx or 4xx), the JM pod is restarted. This happens
> sporadically, every 1-2 days for some of the ~400 jobs running in the same
> cluster, each with its own JobManager pod.
>
> What might be causing these errors from the Kubernetes API server? One
> possibility is that when the JM writes a ConfigMap and attempts to read it
> back immediately, the read could return a 404.
> Are there any configuration options to increase heartbeats or timeouts, so
> that temporary disconnections from the Kubernetes API server can be
> tolerated instead of restarting the pod?
>
> Thank you!
>
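For anyone landing on this thread: the kind of knobs the quoted mail asks about do exist in the Flink 1.14 Kubernetes HA configuration. A sketch of the relevant flink-conf.yaml entries follows; the option names are from the Flink documentation, but the values are illustrative (not tuned recommendations) and the storage path is a hypothetical placeholder:

```yaml
# Sketch of flink-conf.yaml settings relevant to Kubernetes HA resilience.
# Option names per Flink 1.14 docs; values illustrative, storageDir hypothetical.
high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
high-availability.storageDir: gs://my-bucket/flink/ha   # hypothetical bucket

# Retry failed ConfigMap read/write operations instead of failing fast
# (default: 5).
kubernetes.transactional-operation.max-retries: 15

# Loosen leader-election timing so a brief API-server hiccup does not
# cost the JobManager its leadership (defaults: 15s / 15s / 5s).
high-availability.kubernetes.leader-election.lease-duration: 60s
high-availability.kubernetes.leader-election.renew-deadline: 60s
high-availability.kubernetes.leader-election.retry-period: 10s
```

The trade-off with longer lease durations is slower failover when a JobManager genuinely dies, so these would need to be balanced against the job's recovery-time requirements.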
