Re: JM cannot recover with Kubernetes HA

2021-05-31 Thread Yang Wang
When the APIServer or etcd of your K8s cluster is under heavy load, the fabric8 Kubernetes client might time out when watching, renewing, or getting the ConfigMap. I think you could try increasing the read/connect timeout (default is 10s) of the HTTP client. env.java.opts:

Re: JM cannot recover with Kubernetes HA

2021-05-28 Thread Matthias Pohl
Hi Enrique, thanks for reaching out to the community. I'm not 100% sure what problem you're facing. The log messages you're sharing could mean that the Flink cluster is still behaving normally, with some outages occurring and the HA functionality kicking in. The behavior you're seeing with leaders for the

Re: JM cannot recover with Kubernetes HA

2021-05-27 Thread Enrique
To add to my post: instead of using the pod IP for the `jobmanager.rpc.address` configuration, we start each JM pod with the fully qualified name `--host ..ns.svc:8081`, and this address gets persisted to the ConfigMaps. In some scenarios, the leader address in the ConfigMaps might differ. For

JM cannot recover with Kubernetes HA

2021-05-27 Thread Enrique
Hi All, I am on Flink 1.13.0. I have a Session cluster deployed with a StatefulSet + PVs, with HA configured, within a Kubernetes cluster. I have submitted jobs to it, and it all works fine. Most of my jobs are long-running, typically consuming data from Kafka. I have noticed that after some time all my