When I deployed an application cluster as in example [1], without any extra configuration, and then killed the TM, the submitted job moved to the "RESTARTING" state, a new TM was created, and the job was running again. This is different from the behavior I see when running a session cluster [2].
How can I enable TM HA for a session cluster?

[1] https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/overview/#application-deployments
[2] https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/overview/#session-cluster-deployments

On Sun, 17 Sep 2023 at 10:32, Krzysztof Chmielewski <krzysiek.chmielew...@gmail.com> wrote:

> Thank you,
> so in other words, to have TM HA on k8s I have to configure [1], correct?
>
> [1]
> https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/ha/kubernetes_ha/#kubernetes-ha-services
>
> On Sun, 17 Sep 2023 at 07:27, Chen Zhanghao <zhanghao.c...@outlook.com>
> wrote:
>
>> Hi Krzysztof,
>>
>> TM HA is handled by the Flink cluster itself and is beyond the K8s
>> operator's responsibility. Flink will try to recover a failed task as long
>> as the restart limit is not reached; otherwise the job will transition into
>> the terminal FAILED status. You may check the job restart strategy [1] for
>> more details.
>>
>> [1]
>> https://nightlies.apache.org/flink/flink-docs-release-1.17/docs/ops/state/task_failure_recovery/#restart-strategies
>>
>> Best,
>> Zhanghao Chen
>> ------------------------------
>> *From:* Krzysztof Chmielewski <krzysiek.chmielew...@gmail.com>
>> *Sent:* 17 September 2023, 7:58
>> *To:* user <user@flink.apache.org>
>> *Subject:* HA in k8s operator
>>
>> Hi community,
>> I would like to test the Flink k8s operator's HA capabilities for TM and JM
>> failover.
>>
>> The simple test I did for TM failover was as follows:
>> - run a Flink session cluster in native mode
>> - submit a FlinkSessionJob resource with SAVEPOINT upgrade mode
>> - kill the task manager pod
>>
>> It turns out that after I killed the TM, the k8s operator does not create a
>> new TM to replace the killed one. The job was canceled and landed
>> in Job Status -> Failed.
>>
>> I was under the impression that no extra configuration is needed for TM HA.
>> I have found [1] and [2], but I'm not sure whether they apply to JM failover
>> only or to both TM and JM. It is also not clear to me whether I still need
>> to configure [1] when using the Flink k8s operator.
>>
>> [1]
>> https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/ha/kubernetes_ha/#kubernetes-ha-services
>> [2]
>> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/operations/configuration/#leader-election-and-high-availability
>>
>> Regards,
>> Krzysztof Chmielewski
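
For reference, here is a minimal sketch of the kind of session-cluster resource this thread is about, with an explicit restart strategy set as suggested in the quoted reply. It is untested and rests on assumptions: the resource name, image tag, service account, memory/CPU values, and the attempt and delay numbers are all placeholders, and it is only a guess that the restart strategy is what differs between the application and session setups. The keys under `flinkConfiguration` are the ones documented for Flink 1.17 in the restart-strategies page linked above.

```yaml
# Sketch only: a session-cluster FlinkDeployment (no `job` section), with a
# fixed-delay restart strategy so that jobs are restarted after a TM loss
# instead of failing immediately. All concrete values are placeholders.
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: session-cluster-example   # hypothetical name
spec:
  image: flink:1.17
  flinkVersion: v1_17
  serviceAccount: flink           # assumes the operator's default service account
  flinkConfiguration:
    taskmanager.numberOfTaskSlots: "2"
    # Restart strategy (see the task failure recovery docs linked in the reply).
    restart-strategy: fixed-delay
    restart-strategy.fixed-delay.attempts: "10"
    restart-strategy.fixed-delay.delay: "10 s"
  jobManager:
    resource:
      memory: "2048m"
      cpu: 1
  taskManager:
    resource:
      memory: "2048m"
      cpu: 1
```

A FlinkSessionJob would then reference this cluster by name through `spec.deploymentName`. JM HA (Kubernetes HA services, link [1] in the original message) is a separate setting and is deliberately left out of this sketch.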