Hi Krzysztof,

TM failure recovery is handled by the Flink cluster itself and is beyond the K8s operator's responsibility. Flink will try to recover a failed task as long as the restart limit is not reached; otherwise the job will transition into the terminal FAILED status. You may check the job restart strategy [1] for more details.
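For example, a fixed-delay restart strategy can be set in the Flink configuration (a sketch; the attempt count and delay values below are illustrative, not recommendations):

```yaml
# flink-conf.yaml (or the flinkConfiguration section of the operator's CR)
# Restart a failed job up to 3 times, waiting 10 s between attempts.
# After the 3rd failed attempt the job goes to the terminal FAILED state.
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: "3"
restart-strategy.fixed-delay.delay: 10 s
```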
[1] https://nightlies.apache.org/flink/flink-docs-release-1.17/docs/ops/state/task_failure_recovery/#restart-strategies

Best,
Zhanghao Chen

________________________________
From: Krzysztof Chmielewski <krzysiek.chmielew...@gmail.com>
Sent: September 17, 2023 7:58
To: user <user@flink.apache.org>
Subject: HA in k8s operator

Hi community,
I would like to test the Flink k8s operator's HA capabilities for TM and JM failover. The simple test I did for TM failover was as follows:
- run a Flink session cluster in native mode
- submit a FlinkSessionJob resource with SAVEPOINT upgrade mode
- kill the task manager pod

It turns out that after I killed the TM, the k8s operator did not create a new TM to replace the killed one. The job was canceled and landed in Job Status -> Failed.

I had the impression that no extra configuration is needed for TM HA. I have found [1] and [2], but I'm not sure whether they apply to JM failover only or to both TM and JM. It is also not clear to me whether, when using the Flink k8s operator, I still need to configure [1].

[1] https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/ha/kubernetes_ha/#kubernetes-ha-services
[2] https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/operations/configuration/#leader-election-and-high-availability

Regards,
Krzysztof Chmielewski
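For context, enabling the Kubernetes HA services referenced in [1] above (which cover JobManager failover, not TaskManager recovery) typically amounts to settings along these lines (a sketch; the storage path is a placeholder, not a real bucket):

```yaml
# In the Flink configuration (e.g. flinkConfiguration of the deployment CR):
# Use Kubernetes-based leader election and HA metadata storage.
high-availability: kubernetes
# Durable storage for JobManager metadata (job graphs, checkpoint pointers).
# The path below is a placeholder.
high-availability.storageDir: s3://my-bucket/flink-ha
```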