Hi Krzysztof,

TM HA is handled by the Flink cluster itself and is beyond the K8s operator's 
responsibility. Flink will try to recover a failed task as long as the restart 
limit has not been reached; otherwise the job will transition into the terminal 
FAILED status. You may check the job restart strategies [1] for more details.

[1] 
https://nightlies.apache.org/flink/flink-docs-release-1.17/docs/ops/state/task_failure_recovery/#restart-strategies
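
For illustration, a minimal fixed-delay example in flink-conf.yaml could look 
roughly like the following sketch (the attempt count and delay are placeholder 
values, not recommendations):

  restart-strategy: fixed-delay
  restart-strategy.fixed-delay.attempts: 3   # restart attempts before the job goes to FAILED
  restart-strategy.fixed-delay.delay: 10 s   # wait between restart attempts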

Best,
Zhanghao Chen
________________________________
From: Krzysztof Chmielewski <krzysiek.chmielew...@gmail.com>
Sent: September 17, 2023 7:58
To: user <user@flink.apache.org>
Subject: HA in k8s operator

Hi community,
I would like to test the Flink k8s operator's HA capabilities for TM and JM 
failover.

The simple test I did for TM failover was as follows:
- run Flink session cluster in native mode
- submit a FlinkSessionJob resource with SAVEPOINT upgrade mode (sketched below).
- kill task manager pod
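
For context, the FlinkSessionJob I submitted looked roughly like the sketch 
below (the names and the jarURI are placeholders, not the actual values I used):

  apiVersion: flink.apache.org/v1beta1
  kind: FlinkSessionJob
  metadata:
    name: example-session-job                      # placeholder name
  spec:
    deploymentName: example-session-cluster        # session cluster the job is submitted to
    job:
      jarURI: https://example.com/example-job.jar  # placeholder jar location
      parallelism: 2
      upgradeMode: savepoint                       # the SAVEPOINT upgrade mode mentioned above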

It turns out that after I killed the TM, the k8s operator did not create a new TM 
to replace the killed one. The job was canceled and ended up in Job 
Status -> FAILED.

I was under the impression that no extra configuration is needed for TM HA.
I have found [1] and [2], but I'm not sure whether they apply to JM failover only 
or to both TM and JM. It is also not clear to me whether, when using the Flink 
k8s operator, I still need to configure [1]?

[1] 
https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/ha/kubernetes_ha/#kubernetes-ha-services
[2] 
https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/operations/configuration/#leader-election-and-high-availability
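
In case it helps, the HA-related keys from [1] that I would set (via 
spec.flinkConfiguration when using the operator), if they turn out to be needed 
at all, would look roughly like this — the storage path and cluster id are 
placeholders:

  high-availability: kubernetes                        # newer docs name this key high-availability.type
  high-availability.storageDir: s3://flink-bucket/ha   # placeholder; any durable storage location
  kubernetes.cluster-id: my-session-cluster            # placeholder cluster id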

Regards,
Krzysztof Chmielewski

