Hi Chen,

I now see what you were trying to tell me. The problem was on my end, sorry for that. The job I was using for the session cluster had NoRestart() set as its restart strategy, whereas the application cluster was executing a job with a "proper" restart strategy.
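For reference, a minimal sketch of what a cluster-level restart strategy could look like in the Flink configuration. It assumes the Flink 1.17-style configuration keys; the attempt count and delay are illustrative values, not settings taken from this thread. With the Kubernetes operator, such keys would typically go under `spec.flinkConfiguration` of the deployment resource.

```yaml
# Sketch only: cluster-level default restart strategy (Flink 1.17-style keys).
# The attempt count and delay below are illustrative assumptions.
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 3
restart-strategy.fixed-delay.delay: 10 s
```

Note that a restart strategy set in the job code itself takes precedence over the cluster default, which is consistent with a job using NoRestart() failing after a TM loss regardless of what the cluster configuration says.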
Thanks.
Krzysztof Chmielewski

On Sun, 17 Sep 2023 at 12:02, Krzysztof Chmielewski <krzysiek.chmielew...@gmail.com> wrote:

> The thing is that when I deployed an application cluster like in
> example [1] without any extra configuration and then killed the TM,
> the submitted job moved to the "RESTARTING" state, then a new TM was
> created, after which the job was running again. This is different from
> the behavior I see when I'm running a session cluster [2].
>
> How can I enable TM HA for a session cluster?
>
> [1]
> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/overview/#application-deployments
> [2]
> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/overview/#session-cluster-deployments
>
> On Sun, 17 Sep 2023 at 10:32, Krzysztof Chmielewski <krzysiek.chmielew...@gmail.com> wrote:
>
>> Thank you,
>> so in other words, to have TM HA on k8s I have to configure [1], correct?
>>
>> [1]
>> https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/ha/kubernetes_ha/#kubernetes-ha-services
>>
>> On Sun, 17 Sep 2023 at 07:27, Chen Zhanghao <zhanghao.c...@outlook.com> wrote:
>>
>>> Hi Krzysztof,
>>>
>>> TM HA is handled by the Flink cluster itself and is beyond the K8s
>>> operator's responsibility. Flink will try to recover a failed task as
>>> long as the restart limit is not reached; otherwise the job will
>>> transition into the terminal FAILED status. You may check the job
>>> restart strategy [1] for more details.
>>>
>>> [1]
>>> https://nightlies.apache.org/flink/flink-docs-release-1.17/docs/ops/state/task_failure_recovery/#restart-strategies
>>>
>>> Best,
>>> Zhanghao Chen
>>> ------------------------------
>>> *From:* Krzysztof Chmielewski <krzysiek.chmielew...@gmail.com>
>>> *Sent:* 17 September 2023, 7:58
>>> *To:* user <user@flink.apache.org>
>>> *Subject:* HA in k8s operator
>>>
>>> Hi community,
>>> I would like to test the Flink k8s operator's HA capabilities for TM
>>> and JM failover.
>>>
>>> The simple test I did for TM failover was as follows:
>>> - run a Flink session cluster in native mode
>>> - submit a FlinkSessionJob resource with SAVEPOINT upgrade mode
>>> - kill the task manager pod
>>>
>>> It turns out that after I killed the TM, the k8s operator does not
>>> create a new TM to replace the killed one. The job was canceled and
>>> landed in Job Status -> Failed.
>>>
>>> I was under the impression that no extra configuration is needed for
>>> TM HA. I have found [1] and [2], but I'm not sure whether they apply
>>> to JM failover only or to both TM and JM. It is also not clear to me
>>> whether I still need to configure [1] when using the Flink k8s
>>> operator.
>>>
>>> [1]
>>> https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/ha/kubernetes_ha/#kubernetes-ha-services
>>> [2]
>>> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/operations/configuration/#leader-election-and-high-availability
>>>
>>> Regards,
>>> Krzysztof Chmielewski