Re: HA in k8s operator

Krzysztof Chmielewski Sun, 17 Sep 2023 03:02:59 -0700

The thing is that when I deployed an application cluster as in example
[1] without any extra configuration and then killed the TM, the submitted job
moved to the "RESTARTING" state, a new TM was created, and the job
was running again. This is different from the behavior I see when running
a session cluster [2].


How can I enable TM HA for a session cluster?

[1]
https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/overview/#application-deployments
[2]
https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/overview/#session-cluster-deployments
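
For reference, the restart behavior discussed in this thread (a job moves to RESTARTING after a task failure and ends in FAILED once the restart limit is reached) can be illustrated with a small simulation. This is only a minimal sketch and not Flink code: the class `FixedAttemptsRestartPolicy`, the function `simulate_job`, and their parameters are hypothetical names introduced for illustration, and the state machine is deliberately simplified.

```python
from dataclasses import dataclass


@dataclass
class FixedAttemptsRestartPolicy:
    """Toy model of a restart limit (hypothetical, not a Flink class)."""
    max_attempts: int


def simulate_job(policy, task_failures):
    """Return the job states visited while handling `task_failures` failures."""
    states = ["RUNNING"]
    restarts_used = 0
    for _ in range(task_failures):
        if restarts_used >= policy.max_attempts:
            # Restart limit reached: the job ends in the terminal state.
            states.append("FAILED")
            return states
        # A restart is still allowed: recover and continue running.
        restarts_used += 1
        states.append("RESTARTING")
        states.append("RUNNING")
    return states


if __name__ == "__main__":
    policy = FixedAttemptsRestartPolicy(max_attempts=2)
    print(simulate_job(policy, task_failures=1))
    print(simulate_job(policy, task_failures=3))
```

In this toy model a limit of zero means the first task failure sends the job straight to FAILED, while any positive limit lets the job pass through RESTARTING back to RUNNING until the limit is used up.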



On Sun, 17 Sep 2023 at 10:32, Krzysztof Chmielewski <
[email protected]> wrote:

> Thank you,
> so in other words, to have TM HA on k8s I have to configure [1], correct?
>
> [1]
> https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/ha/kubernetes_ha/#kubernetes-ha-services
>
> On Sun, 17 Sep 2023 at 07:27, Chen Zhanghao <[email protected]>
> wrote:
>
>> Hi Krzysztof,
>>
>> TM HA is handled by the Flink cluster itself and is beyond the K8s
>> operator's responsibility. Flink will try to recover a failed task as long
>> as the restart limit has not been reached; otherwise the job transitions into
>> the terminal FAILED status. You may check the job restart strategy [1] for more
>> details.
>>
>> [1]
>> https://nightlies.apache.org/flink/flink-docs-release-1.17/docs/ops/state/task_failure_recovery/#restart-strategies
>>
>> Best,
>> Zhanghao Chen
>> ------------------------------
>> *From:* Krzysztof Chmielewski <[email protected]>
>> *Sent:* 17 September 2023 7:58
>> *To:* user <[email protected]>
>> *Subject:* HA in k8s operator
>>
>> Hi community,
>> I would like to test flink k8s operator's HA capabilities for TM and JM
>> failover.
>>
>> The simple test I did for TM failover was as follows:
>> - run Flink session cluster in native mode
>> - submit FlinkSessionJob resource with SAVEPOINT upgrade mode.
>> - kill task manager pod
>>
>> It turns out that after I killed the TM, the k8s operator did not create a
>> new TM to replace the killed one. The job was canceled and ended up
>> in Job Status -> Failed.
>>
>> I was under the impression that no extra configuration is needed for TM HA.
>> I have found [1] and [2], but I'm not sure whether they apply to JM failover
>> only or to both TM and JM. It is also not clear to me whether I still need
>> to configure [1] when using the Flink k8s operator.
>>
>> [1]
>> https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/ha/kubernetes_ha/#kubernetes-ha-services
>> [2]
>> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/operations/configuration/#leader-election-and-high-availability
>>
>> Regards,
>> Krzysztof Chmielewski
>>
>>
>>
