Re: HA in k8s operator

Krzysztof Chmielewski Sun, 17 Sep 2023 12:59:21 -0700

Hi Chen,
I now see what you were trying to tell me.

The problem was on my end, sorry for that. The job I was using for the
session cluster had NoRestart() set as its restart strategy, whereas the
application cluster was executing a job with a "proper" restart strategy.
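
For anyone who finds this thread later, here is a minimal sketch of why the
two jobs behaved differently after the TM was killed. It is a toy model in
plain Java, not Flink code: the class RestartModel, its methods, and the
maxRestartAttempts parameter are hypothetical names made up for illustration.
It only mirrors the rule Chen describes in the quoted reply (restart while the
limit is not reached, otherwise go to the terminal FAILED status), with a limit
of 0 standing in for NoRestart().

```java
// Toy model only: NOT Flink's implementation. All names here are hypothetical.
final class RestartModel {
    enum JobState { RUNNING, RESTARTING, FAILED }

    // Assumption: 0 allowed restarts stands in for a "no restart" strategy.
    private final int maxRestartAttempts;
    private int restartsUsed = 0;
    private JobState state = JobState.RUNNING;

    RestartModel(int maxRestartAttempts) {
        if (maxRestartAttempts < 0) {
            throw new IllegalArgumentException("maxRestartAttempts must be >= 0");
        }
        this.maxRestartAttempts = maxRestartAttempts;
    }

    // A task failed, for example because its TaskManager pod was killed.
    JobState onTaskFailure() {
        if (state == JobState.FAILED) {
            return state; // FAILED is terminal in this model
        }
        if (restartsUsed < maxRestartAttempts) {
            restartsUsed++;
            state = JobState.RESTARTING;
        } else {
            state = JobState.FAILED;
        }
        return state;
    }

    // The tasks were redeployed, for example onto a replacement TaskManager.
    JobState onTasksRedeployed() {
        if (state == JobState.RESTARTING) {
            state = JobState.RUNNING;
        }
        return state;
    }

    public static void main(String[] args) {
        RestartModel noRestart = new RestartModel(0);
        System.out.println("no restart: " + noRestart.onTaskFailure());

        RestartModel limited = new RestartModel(3);
        System.out.println("3 attempts: " + limited.onTaskFailure()
                + " -> " + limited.onTasksRedeployed());
    }
}
```

With a limit of 0 the first failure goes straight to FAILED, which matches
what I saw on the session cluster; with a non-zero limit the job passes
through RESTARTING and runs again once its tasks are redeployed.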


Thanks.
Krzysztof Chmielewski

On Sun, 17 Sep 2023 at 12:02, Krzysztof Chmielewski <
[email protected]> wrote:

> The thing is that when I deployed an application cluster as in
> example [1] without any extra configuration and then killed the TM,
> the submitted job moved to the "RESTARTING" state, a new TM was created,
> and the job was running again. This is different from the behavior I see
> when I'm running a session cluster [2].
>
> How can I enable TM HA for a session cluster?
>
> [1]
> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/overview/#application-deployments
> [2]
> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/overview/#session-cluster-deployments
>
>
>
> On Sun, 17 Sep 2023 at 10:32, Krzysztof Chmielewski <
> [email protected]> wrote:
>
>> Thank you,
>> so in other words, to have TM HA on k8s I have to configure [1], correct?
>>
>> [1]
>> https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/ha/kubernetes_ha/#kubernetes-ha-services
>>
>> On Sun, 17 Sep 2023 at 07:27, Chen Zhanghao <[email protected]>
>> wrote:
>>
>>> Hi Krzysztof,
>>>
>>> TM HA is handled by the Flink cluster itself and is beyond the K8s
>>> operator's responsibility. Flink will try to recover a failed task as long
>>> as the restart limit has not been reached; otherwise the job transitions
>>> into the terminal FAILED status. You may check the job restart strategy [1]
>>> for more details.
>>>
>>> [1]
>>> https://nightlies.apache.org/flink/flink-docs-release-1.17/docs/ops/state/task_failure_recovery/#restart-strategies
>>>
>>> Best,
>>> Zhanghao Chen
>>> ------------------------------
>>> *From:* Krzysztof Chmielewski <[email protected]>
>>> *Sent:* 17 September 2023 7:58
>>> *To:* user <[email protected]>
>>> *Subject:* HA in k8s operator
>>>
>>> Hi community,
>>> I would like to test the Flink k8s operator's HA capabilities for TM and JM
>>> failover.
>>>
>>> The simple test I did for TM failover was as follows:
>>> - run a Flink session cluster in native mode
>>> - submit a FlinkSessionJob resource with SAVEPOINT upgrade mode
>>> - kill the task manager pod
>>>
>>> It turns out that after I killed the TM, the k8s operator did not create a
>>> new TM to replace the killed one. The job was canceled and ended up
>>> in Job Status -> Failed.
>>>
>>> I was under the impression that no extra configuration is needed for TM HA.
>>> I have found [1] and [2], but I'm not sure whether they apply to JM failover
>>> only or to both TM and JM. It is also not clear to me whether I still need
>>> to configure [1] when using the Flink k8s operator.
>>>
>>> [1]
>>> https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/ha/kubernetes_ha/#kubernetes-ha-services
>>> [2]
>>> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/operations/configuration/#leader-election-and-high-availability
>>>
>>> Regards,
>>> Krzysztof Chmielewski
>>>
>>>
>>>
