Hi Eleanore,

I think you are using the K8s resource "Job" to deploy the jobmanager. Please
set .spec.template.spec.restartPolicy = "Never" and .spec.backoffLimit = 0.
Refer to [1] for more information.

Then, when the jobmanager fails for any reason, the K8s Job will be marked
as failed, and K8s will not restart it again.
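Concretely, a minimal sketch of the relevant fields in such a Job manifest
(the name, image, and args below are placeholders, not taken from this thread):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: flink-jobmanager        # hypothetical name
spec:
  backoffLimit: 0               # do not re-create the pod after a failure
  template:
    spec:
      restartPolicy: Never      # do not restart the container in place
      containers:
        - name: jobmanager
          image: my-flink:1.8.2        # placeholder image
          args: ["job-cluster"]        # placeholder per-job entrypoint args
```

With both settings, a failed jobmanager leaves the Job in a Failed state
instead of being retried by Kubernetes.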

[1].
https://kubernetes.io/docs/concepts/workloads/controllers/job/#job-termination-and-cleanup


Best,
Yang

Eleanore Jin <eleanore....@gmail.com> wrote on Tue, Aug 4, 2020, 12:05 AM:

> Hi Till,
>
> Thanks for the reply!
>
> I manually deploy in per-job mode [1] and I am using Flink 1.8.2.
> Specifically, I build a custom Docker image into which I copied the app jar
> (not an uber jar) and all its dependencies under /flink/lib.
>
> So my question is more like: in this case, if the job is marked as FAILED,
> which causes k8s to restart the pod, this does not seem to help at all.
> What are the suggestions for such a scenario?
>
> Thanks a lot!
> Eleanore
>
> [1]
> https://ci.apache.org/projects/flink/flink-docs-release-1.8/ops/deployment/kubernetes.html#flink-job-cluster-on-kubernetes
>
> On Mon, Aug 3, 2020 at 2:13 AM Till Rohrmann <trohrm...@apache.org> wrote:
>
>> Hi Eleanore,
>>
>> how are you deploying Flink exactly? Are you using the application mode
>> with native K8s support to deploy a cluster [1] or are you manually
>> deploying in per-job mode [2]?
>>
>> I believe the problem might be that we terminate the Flink process with a
>> non-zero exit code if the job reaches the ApplicationStatus.FAILED [3].
>>
>> cc Yang Wang: have you observed similar behavior when running Flink in
>> per-job mode on K8s?
>>
>> [1]
>> https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/deployment/native_kubernetes.html#flink-kubernetes-application
>> [2]
>> https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/deployment/kubernetes.html#job-cluster-resource-definitions
>> [3]
>> https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/clusterframework/ApplicationStatus.java#L32
>>
>> On Fri, Jul 31, 2020 at 6:26 PM Eleanore Jin <eleanore....@gmail.com>
>> wrote:
>>
>>> Hi Experts,
>>>
>>> I have a Flink cluster (per-job mode) running on Kubernetes. The job is
>>> configured with the restart strategy
>>>
>>> restart-strategy.fixed-delay.attempts: 3
>>> restart-strategy.fixed-delay.delay: 10 s
>>>
>>>
>>> So after 3 retry attempts, the job is marked as FAILED and the pods stop
>>> running. However, Kubernetes then restarts the job again because the
>>> available replicas do not match the desired count.
>>>
>>> I wonder what are the suggestions for such a scenario? How should I
>>> configure the flink job running on k8s?
>>>
>>> Thanks a lot!
>>> Eleanore
>>>
>>