Hi Eleanore,

Yes, I suggest using a K8s Job instead of a Deployment. A Job can run the
jobmanager once and finish after a successful/failed completion.
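For illustration, a minimal Job manifest along these lines might look as
follows. This is only a sketch: the image name, job classname, and args are
placeholders, not something from your setup.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: flink-jobmanager
spec:
  backoffLimit: 0            # do not retry the pod after a failure
  template:
    metadata:
      labels:
        app: flink
        component: jobmanager
    spec:
      restartPolicy: Never   # let the Job record success/failure instead of restarting the container
      containers:
        - name: jobmanager
          # placeholder image with the app jar and dependencies under /flink/lib
          image: my-flink-job:latest
          # placeholder entrypoint args for a job-cluster style image
          args: ["job-cluster", "--job-classname", "com.example.MyJob"]
          ports:
            - containerPort: 6123   # RPC
            - containerPort: 8081   # web UI
```

With restartPolicy: Never and backoffLimit: 0, a jobmanager pod that exits is
not restarted by K8s; the Job is simply marked Complete or Failed.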
However, using a Job still does not solve your problem completely. As Till
said, when a job exhausts the restart strategy, the jobmanager pod terminates
with a non-zero exit code, which causes K8s to restart it again. Even though
we could set restartPolicy and backoffLimit, this is not a clean and correct
way to go. We should terminate the jobmanager process with a zero exit code in
such a situation.

@Till Rohrmann <trohrm...@apache.org> I just have one concern: is this a
special case for the K8s deployment? For standalone/Yarn/Mesos, terminating
with a non-zero exit code seems harmless.

Best,
Yang

On Tue, Aug 4, 2020 at 11:54 PM Eleanore Jin <eleanore....@gmail.com> wrote:

> Hi Yang & Till,
>
> Thanks for your prompt reply!
>
> Yang, regarding your question, I am actually not using a k8s Job, as I put
> my app.jar and its dependencies under flink's lib directory. I have one k8s
> deployment for the job manager, one k8s deployment for the task manager,
> and one k8s service for the job manager.
>
> As you mentioned above, if the flink job is marked as FAILED, it will
> cause the job manager pod to be restarted, which is not the ideal behavior.
>
> Do you suggest that I change the deployment strategy from a k8s deployment
> to a k8s Job? In case the flink program exits with a non-zero code (e.g.
> after exhausting the configured number of restarts), the pod can be marked
> as complete and hence the job is not restarted again?
>
> Thanks a lot!
> Eleanore
>
> On Tue, Aug 4, 2020 at 2:49 AM Yang Wang <danrtsey...@gmail.com> wrote:
>
>> @Till Rohrmann <trohrm...@apache.org> In native mode, when a Flink
>> application terminates in FAILED state, all the resources will be
>> cleaned up.
>>
>> However, in standalone mode, I agree with you that we need to rethink
>> the exit code of Flink. When a job exhausts the restart strategy, we
>> should terminate the pod and not restart it again. After googling, it
>> seems that we cannot specify the restartPolicy based on the exit
>> code [1].
>> So maybe we need to return a zero exit code to avoid the restart by K8s.
>>
>> [1].
>> https://stackoverflow.com/questions/48797297/is-it-possible-to-define-restartpolicy-based-on-container-exit-code
>>
>> Best,
>> Yang
>>
>> On Tue, Aug 4, 2020 at 3:48 PM Till Rohrmann <trohrm...@apache.org> wrote:
>>
>>> @Yang Wang <danrtsey...@gmail.com> I believe that we should rethink the
>>> exit codes of Flink. In general you want K8s to restart a failed Flink
>>> process. Hence, an application which terminates in state FAILED should
>>> not return a non-zero exit code, because it is a valid termination state.
>>>
>>> Cheers,
>>> Till
>>>
>>> On Tue, Aug 4, 2020 at 8:55 AM Yang Wang <danrtsey...@gmail.com> wrote:
>>>
>>>> Hi Eleanore,
>>>>
>>>> I think you are using the K8s resource "Job" to deploy the jobmanager.
>>>> Please set .spec.template.spec.restartPolicy = "Never" and
>>>> spec.backoffLimit = 0. Refer here [1] for more information.
>>>>
>>>> Then, when the jobmanager fails for any reason, the K8s job will be
>>>> marked failed, and K8s will not restart it again.
>>>>
>>>> [1].
>>>> https://kubernetes.io/docs/concepts/workloads/controllers/job/#job-termination-and-cleanup
>>>>
>>>> Best,
>>>> Yang
>>>>
>>>> On Tue, Aug 4, 2020 at 12:05 AM Eleanore Jin <eleanore....@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Till,
>>>>>
>>>>> Thanks for the reply!
>>>>>
>>>>> I manually deploy in per-job mode [1], and I am using Flink 1.8.2.
>>>>> Specifically, I build a custom docker image into which I copied the
>>>>> app jar (not an uber jar) and all its dependencies under /flink/lib.
>>>>>
>>>>> So my question is more like: in this case, if the job is marked as
>>>>> FAILED, which causes k8s to restart the pod, this does not seem to
>>>>> help at all. What are the suggestions for such a scenario?
>>>>>
>>>>> Thanks a lot!
>>>>> Eleanore
>>>>>
>>>>> [1]
>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.8/ops/deployment/kubernetes.html#flink-job-cluster-on-kubernetes
>>>>>
>>>>> On Mon, Aug 3, 2020 at 2:13 AM Till Rohrmann <trohrm...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> Hi Eleanore,
>>>>>>
>>>>>> how are you deploying Flink exactly? Are you using the application
>>>>>> mode with native K8s support to deploy a cluster [1], or are you
>>>>>> manually deploying in per-job mode [2]?
>>>>>>
>>>>>> I believe the problem might be that we terminate the Flink process
>>>>>> with a non-zero exit code if the job reaches
>>>>>> ApplicationStatus.FAILED [3].
>>>>>>
>>>>>> cc Yang Wang: have you observed a similar behavior when running
>>>>>> Flink in per-job mode on K8s?
>>>>>>
>>>>>> [1]
>>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/deployment/native_kubernetes.html#flink-kubernetes-application
>>>>>> [2]
>>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/deployment/kubernetes.html#job-cluster-resource-definitions
>>>>>> [3]
>>>>>> https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/clusterframework/ApplicationStatus.java#L32
>>>>>>
>>>>>> On Fri, Jul 31, 2020 at 6:26 PM Eleanore Jin <eleanore....@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Experts,
>>>>>>>
>>>>>>> I have a flink cluster (per-job mode) running on kubernetes. The job
>>>>>>> is configured with a restart strategy:
>>>>>>>
>>>>>>> restart-strategy.fixed-delay.attempts: 3
>>>>>>> restart-strategy.fixed-delay.delay: 10 s
>>>>>>>
>>>>>>> So after 3 retries, the job will be marked as FAILED, and hence the
>>>>>>> pods are no longer running. However, kubernetes will then restart
>>>>>>> the job again, as the available replicas do not match the desired
>>>>>>> number.
>>>>>>>
>>>>>>> I wonder what the suggestions are for such a scenario? How should I
>>>>>>> configure the flink job running on k8s?
>>>>>>>
>>>>>>> Thanks a lot!
>>>>>>> Eleanore