You are right, Yang Wang.

Thanks for creating this issue.

Cheers,
Till

On Wed, Aug 5, 2020 at 1:33 PM Yang Wang <danrtsey...@gmail.com> wrote:

> Actually, the application status shown in the YARN web UI is not determined
> by the jobmanager process exit code.
> Instead, we use "resourceManagerClient.unregisterApplicationMaster" to
> control the final status of the YARN application.
> So even though the jobmanager exits with a zero exit code, it could still
> show a failed status in the YARN web UI.
>
> I have created a ticket to track this improvement[1].
>
> [1]. https://issues.apache.org/jira/browse/FLINK-18828
>
>
> Best,
> Yang
>
>
> On Wed, Aug 5, 2020 at 3:56 PM Till Rohrmann <trohrm...@apache.org> wrote:
>
>> Yes, for the other deployments it is not a problem. A reason why people
>> preferred non-zero exit codes in case of FAILED jobs is that this is easier
>> to monitor than having to take a look at the actual job result. Moreover,
>> in the YARN web UI the application shows as failed if I am not mistaken.
>> However, from a framework's perspective, a FAILED job does not mean that
>> Flink has failed and, hence, the return code could still be 0 in my opinion.
>>
>> Cheers,
>> Till
>>
>> On Wed, Aug 5, 2020 at 9:30 AM Yang Wang <danrtsey...@gmail.com> wrote:
>>
>>> Hi Eleanore,
>>>
>>> Yes, I suggest using a Job to replace the Deployment. It could be used to
>>> run the jobmanager once and finish after a successful/failed completion.
>>>
>>> However, using a Job still could not solve your problem completely. As
>>> Till said, when a job exhausts the restart strategy, the jobmanager pod
>>> will terminate with a non-zero exit code, which will cause K8s to restart
>>> it again. Even though we could set the restartPolicy and backoffLimit,
>>> this is not a clean and correct way to go. We should terminate the
>>> jobmanager process with a zero exit code in such a situation.
>>>
>>> @Till Rohrmann <trohrm...@apache.org> I just have one concern. Is this a
>>> special case for the K8s deployment? For standalone/YARN/Mesos, it seems
>>> that terminating with a non-zero exit code is harmless.
>>>
>>>
>>> Best,
>>> Yang
>>>
>>> On Tue, Aug 4, 2020 at 11:54 PM Eleanore Jin <eleanore....@gmail.com> wrote:
>>>
>>>> Hi Yang & Till,
>>>>
>>>> Thanks for your prompt reply!
>>>>
>>>> Yang, regarding your question, I am actually not using a k8s Job, as I
>>>> put my app.jar and its dependencies under Flink's lib directory. I have 1
>>>> k8s Deployment for the job manager, 1 k8s Deployment for the task
>>>> manager, and 1 k8s Service for the job manager.
>>>>
>>>> As you mentioned above, if the Flink job is marked as FAILED, it will
>>>> cause the job manager pod to be restarted, which is not the ideal
>>>> behavior.
>>>>
>>>> Do you suggest that I change the deployment strategy from a k8s
>>>> Deployment to a k8s Job? That way, in case the Flink program exits with a
>>>> non-zero code (e.g. after exhausting the configured number of restarts),
>>>> the pod can be marked as complete and the job is not restarted again?
>>>>
>>>> Thanks a lot!
>>>> Eleanore
>>>>
>>>> On Tue, Aug 4, 2020 at 2:49 AM Yang Wang <danrtsey...@gmail.com> wrote:
>>>>
>>>>> @Till Rohrmann <trohrm...@apache.org> In native mode, when a Flink
>>>>> application terminates in the FAILED state, all the resources will be
>>>>> cleaned up.
>>>>>
>>>>> However, in standalone mode, I agree with you that we need to rethink
>>>>> the exit code of Flink. When a job exhausts the restart strategy, we
>>>>> should terminate the pod and not restart it again. After googling, it
>>>>> seems that we cannot specify the restartPolicy based on the exit
>>>>> code [1]. So maybe we need to return a zero exit code to avoid the
>>>>> restart by K8s.
>>>>>
>>>>> [1].
>>>>> https://stackoverflow.com/questions/48797297/is-it-possible-to-define-restartpolicy-based-on-container-exit-code
>>>>>
>>>>> Best,
>>>>> Yang
>>>>>
>>>>> On Tue, Aug 4, 2020 at 3:48 PM Till Rohrmann <trohrm...@apache.org> wrote:
>>>>>
>>>>>> @Yang Wang <danrtsey...@gmail.com> I believe that we should rethink the
>>>>>> exit codes of Flink. In general, you want K8s to restart a failed Flink
>>>>>> process. Hence, an application which terminates in state FAILED should
>>>>>> not return a non-zero exit code, because it is a valid termination
>>>>>> state.
>>>>>>
>>>>>> Cheers,
>>>>>> Till
>>>>>>
>>>>>> On Tue, Aug 4, 2020 at 8:55 AM Yang Wang <danrtsey...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Eleanore,
>>>>>>>
>>>>>>> I think you are using the K8s resource "Job" to deploy the jobmanager.
>>>>>>> Please set .spec.template.spec.restartPolicy = "Never" and
>>>>>>> spec.backoffLimit = 0.
>>>>>>> Refer to [1] for more information.
>>>>>>>
>>>>>>> Then, when the jobmanager fails for any reason, the K8s Job will be
>>>>>>> marked as failed, and K8s will not restart it again.
>>>>>>>
>>>>>>> [1].
>>>>>>> https://kubernetes.io/docs/concepts/workloads/controllers/job/#job-termination-and-cleanup
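>>>>>>>
>>>>>>> Just as a minimal sketch of what I mean (the name and image below are
>>>>>>> placeholders, not taken from your actual setup):
>>>>>>>
>>>>>>>   apiVersion: batch/v1
>>>>>>>   kind: Job
>>>>>>>   metadata:
>>>>>>>     name: flink-jobmanager          # hypothetical name
>>>>>>>   spec:
>>>>>>>     backoffLimit: 0                 # the Job controller never creates a replacement pod
>>>>>>>     template:
>>>>>>>       spec:
>>>>>>>         restartPolicy: Never        # the container is never restarted in place
>>>>>>>         containers:
>>>>>>>           - name: jobmanager
>>>>>>>             image: my-flink-image   # placeholder, use your own image
>>>>>>>             # command/args of your jobmanager entrypoint go here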
>>>>>>>
>>>>>>>
>>>>>>> Best,
>>>>>>> Yang
>>>>>>>
>>>>>>> On Tue, Aug 4, 2020 at 12:05 AM Eleanore Jin <eleanore....@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Till,
>>>>>>>>
>>>>>>>> Thanks for the reply!
>>>>>>>>
>>>>>>>> I manually deploy in per-job mode [1] and I am using Flink 1.8.2.
>>>>>>>> Specifically, I build a custom docker image into which I copy the app
>>>>>>>> jar (not an uber jar) and all its dependencies under /flink/lib.
>>>>>>>>
>>>>>>>> So my question is: in this case, if the job is marked as FAILED, which
>>>>>>>> causes k8s to restart the pod, this does not seem to help at all. What
>>>>>>>> are the suggestions for such a scenario?
>>>>>>>>
>>>>>>>> Thanks a lot!
>>>>>>>> Eleanore
>>>>>>>>
>>>>>>>> [1]
>>>>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.8/ops/deployment/kubernetes.html#flink-job-cluster-on-kubernetes
>>>>>>>>
>>>>>>>> On Mon, Aug 3, 2020 at 2:13 AM Till Rohrmann <trohrm...@apache.org>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Eleanore,
>>>>>>>>>
>>>>>>>>> how are you deploying Flink exactly? Are you using the application
>>>>>>>>> mode with native K8s support to deploy a cluster [1], or are you
>>>>>>>>> manually deploying in per-job mode [2]?
>>>>>>>>>
>>>>>>>>> I believe the problem might be that we terminate the Flink process
>>>>>>>>> with a non-zero exit code if the job reaches ApplicationStatus.FAILED
>>>>>>>>> [3].
>>>>>>>>>
>>>>>>>>> cc Yang Wang: have you observed similar behavior when running Flink
>>>>>>>>> in per-job mode on K8s?
>>>>>>>>>
>>>>>>>>> [1]
>>>>>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/deployment/native_kubernetes.html#flink-kubernetes-application
>>>>>>>>> [2]
>>>>>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/deployment/kubernetes.html#job-cluster-resource-definitions
>>>>>>>>> [3]
>>>>>>>>> https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/clusterframework/ApplicationStatus.java#L32
>>>>>>>>>
>>>>>>>>> On Fri, Jul 31, 2020 at 6:26 PM Eleanore Jin <
>>>>>>>>> eleanore....@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Experts,
>>>>>>>>>>
>>>>>>>>>> I have a Flink cluster (per-job mode) running on Kubernetes. The
>>>>>>>>>> job is configured with the following restart strategy:
>>>>>>>>>>
>>>>>>>>>> restart-strategy.fixed-delay.attempts: 3
>>>>>>>>>> restart-strategy.fixed-delay.delay: 10 s
>>>>>>>>>>
>>>>>>>>>> So after 3 retries, the job will be marked as FAILED and hence the
>>>>>>>>>> pods stop running. However, Kubernetes will then restart the job
>>>>>>>>>> again because the available replicas do not match the desired count.
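>>>>>>>>>>
>>>>>>>>>> For context, the jobmanager is deployed as a plain k8s Deployment
>>>>>>>>>> with a single replica, roughly like the sketch below (the names and
>>>>>>>>>> image are just placeholders):
>>>>>>>>>>
>>>>>>>>>>   apiVersion: apps/v1
>>>>>>>>>>   kind: Deployment
>>>>>>>>>>   metadata:
>>>>>>>>>>     name: flink-jobmanager         # placeholder name
>>>>>>>>>>   spec:
>>>>>>>>>>     replicas: 1                    # K8s keeps bringing the pod back to match this count
>>>>>>>>>>     selector:
>>>>>>>>>>       matchLabels:
>>>>>>>>>>         app: flink-jobmanager
>>>>>>>>>>     template:
>>>>>>>>>>       metadata:
>>>>>>>>>>         labels:
>>>>>>>>>>           app: flink-jobmanager
>>>>>>>>>>       spec:                        # Deployment pods always get restartPolicy: Always
>>>>>>>>>>         containers:
>>>>>>>>>>           - name: jobmanager
>>>>>>>>>>             image: my-flink-image  # placeholder image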
>>>>>>>>>>
>>>>>>>>>> I wonder what the suggestions are for such a scenario? How should I
>>>>>>>>>> configure the Flink job running on k8s?
>>>>>>>>>>
>>>>>>>>>> Thanks a lot!
>>>>>>>>>> Eleanore
>>>>>>>>>>
>>>>>>>>>
