Re: Behavior for flink job running on K8S failed after restart strategy exhausted

Yang Wang Thu, 06 Aug 2020 04:21:07 -0700

Hi Eleanore,

From my experience, collecting the Flink metrics in Prometheus via the
metrics reporter[1] is a better approach. It also makes alerts easier to
configure.
Maybe you could use "fullRestarts" or "numRestarts" to monitor job
restarts. More metrics can be found here[2].
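
As a minimal sketch of the Prometheus setup suggested above, the reporter can be enabled in flink-conf.yaml roughly as follows. The reporter name "prom" is an arbitrary label chosen for illustration, and the port is an assumption to adapt to your environment. Depending on the Flink version, the flink-metrics-prometheus jar may also need to be made available to Flink (for example by copying it into lib/).

```yaml
# flink-conf.yaml (sketch; "prom" is an arbitrary reporter name)
metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
# Port on which each JM/TM process exposes its metrics (assumed value)
metrics.reporter.prom.port: 9249
```

With this in place, Prometheus can scrape each pod directly, so no ingress per JM REST endpoint is needed, and an alert can be defined on the restart metrics.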


[1].
https://ci.apache.org/projects/flink/flink-docs-master/monitoring/metrics.html#prometheus-orgapacheflinkmetricsprometheusprometheusreporter
[2].
https://ci.apache.org/projects/flink/flink-docs-master/monitoring/metrics.html#availability

Best,
Yang

Eleanore Jin <[email protected]> wrote on Wed, Aug 5, 2020 at 11:52 PM:

> Hi Yang and Till,
>
> Thanks a lot for the help! I have a similar question to the one Till
> mentioned: if we do not fail Flink pods when the restart strategy is
> exhausted, it might be hard to monitor such failures. Today I get alerts if
> the k8s pods are restarted or in a crash loop, but if this is no longer the
> case, how can we handle the monitoring? In production, I have hundreds of
> small flink jobs running (2-8 TM pods) doing stateless processing, and it is
> really hard for us to expose an ingress for each JM rest endpoint to
> periodically query the status of each flink job.
>
> Thanks a lot!
> Eleanore
>
> On Wed, Aug 5, 2020 at 4:56 AM Till Rohrmann <[email protected]> wrote:
>
>> You are right Yang Wang.
>>
>> Thanks for creating this issue.
>>
>> Cheers,
>> Till
>>
>> On Wed, Aug 5, 2020 at 1:33 PM Yang Wang <[email protected]> wrote:
>>
>>> Actually, the application status shown in the YARN web UI is not determined
>>> by the jobmanager process exit code.
>>> Instead, we use "resourceManagerClient.unregisterApplicationMaster" to
>>> control the final status of the YARN application.
>>> So even if the jobmanager exits with a zero exit code, the application could
>>> still show a failed status in the YARN web UI.
>>>
>>> I have created a ticket to track this improvement[1].
>>>
>>> [1]. https://issues.apache.org/jira/browse/FLINK-18828
>>>
>>>
>>> Best,
>>> Yang
>>>
>>>
>>> Till Rohrmann <[email protected]> wrote on Wed, Aug 5, 2020 at 3:56 PM:
>>>
>>>> Yes for the other deployments it is not a problem. A reason why people
>>>> preferred non-zero exit codes in case of FAILED jobs is that this is easier
>>>> to monitor than having to take a look at the actual job result. Moreover,
>>>> in the YARN web UI the application shows as failed if I am not mistaken.
>>>> However, from a framework's perspective, a FAILED job does not mean that
>>>> Flink has failed and, hence, the return code could still be 0 in my 
>>>> opinion.
>>>>
>>>> Cheers,
>>>> Till
>>>>
>>>> On Wed, Aug 5, 2020 at 9:30 AM Yang Wang <[email protected]> wrote:
>>>>
>>>>> Hi Eleanore,
>>>>>
>>>>> Yes, I suggest using a Job instead of a Deployment. It can be used
>>>>> to run the jobmanager once and finish after a successful or failed
>>>>> completion.
>>>>>
>>>>> However, using a Job still does not solve your problem completely. Just
>>>>> as Till said, when a job exhausts the restart strategy, the jobmanager
>>>>> pod terminates with a non-zero exit code, which causes K8s to restart
>>>>> it. Even though we could set the restartPolicy and backoffLimit,
>>>>> this is not a clean and correct way to go. We should terminate the
>>>>> jobmanager process with a zero exit code in such a situation.
>>>>>
>>>>> @Till Rohrmann <[email protected]> I just have one concern. Is this
>>>>> a special case for the K8s deployment? For standalone/Yarn/Mesos, it seems
>>>>> that terminating with a non-zero exit code is harmless.
>>>>>
>>>>>
>>>>> Best,
>>>>> Yang
>>>>>
>>>>> Eleanore Jin <[email protected]> wrote on Tue, Aug 4, 2020 at 11:54 PM:
>>>>>
>>>>>> Hi Yang & Till,
>>>>>>
>>>>>> Thanks for your prompt reply!
>>>>>>
>>>>>> Yang, regarding your question, I am actually not using a k8s job, as I
>>>>>> put my app.jar and its dependencies under flink's lib directory. I have 1
>>>>>> k8s deployment for the job manager, 1 k8s deployment for the task manager,
>>>>>> and 1 k8s service for the job manager.
>>>>>>
>>>>>> As you mentioned above, if the flink job is marked as failed, it will
>>>>>> cause the job manager pod to be restarted, which is not the ideal
>>>>>> behavior.
>>>>>>
>>>>>> Do you suggest that I change the deployment strategy from a k8s
>>>>>> deployment to a k8s job, so that if the flink program exits with a
>>>>>> non-zero code (e.g. after exhausting the configured number of restarts),
>>>>>> the pod can be marked as complete and the job is not restarted again?
>>>>>>
>>>>>> Thanks a lot!
>>>>>> Eleanore
>>>>>>
>>>>>> On Tue, Aug 4, 2020 at 2:49 AM Yang Wang <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> @Till Rohrmann <[email protected]> In native mode, when a Flink
>>>>>>> application terminates with FAILED state, all the resources will be
>>>>>>> cleaned up.
>>>>>>>
>>>>>>> However, in standalone mode, I agree with you that we need to
>>>>>>> rethink the exit code of Flink. When a job exhausts the restart
>>>>>>> strategy, we should terminate the pod and not restart it again.
>>>>>>> After googling, it seems that we cannot specify the restartPolicy
>>>>>>> based on the exit code[1]. So maybe we need to return a zero exit code
>>>>>>> to avoid a restart by K8s.
>>>>>>>
>>>>>>> [1].
>>>>>>> https://stackoverflow.com/questions/48797297/is-it-possible-to-define-restartpolicy-based-on-container-exit-code
>>>>>>>
>>>>>>> Best,
>>>>>>> Yang
>>>>>>>
>>>>>>> Till Rohrmann <[email protected]> wrote on Tue, Aug 4, 2020 at 3:48 PM:
>>>>>>>
>>>>>>>> @Yang Wang <[email protected]> I believe that we should
>>>>>>>> rethink the exit codes of Flink. In general you want K8s to restart a
>>>>>>>> failed Flink process. Hence, an application which terminates in state
>>>>>>>> FAILED should not return a non-zero exit code because it is a valid
>>>>>>>> termination state.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Till
>>>>>>>>
>>>>>>>> On Tue, Aug 4, 2020 at 8:55 AM Yang Wang <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Eleanore,
>>>>>>>>>
>>>>>>>>> I think you are using the K8s resource "Job" to deploy the jobmanager.
>>>>>>>>> Please set .spec.template.spec.restartPolicy = "Never" and
>>>>>>>>> .spec.backoffLimit = 0.
>>>>>>>>> Refer here[1] for more information.
>>>>>>>>>
>>>>>>>>> Then, when the jobmanager fails for any reason, the K8s
>>>>>>>>> job will be marked as failed, and K8s will not restart the job again.
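>>>>>>>>>
>>>>>>>>> As a rough sketch, the two settings sit in the Job manifest as shown
>>>>>>>>> below. The Job name, container name, and image are hypothetical
>>>>>>>>> placeholders, and the jobmanager entrypoint arguments are omitted.
>>>>>>>>>
>>>>>>>>> ```yaml
>>>>>>>>> apiVersion: batch/v1
>>>>>>>>> kind: Job
>>>>>>>>> metadata:
>>>>>>>>>   name: flink-jobmanager            # hypothetical name
>>>>>>>>> spec:
>>>>>>>>>   backoffLimit: 0                   # do not retry the Job after a failure
>>>>>>>>>   template:
>>>>>>>>>     spec:
>>>>>>>>>       restartPolicy: Never          # do not restart the failed container
>>>>>>>>>       containers:
>>>>>>>>>         - name: jobmanager          # hypothetical container name
>>>>>>>>>           image: my-flink-job:1.0   # hypothetical image
>>>>>>>>> ```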
>>>>>>>>>
>>>>>>>>> [1].
>>>>>>>>> https://kubernetes.io/docs/concepts/workloads/controllers/job/#job-termination-and-cleanup
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Yang
>>>>>>>>>
>>>>>>>>> Eleanore Jin <[email protected]> wrote on Tue, Aug 4, 2020 at 12:05 AM:
>>>>>>>>>
>>>>>>>>>> Hi Till,
>>>>>>>>>>
>>>>>>>>>> Thanks for the reply!
>>>>>>>>>>
>>>>>>>>>> I manually deploy in per-job mode [1] and I am using Flink 1.8.2.
>>>>>>>>>> Specifically, I build a custom docker image, into which I copy the app
>>>>>>>>>> jar (not an uber jar) and all its dependencies under /flink/lib.
>>>>>>>>>>
>>>>>>>>>> So my question is more like: in this case, if the job is marked
>>>>>>>>>> as FAILED, which causes k8s to restart the pod, this does not seem to
>>>>>>>>>> help at all. What are the suggestions for such a scenario?
>>>>>>>>>>
>>>>>>>>>> Thanks a lot!
>>>>>>>>>> Eleanore
>>>>>>>>>>
>>>>>>>>>> [1]
>>>>>>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.8/ops/deployment/kubernetes.html#flink-job-cluster-on-kubernetes
>>>>>>>>>>
>>>>>>>>>> On Mon, Aug 3, 2020 at 2:13 AM Till Rohrmann <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Eleanore,
>>>>>>>>>>>
>>>>>>>>>>> how are you deploying Flink exactly? Are you using the
>>>>>>>>>>> application mode with native K8s support to deploy a cluster [1] or
>>>>>>>>>>> are you manually deploying in per-job mode [2]?
>>>>>>>>>>>
>>>>>>>>>>> I believe the problem might be that we terminate the Flink
>>>>>>>>>>> process with a non-zero exit code if the job reaches the
>>>>>>>>>>> ApplicationStatus.FAILED [3].
>>>>>>>>>>>
>>>>>>>>>>> cc Yang Wang have you observed a similar behavior when running
>>>>>>>>>>> Flink in per-job mode on K8s?
>>>>>>>>>>>
>>>>>>>>>>> [1]
>>>>>>>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/deployment/native_kubernetes.html#flink-kubernetes-application
>>>>>>>>>>> [2]
>>>>>>>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/deployment/kubernetes.html#job-cluster-resource-definitions
>>>>>>>>>>> [3]
>>>>>>>>>>> https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/clusterframework/ApplicationStatus.java#L32
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Jul 31, 2020 at 6:26 PM Eleanore Jin <
>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Experts,
>>>>>>>>>>>>
>>>>>>>>>>>> I have a flink cluster (per job mode) running on kubernetes.
>>>>>>>>>>>> The job is configured with restart strategy
>>>>>>>>>>>>
>>>>>>>>>>>> restart-strategy.fixed-delay.attempts: 3
>>>>>>>>>>>> restart-strategy.fixed-delay.delay: 10 s
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> So after 3 retries, the job will be marked as FAILED, hence
>>>>>>>>>>>> the pods are not running. However, kubernetes will then restart
>>>>>>>>>>>> the job again as the available replicas do not match the desired
>>>>>>>>>>>> number.
>>>>>>>>>>>>
>>>>>>>>>>>> I wonder what the suggestions are for such a scenario? How
>>>>>>>>>>>> should I configure the flink job running on k8s?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks a lot!
>>>>>>>>>>>> Eleanore
>>>>>>>>>>>>
>>>>>>>>>>>
