Hi Yang, Thanks a lot for the information!
Eleanore

On Thu, Aug 6, 2020 at 4:20 AM Yang Wang <danrtsey...@gmail.com> wrote:

> Hi Eleanore,
>
> From my experience, collecting the Flink metrics in Prometheus via the metrics reporter [1] is the more practical way, and it also makes it easier to configure alerts. Maybe you could use "fullRestarts" or "numRestarts" to monitor job restarts. More metrics can be found here [2]; a small configuration sketch follows further below.
>
> [1]. https://ci.apache.org/projects/flink/flink-docs-master/monitoring/metrics.html#prometheus-orgapacheflinkmetricsprometheusprometheusreporter
> [2]. https://ci.apache.org/projects/flink/flink-docs-master/monitoring/metrics.html#availability
>
> Best,
> Yang
>
> On Wed, Aug 5, 2020 at 11:52 PM Eleanore Jin <eleanore....@gmail.com> wrote:
>
>> Hi Yang and Till,
>>
>> Thanks a lot for the help! I have a similar question to the one Till mentioned: if we do not fail the Flink pods when the restart strategy is exhausted, it might be hard to monitor such failures. Today I get alerts if the k8s pods are restarted or in a crash loop, but if that is no longer the case, how can we deal with the monitoring? In production, I have hundreds of small Flink jobs running (2-8 TM pods) doing stateless processing; it is really hard for us to expose an ingress for each JM REST endpoint to periodically query the job status of each Flink job.
>>
>> Thanks a lot!
>> Eleanore
>>
>> On Wed, Aug 5, 2020 at 4:56 AM Till Rohrmann <trohrm...@apache.org> wrote:
>>
>>> You are right Yang Wang.
>>>
>>> Thanks for creating this issue.
>>>
>>> Cheers,
>>> Till
>>>
>>> On Wed, Aug 5, 2020 at 1:33 PM Yang Wang <danrtsey...@gmail.com> wrote:
>>>
>>>> Actually, the application status shown in the YARN web UI is not determined by the jobmanager process exit code. Instead, we use "resourceManagerClient.unregisterApplicationMaster" to control the final status of the YARN application. So even though the jobmanager exits with a zero code, the application could still show a FAILED status in the YARN web UI.
>>>>
>>>> I have created a ticket to track this improvement [1].
>>>>
>>>> [1]. https://issues.apache.org/jira/browse/FLINK-18828
>>>>
>>>> Best,
>>>> Yang
>>>>
>>>> On Wed, Aug 5, 2020 at 3:56 PM Till Rohrmann <trohrm...@apache.org> wrote:
>>>>
>>>>> Yes, for the other deployments it is not a problem. A reason why people preferred non-zero exit codes in case of FAILED jobs is that this is easier to monitor than having to take a look at the actual job result. Moreover, in the YARN web UI the application shows as failed, if I am not mistaken. However, from a framework's perspective, a FAILED job does not mean that Flink has failed and, hence, the return code could still be 0 in my opinion.
>>>>>
>>>>> Cheers,
>>>>> Till
>>>>>
>>>>> On Wed, Aug 5, 2020 at 9:30 AM Yang Wang <danrtsey...@gmail.com> wrote:
>>>>>
>>>>>> Hi Eleanore,
>>>>>>
>>>>>> Yes, I suggest using a Job to replace the Deployment. It can be used to run the jobmanager once and finish after a successful/failed completion.
>>>>>>
>>>>>> However, using a Job still does not solve your problem completely. Just as Till said, when a job exhausts the restart strategy, the jobmanager pod will terminate with a non-zero exit code, which will cause K8s to restart it again. Even though we could set the restartPolicy and backoffLimit, this is not a clean and correct way to go. We should terminate the jobmanager process with a zero exit code in such a situation.
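Going back to the monitoring point at the top of this thread, here is a minimal sketch of what the Prometheus-based approach could look like. The reporter lines follow the PrometheusReporter setup Yang linked; the exported metric name, the labels, and the thresholds in the alert rule are assumptions that need to be checked against your own setup:

# flink-conf.yaml: enable the Prometheus metrics reporter
metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
metrics.reporter.prom.port: 9249

# Prometheus alert rule (sketch): fire when a job has restarted within the last 5 minutes.
# "fullRestarts" is a jobmanager job-scope metric; newer Flink versions also expose "numRestarts".
groups:
  - name: flink-job-restarts
    rules:
      - alert: FlinkJobRestarting
        expr: delta(flink_jobmanager_job_fullRestarts[5m]) > 0
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Flink job {{ $labels.job_name }} restarted within the last 5 minutes"

If Prometheus scrapes the reporter port on each jobmanager pod (for example via Kubernetes service discovery), an alert like this avoids having to expose an ingress per JM REST endpoint.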
>>>>>> @Till Rohrmann <trohrm...@apache.org> I just have one concern: is this a special case for the K8s deployment? For standalone/Yarn/Mesos, it seems that terminating with a non-zero exit code is harmless.
>>>>>>
>>>>>> Best,
>>>>>> Yang
>>>>>>
>>>>>> On Tue, Aug 4, 2020 at 11:54 PM Eleanore Jin <eleanore....@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Yang & Till,
>>>>>>>
>>>>>>> Thanks for your prompt reply!
>>>>>>>
>>>>>>> Yang, regarding your question, I am actually not using a k8s Job, as I put my app.jar and its dependencies under Flink's lib directory. I have 1 k8s deployment for the job manager, 1 k8s deployment for the task manager, and 1 k8s service for the job manager.
>>>>>>>
>>>>>>> As you mentioned above, if the Flink job is marked as FAILED, it causes the job manager pod to be restarted, which is not the ideal behavior.
>>>>>>>
>>>>>>> Do you suggest that I should change the deployment strategy from a k8s deployment to a k8s job, so that in case the Flink program exits with a non-zero code (e.g. the configured number of restarts is exhausted), the pod can be marked as complete and the job is not restarted again?
>>>>>>>
>>>>>>> Thanks a lot!
>>>>>>> Eleanore
>>>>>>>
>>>>>>> On Tue, Aug 4, 2020 at 2:49 AM Yang Wang <danrtsey...@gmail.com> wrote:
>>>>>>>
>>>>>>>> @Till Rohrmann <trohrm...@apache.org> In native mode, when a Flink application terminates in FAILED state, all the resources will be cleaned up.
>>>>>>>>
>>>>>>>> However, in standalone mode, I agree with you that we need to rethink the exit code of Flink. When a job exhausts the restart strategy, we should terminate the pod and not restart it again. After googling, it seems that we cannot specify the restartPolicy based on the exit code [1]. So maybe we need to return a zero exit code to avoid K8s restarting the pod.
>>>>>>>>
>>>>>>>> [1]. https://stackoverflow.com/questions/48797297/is-it-possible-to-define-restartpolicy-based-on-container-exit-code
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Yang
>>>>>>>>
>>>>>>>> On Tue, Aug 4, 2020 at 3:48 PM Till Rohrmann <trohrm...@apache.org> wrote:
>>>>>>>>
>>>>>>>>> @Yang Wang <danrtsey...@gmail.com> I believe that we should rethink the exit codes of Flink. In general you want K8s to restart a failed Flink process. Hence, an application which terminates in state FAILED should not return a non-zero exit code, because it is a valid termination state.
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Till
>>>>>>>>>
>>>>>>>>> On Tue, Aug 4, 2020 at 8:55 AM Yang Wang <danrtsey...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Eleanore,
>>>>>>>>>>
>>>>>>>>>> I think you are using the K8s resource "Job" to deploy the jobmanager. Please set .spec.template.spec.restartPolicy = "Never" and spec.backoffLimit = 0. Refer to [1] for more information.
>>>>>>>>>>
>>>>>>>>>> Then, when the jobmanager fails for any reason, the K8s Job will be marked as failed, and K8s will not restart it again.
>>>>>>>>>>
>>>>>>>>>> [1]. https://kubernetes.io/docs/concepts/workloads/controllers/job/#job-termination-and-cleanup
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Yang
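A minimal sketch of what a jobmanager Job manifest with these settings could look like; the resource names, labels, image tag, and entrypoint arguments are placeholders rather than anything from this thread, and would need to be adapted to the actual per-job image:

apiVersion: batch/v1
kind: Job
metadata:
  name: flink-jobmanager            # placeholder name
spec:
  backoffLimit: 0                   # do not create replacement pods after a failure
  template:
    metadata:
      labels:
        app: flink
        component: jobmanager
    spec:
      restartPolicy: Never          # never restart the container in place
      containers:
        - name: jobmanager
          image: my-flink-job:1.8.2     # placeholder custom image with the app jar under /flink/lib
          args: ["job-cluster"]         # placeholder entrypoint argument for a standalone job cluster
          ports:
            - containerPort: 6123       # RPC
            - containerPort: 8081       # web UI / REST

With restartPolicy: Never and backoffLimit: 0, the jobmanager pod is never recreated: the Job ends up Complete on a zero exit code and Failed otherwise, which matches the behavior Yang describes.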
>>>>>>>>>> On Tue, Aug 4, 2020 at 12:05 AM Eleanore Jin <eleanore....@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Till,
>>>>>>>>>>>
>>>>>>>>>>> Thanks for the reply!
>>>>>>>>>>>
>>>>>>>>>>> I manually deploy in per-job mode [1] and I am using Flink 1.8.2. Specifically, I build a custom docker image into which I copied the app jar (not an uber jar) and all its dependencies under /flink/lib.
>>>>>>>>>>>
>>>>>>>>>>> So my question is more like: in this case, if the job is marked as FAILED, which causes k8s to restart the pod, this does not seem to help at all. What are the suggestions for such a scenario?
>>>>>>>>>>>
>>>>>>>>>>> Thanks a lot!
>>>>>>>>>>> Eleanore
>>>>>>>>>>>
>>>>>>>>>>> [1] https://ci.apache.org/projects/flink/flink-docs-release-1.8/ops/deployment/kubernetes.html#flink-job-cluster-on-kubernetes
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Aug 3, 2020 at 2:13 AM Till Rohrmann <trohrm...@apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Eleanore,
>>>>>>>>>>>>
>>>>>>>>>>>> how are you deploying Flink exactly? Are you using the application mode with native K8s support to deploy a cluster [1], or are you manually deploying a per-job cluster [2]?
>>>>>>>>>>>>
>>>>>>>>>>>> I believe the problem might be that we terminate the Flink process with a non-zero exit code if the job reaches ApplicationStatus.FAILED [3].
>>>>>>>>>>>>
>>>>>>>>>>>> cc Yang Wang: have you observed a similar behavior when running Flink in per-job mode on K8s?
>>>>>>>>>>>>
>>>>>>>>>>>> [1] https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/deployment/native_kubernetes.html#flink-kubernetes-application
>>>>>>>>>>>> [2] https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/deployment/kubernetes.html#job-cluster-resource-definitions
>>>>>>>>>>>> [3] https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/clusterframework/ApplicationStatus.java#L32
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Jul 31, 2020 at 6:26 PM Eleanore Jin <eleanore....@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Experts,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have a flink cluster (per-job mode) running on kubernetes. The job is configured with the restart strategy
>>>>>>>>>>>>>
>>>>>>>>>>>>> restart-strategy.fixed-delay.attempts: 3
>>>>>>>>>>>>> restart-strategy.fixed-delay.delay: 10 s
>>>>>>>>>>>>>
>>>>>>>>>>>>> So after 3 retries, the job will be marked as FAILED and the pods stop running. However, kubernetes will then restart the job again, as the available replicas do not match the desired number.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I wonder what the suggestions are for such a scenario? How should I configure the flink job running on k8s?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks a lot!
>>>>>>>>>>>>> Eleanore
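For context, a rough sketch of the kind of jobmanager Deployment described earlier in the thread; the names, labels, and image are placeholders and not taken from the thread. It illustrates why the pod keeps coming back: the Deployment controller recreates pods until the observed replicas match spec.replicas, regardless of the container's exit code.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: flink-jobmanager            # placeholder name
spec:
  replicas: 1                       # the controller recreates the pod whenever fewer than 1 replica is running,
                                    # even after the Flink job has reached FAILED and the process has exited
  selector:
    matchLabels:
      app: flink
      component: jobmanager
  template:
    metadata:
      labels:
        app: flink
        component: jobmanager
    spec:
      containers:
        - name: jobmanager
          image: my-flink-job:1.8.2     # placeholder custom image with the app jar under /flink/lib
          args: ["job-cluster"]         # placeholder entrypoint argument for a standalone job cluster
          ports:
            - containerPort: 8081       # web UI / REST

Swapping this Deployment for the Job sketched earlier (restartPolicy: Never, backoffLimit: 0) is the workaround discussed in the thread; terminating the jobmanager with a zero exit code when the job reaches FAILED is the cleaner long-term fix Yang and Till discuss above.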