Hi Yang, Thanks a lot for the information!
Eleanore

On Thu, Aug 6, 2020 at 4:20 AM Yang Wang <danrtsey...@gmail.com> wrote:

> Hi Eleanore,
>
> From my experience, collecting the Flink metrics in Prometheus via the metrics reporter [1] is the more practical way, and it also makes it easier to configure alerts. Maybe you could use "fullRestarts" or "numRestarts" to monitor job restarts. More metrics can be found here [2]; a small configuration sketch follows further below.
>
> [1]. https://ci.apache.org/projects/flink/flink-docs-master/monitoring/metrics.html#prometheus-orgapacheflinkmetricsprometheusprometheusreporter
> [2]. https://ci.apache.org/projects/flink/flink-docs-master/monitoring/metrics.html#availability
>
> Best,
> Yang
>
> On Wed, Aug 5, 2020 at 11:52 PM Eleanore Jin <eleanore....@gmail.com> wrote:
>
>> Hi Yang and Till,
>>
>> Thanks a lot for the help! I have a similar question to the one Till mentioned: if we do not fail the Flink pods when the restart strategy is exhausted, it might be hard to monitor such failures. Today I get alerts if the k8s pods are restarted or in a crash loop, but if that is no longer the case, how can we deal with the monitoring? In production, I have hundreds of small Flink jobs running (2-8 TM pods) doing stateless processing; it is really hard for us to expose an ingress for each JM REST endpoint to periodically query the job status of each Flink job.
>>
>> Thanks a lot!
>> Eleanore
>>
>> On Wed, Aug 5, 2020 at 4:56 AM Till Rohrmann <trohrm...@apache.org> wrote:
>>
>>> You are right Yang Wang.
>>>
>>> Thanks for creating this issue.
>>>
>>> Cheers,
>>> Till
>>>
>>> On Wed, Aug 5, 2020 at 1:33 PM Yang Wang <danrtsey...@gmail.com> wrote:
>>>
>>>> Actually, the application status shown in the YARN web UI is not determined by the jobmanager process exit code. Instead, we use "resourceManagerClient.unregisterApplicationMaster" to control the final status of the YARN application. So even though the jobmanager exits with a zero code, the application could still show a FAILED status in the YARN web UI.
>>>>
>>>> I have created a ticket to track this improvement [1].
>>>>
>>>> [1]. https://issues.apache.org/jira/browse/FLINK-18828
>>>>
>>>> Best,
>>>> Yang
>>>>
>>>> On Wed, Aug 5, 2020 at 3:56 PM Till Rohrmann <trohrm...@apache.org> wrote:
>>>>
>>>>> Yes, for the other deployments it is not a problem. A reason why people preferred non-zero exit codes in case of FAILED jobs is that this is easier to monitor than having to take a look at the actual job result. Moreover, in the YARN web UI the application shows as failed, if I am not mistaken. However, from a framework's perspective, a FAILED job does not mean that Flink has failed and, hence, the return code could still be 0 in my opinion.
>>>>>
>>>>> Cheers,
>>>>> Till
>>>>>
>>>>> On Wed, Aug 5, 2020 at 9:30 AM Yang Wang <danrtsey...@gmail.com> wrote:
>>>>>
>>>>>> Hi Eleanore,
>>>>>>
>>>>>> Yes, I suggest using a Job to replace the Deployment. It can be used to run the jobmanager once and finish after a successful/failed completion.
>>>>>>
>>>>>> However, using a Job still does not solve your problem completely. Just as Till said, when a job exhausts the restart strategy, the jobmanager pod will terminate with a non-zero exit code, which will cause K8s to restart it again. Even though we could set the restartPolicy and backoffLimit, this is not a clean and correct way to go. We should terminate the jobmanager process with a zero exit code in such a situation.
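Going back to the monitoring point at the top of this thread, here is a minimal sketch of what the Prometheus-based approach could look like. The reporter lines follow the PrometheusReporter setup Yang linked; the exported metric name, the labels, and the thresholds in the alert rule are assumptions that need to be checked against your own setup:

# flink-conf.yaml: enable the Prometheus metrics reporter
metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
metrics.reporter.prom.port: 9249

# Prometheus alert rule (sketch): fire when a job has restarted within the last 5 minutes.
# "fullRestarts" is a jobmanager job-scope metric; newer Flink versions also expose "numRestarts".
groups:
  - name: flink-job-restarts
    rules:
      - alert: FlinkJobRestarting
        expr: delta(flink_jobmanager_job_fullRestarts[5m]) > 0
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Flink job {{ $labels.job_name }} restarted within the last 5 minutes"

If Prometheus scrapes the reporter port on each jobmanager pod (for example via Kubernetes service discovery), an alert like this avoids having to expose an ingress per JM REST endpoint.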
>>>>>> @Till Rohrmann <trohrm...@apache.org> I just have one concern: is this a special case for the K8s deployment? For standalone/Yarn/Mesos, it seems that terminating with a non-zero exit code is harmless.
>>>>>>
>>>>>> Best,
>>>>>> Yang
>>>>>>
>>>>>> On Tue, Aug 4, 2020 at 11:54 PM Eleanore Jin <eleanore....@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Yang & Till,
>>>>>>>
>>>>>>> Thanks for your prompt reply!
>>>>>>>
>>>>>>> Yang, regarding your question, I am actually not using a k8s Job, as I put my app.jar and its dependencies under Flink's lib directory. I have 1 k8s deployment for the job manager, 1 k8s deployment for the task manager, and 1 k8s service for the job manager.
>>>>>>>
>>>>>>> As you mentioned above, if the Flink job is marked as FAILED, it causes the job manager pod to be restarted, which is not the ideal behavior.
>>>>>>>
>>>>>>> Do you suggest that I should change the deployment strategy from a k8s deployment to a k8s job, so that in case the Flink program exits with a non-zero code (e.g. the configured number of restarts is exhausted), the pod can be marked as complete and the job is not restarted again?
>>>>>>>
>>>>>>> Thanks a lot!
>>>>>>> Eleanore
>>>>>>>
>>>>>>> On Tue, Aug 4, 2020 at 2:49 AM Yang Wang <danrtsey...@gmail.com> wrote:
>>>>>>>
>>>>>>>> @Till Rohrmann <trohrm...@apache.org> In native mode, when a Flink application terminates in FAILED state, all the resources will be cleaned up.
>>>>>>>>
>>>>>>>> However, in standalone mode, I agree with you that we need to rethink the exit code of Flink. When a job exhausts the restart strategy, we should terminate the pod and not restart it again. After googling, it seems that we cannot specify the restartPolicy based on the exit code [1]. So maybe we need to return a zero exit code to avoid K8s restarting the pod.
>>>>>>>>
>>>>>>>> [1]. https://stackoverflow.com/questions/48797297/is-it-possible-to-define-restartpolicy-based-on-container-exit-code
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Yang
>>>>>>>>
>>>>>>>> On Tue, Aug 4, 2020 at 3:48 PM Till Rohrmann <trohrm...@apache.org> wrote:
>>>>>>>>
>>>>>>>>> @Yang Wang <danrtsey...@gmail.com> I believe that we should rethink the exit codes of Flink. In general you want K8s to restart a failed Flink process. Hence, an application which terminates in state FAILED should not return a non-zero exit code, because it is a valid termination state.
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Till
>>>>>>>>>
>>>>>>>>> On Tue, Aug 4, 2020 at 8:55 AM Yang Wang <danrtsey...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Eleanore,
>>>>>>>>>>
>>>>>>>>>> I think you are using the K8s resource "Job" to deploy the jobmanager. Please set .spec.template.spec.restartPolicy = "Never" and spec.backoffLimit = 0. Refer to [1] for more information.
>>>>>>>>>>
>>>>>>>>>> Then, when the jobmanager fails for any reason, the K8s Job will be marked as failed, and K8s will not restart it again.
>>>>>>>>>>
>>>>>>>>>> [1]. https://kubernetes.io/docs/concepts/workloads/controllers/job/#job-termination-and-cleanup
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Yang
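A minimal sketch of what a jobmanager Job manifest with these settings could look like; the resource names, labels, image tag, and entrypoint arguments are placeholders rather than anything from this thread, and would need to be adapted to the actual per-job image:

apiVersion: batch/v1
kind: Job
metadata:
  name: flink-jobmanager            # placeholder name
spec:
  backoffLimit: 0                   # do not create replacement pods after a failure
  template:
    metadata:
      labels:
        app: flink
        component: jobmanager
    spec:
      restartPolicy: Never          # never restart the container in place
      containers:
        - name: jobmanager
          image: my-flink-job:1.8.2     # placeholder custom image with the app jar under /flink/lib
          args: ["job-cluster"]         # placeholder entrypoint argument for a standalone job cluster
          ports:
            - containerPort: 6123       # RPC
            - containerPort: 8081       # web UI / REST

With restartPolicy: Never and backoffLimit: 0, the jobmanager pod is never recreated: the Job ends up Complete on a zero exit code and Failed otherwise, which matches the behavior Yang describes.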
>>>>>>>>>> On Tue, Aug 4, 2020 at 12:05 AM Eleanore Jin <eleanore....@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Till,
>>>>>>>>>>>
>>>>>>>>>>> Thanks for the reply!
>>>>>>>>>>>
>>>>>>>>>>> I manually deploy in per-job mode [1] and I am using Flink 1.8.2. Specifically, I build a custom docker image into which I copied the app jar (not an uber jar) and all its dependencies under /flink/lib.
>>>>>>>>>>>
>>>>>>>>>>> So my question is more like: in this case, if the job is marked as FAILED, which causes k8s to restart the pod, this does not seem to help at all. What are the suggestions for such a scenario?
>>>>>>>>>>>
>>>>>>>>>>> Thanks a lot!
>>>>>>>>>>> Eleanore
>>>>>>>>>>>
>>>>>>>>>>> [1] https://ci.apache.org/projects/flink/flink-docs-release-1.8/ops/deployment/kubernetes.html#flink-job-cluster-on-kubernetes
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Aug 3, 2020 at 2:13 AM Till Rohrmann <trohrm...@apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Eleanore,
>>>>>>>>>>>>
>>>>>>>>>>>> how are you deploying Flink exactly? Are you using the application mode with native K8s support to deploy a cluster [1], or are you manually deploying a per-job cluster [2]?
>>>>>>>>>>>>
>>>>>>>>>>>> I believe the problem might be that we terminate the Flink process with a non-zero exit code if the job reaches ApplicationStatus.FAILED [3].
>>>>>>>>>>>>
>>>>>>>>>>>> cc Yang Wang: have you observed a similar behavior when running Flink in per-job mode on K8s?
>>>>>>>>>>>>
>>>>>>>>>>>> [1] https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/deployment/native_kubernetes.html#flink-kubernetes-application
>>>>>>>>>>>> [2] https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/deployment/kubernetes.html#job-cluster-resource-definitions
>>>>>>>>>>>> [3] https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/clusterframework/ApplicationStatus.java#L32
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Jul 31, 2020 at 6:26 PM Eleanore Jin <eleanore....@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Experts,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have a flink cluster (per-job mode) running on kubernetes. The job is configured with the restart strategy
>>>>>>>>>>>>>
>>>>>>>>>>>>> restart-strategy.fixed-delay.attempts: 3
>>>>>>>>>>>>> restart-strategy.fixed-delay.delay: 10 s
>>>>>>>>>>>>>
>>>>>>>>>>>>> So after 3 retries, the job will be marked as FAILED and the pods stop running. However, kubernetes will then restart the job again, as the available replicas do not match the desired number.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I wonder what the suggestions are for such a scenario? How should I configure the flink job running on k8s?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks a lot!
>>>>>>>>>>>>> Eleanore
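For context, a rough sketch of the kind of jobmanager Deployment described earlier in the thread; the names, labels, and image are placeholders and not taken from the thread. It illustrates why the pod keeps coming back: the Deployment controller recreates pods until the observed replicas match spec.replicas, regardless of the container's exit code.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: flink-jobmanager            # placeholder name
spec:
  replicas: 1                       # the controller recreates the pod whenever fewer than 1 replica is running,
                                    # even after the Flink job has reached FAILED and the process has exited
  selector:
    matchLabels:
      app: flink
      component: jobmanager
  template:
    metadata:
      labels:
        app: flink
        component: jobmanager
    spec:
      containers:
        - name: jobmanager
          image: my-flink-job:1.8.2     # placeholder custom image with the app jar under /flink/lib
          args: ["job-cluster"]         # placeholder entrypoint argument for a standalone job cluster
          ports:
            - containerPort: 8081       # web UI / REST

Swapping this Deployment for the Job sketched earlier (restartPolicy: Never, backoffLimit: 0) is the workaround discussed in the thread; terminating the jobmanager with a zero exit code when the job reaches FAILED is the cleaner long-term fix Yang and Till discuss above.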