Hi Eleanore,

Yes, I suggest using a K8s Job instead of a Deployment. A Job can run the
jobmanager once and finish after a successful/failed completion.
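For illustration, a minimal Job manifest along these lines might look as
follows. This is only a sketch: the image name, job classname, and args are
placeholders, not something from your setup.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: flink-jobmanager
spec:
  backoffLimit: 0            # do not retry the pod after a failure
  template:
    metadata:
      labels:
        app: flink
        component: jobmanager
    spec:
      restartPolicy: Never   # let the Job record success/failure instead of restarting the container
      containers:
        - name: jobmanager
          # placeholder image with the app jar and dependencies under /flink/lib
          image: my-flink-job:latest
          # placeholder entrypoint args for a job-cluster style image
          args: ["job-cluster", "--job-classname", "com.example.MyJob"]
          ports:
            - containerPort: 6123   # RPC
            - containerPort: 8081   # web UI
```

With restartPolicy: Never and backoffLimit: 0, a jobmanager pod that exits is
not restarted by K8s; the Job is simply marked Complete or Failed.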
However, using a Job still does not solve your problem completely. As Till
said, when a job exhausts the restart strategy, the jobmanager pod terminates
with a non-zero exit code, which causes K8s to restart it again. Even though
we could set restartPolicy and backoffLimit, this is not a clean and correct
way to go. We should terminate the jobmanager process with a zero exit code in
such a situation.

@Till Rohrmann <trohrm...@apache.org> I just have one concern: is this a
special case for the K8s deployment? For standalone/Yarn/Mesos, terminating
with a non-zero exit code seems harmless.

Best,
Yang

On Tue, Aug 4, 2020 at 11:54 PM Eleanore Jin <eleanore....@gmail.com> wrote:

> Hi Yang & Till,
>
> Thanks for your prompt reply!
>
> Yang, regarding your question, I am actually not using a k8s Job, as I put
> my app.jar and its dependencies under flink's lib directory. I have one k8s
> deployment for the job manager, one k8s deployment for the task manager,
> and one k8s service for the job manager.
>
> As you mentioned above, if the flink job is marked as FAILED, it will
> cause the job manager pod to be restarted, which is not the ideal behavior.
>
> Do you suggest that I change the deployment strategy from a k8s deployment
> to a k8s Job? In case the flink program exits with a non-zero code (e.g.
> after exhausting the configured number of restarts), the pod can be marked
> as complete and hence the job is not restarted again?
>
> Thanks a lot!
> Eleanore
>
> On Tue, Aug 4, 2020 at 2:49 AM Yang Wang <danrtsey...@gmail.com> wrote:
>
>> @Till Rohrmann <trohrm...@apache.org> In native mode, when a Flink
>> application terminates in FAILED state, all the resources will be
>> cleaned up.
>>
>> However, in standalone mode, I agree with you that we need to rethink
>> the exit code of Flink. When a job exhausts the restart strategy, we
>> should terminate the pod and not restart it again. After googling, it
>> seems that we cannot specify the restartPolicy based on the exit
>> code [1].
>> So maybe we need to return a zero exit code to avoid the restart by K8s.
>>
>> [1].
>> https://stackoverflow.com/questions/48797297/is-it-possible-to-define-restartpolicy-based-on-container-exit-code
>>
>> Best,
>> Yang
>>
>> On Tue, Aug 4, 2020 at 3:48 PM Till Rohrmann <trohrm...@apache.org> wrote:
>>
>>> @Yang Wang <danrtsey...@gmail.com> I believe that we should rethink the
>>> exit codes of Flink. In general you want K8s to restart a failed Flink
>>> process. Hence, an application which terminates in state FAILED should
>>> not return a non-zero exit code, because it is a valid termination state.
>>>
>>> Cheers,
>>> Till
>>>
>>> On Tue, Aug 4, 2020 at 8:55 AM Yang Wang <danrtsey...@gmail.com> wrote:
>>>
>>>> Hi Eleanore,
>>>>
>>>> I think you are using the K8s resource "Job" to deploy the jobmanager.
>>>> Please set .spec.template.spec.restartPolicy = "Never" and
>>>> spec.backoffLimit = 0. Refer here [1] for more information.
>>>>
>>>> Then, when the jobmanager fails for any reason, the K8s job will be
>>>> marked failed, and K8s will not restart it again.
>>>>
>>>> [1].
>>>> https://kubernetes.io/docs/concepts/workloads/controllers/job/#job-termination-and-cleanup
>>>>
>>>> Best,
>>>> Yang
>>>>
>>>> On Tue, Aug 4, 2020 at 12:05 AM Eleanore Jin <eleanore....@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Till,
>>>>>
>>>>> Thanks for the reply!
>>>>>
>>>>> I manually deploy in per-job mode [1], and I am using Flink 1.8.2.
>>>>> Specifically, I build a custom docker image into which I copied the
>>>>> app jar (not an uber jar) and all its dependencies under /flink/lib.
>>>>>
>>>>> So my question is more like: in this case, if the job is marked as
>>>>> FAILED, which causes k8s to restart the pod, this does not seem to
>>>>> help at all. What are the suggestions for such a scenario?
>>>>>
>>>>> Thanks a lot!
>>>>> Eleanore
>>>>>
>>>>> [1]
>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.8/ops/deployment/kubernetes.html#flink-job-cluster-on-kubernetes
>>>>>
>>>>> On Mon, Aug 3, 2020 at 2:13 AM Till Rohrmann <trohrm...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> Hi Eleanore,
>>>>>>
>>>>>> how are you deploying Flink exactly? Are you using the application
>>>>>> mode with native K8s support to deploy a cluster [1], or are you
>>>>>> manually deploying in per-job mode [2]?
>>>>>>
>>>>>> I believe the problem might be that we terminate the Flink process
>>>>>> with a non-zero exit code if the job reaches
>>>>>> ApplicationStatus.FAILED [3].
>>>>>>
>>>>>> cc Yang Wang: have you observed a similar behavior when running
>>>>>> Flink in per-job mode on K8s?
>>>>>>
>>>>>> [1]
>>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/deployment/native_kubernetes.html#flink-kubernetes-application
>>>>>> [2]
>>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/deployment/kubernetes.html#job-cluster-resource-definitions
>>>>>> [3]
>>>>>> https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/clusterframework/ApplicationStatus.java#L32
>>>>>>
>>>>>> On Fri, Jul 31, 2020 at 6:26 PM Eleanore Jin <eleanore....@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Experts,
>>>>>>>
>>>>>>> I have a flink cluster (per-job mode) running on kubernetes. The job
>>>>>>> is configured with a restart strategy:
>>>>>>>
>>>>>>> restart-strategy.fixed-delay.attempts: 3
>>>>>>> restart-strategy.fixed-delay.delay: 10 s
>>>>>>>
>>>>>>> So after 3 retries, the job will be marked as FAILED, and hence the
>>>>>>> pods are no longer running. However, kubernetes will then restart
>>>>>>> the job again, as the available replicas do not match the desired
>>>>>>> number.
>>>>>>>
>>>>>>> I wonder what the suggestions are for such a scenario? How should I
>>>>>>> configure the flink job running on k8s?
>>>>>>>
>>>>>>> Thanks a lot!
>>>>>>> Eleanore