@Yang Wang <danrtsey...@gmail.com> I believe that we should rethink the exit codes of Flink. In general, you want K8s to restart a failed Flink process. Hence, an application which terminates in state FAILED should not return a non-zero exit code, because FAILED is a valid termination state.
Cheers,
Till

On Tue, Aug 4, 2020 at 8:55 AM Yang Wang <danrtsey...@gmail.com> wrote:

> Hi Eleanore,
>
> I think you are using the K8s resource "Job" to deploy the jobmanager. Please
> set .spec.template.spec.restartPolicy = "Never" and spec.backoffLimit = 0.
> Refer here[1] for more information.
>
> Then, when the jobmanager fails for any reason, the K8s Job will be marked
> failed, and K8s will not restart it again.
>
> [1]
> https://kubernetes.io/docs/concepts/workloads/controllers/job/#job-termination-and-cleanup
>
> Best,
> Yang
>
> Eleanore Jin <eleanore....@gmail.com> wrote on Tue, Aug 4, 2020 at 12:05 AM:
>
>> Hi Till,
>>
>> Thanks for the reply!
>>
>> I manually deploy in per-job mode [1] and I am using Flink 1.8.2.
>> Specifically, I build a custom docker image into which I copied the app jar
>> (not an uber jar) and all its dependencies under /flink/lib.
>>
>> So my question is: in this case, if the job is marked as FAILED, which
>> causes K8s to restart the pod, the restart does not seem to help at all.
>> What are the suggestions for such a scenario?
>>
>> Thanks a lot!
>> Eleanore
>>
>> [1]
>> https://ci.apache.org/projects/flink/flink-docs-release-1.8/ops/deployment/kubernetes.html#flink-job-cluster-on-kubernetes
>>
>> On Mon, Aug 3, 2020 at 2:13 AM Till Rohrmann <trohrm...@apache.org>
>> wrote:
>>
>>> Hi Eleanore,
>>>
>>> how are you deploying Flink exactly? Are you using the application mode
>>> with native K8s support to deploy a cluster [1], or are you manually
>>> deploying in per-job mode [2]?
>>>
>>> I believe the problem might be that we terminate the Flink process with
>>> a non-zero exit code if the job reaches ApplicationStatus.FAILED [3].
>>>
>>> cc Yang Wang: have you observed similar behavior when running Flink in
>>> per-job mode on K8s?
>>>
>>> [1]
>>> https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/deployment/native_kubernetes.html#flink-kubernetes-application
>>> [2]
>>> https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/deployment/kubernetes.html#job-cluster-resource-definitions
>>> [3]
>>> https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/clusterframework/ApplicationStatus.java#L32
>>>
>>> On Fri, Jul 31, 2020 at 6:26 PM Eleanore Jin <eleanore....@gmail.com>
>>> wrote:
>>>
>>>> Hi Experts,
>>>>
>>>> I have a Flink cluster (per-job mode) running on Kubernetes. The job is
>>>> configured with the restart strategy
>>>>
>>>> restart-strategy.fixed-delay.attempts: 3
>>>> restart-strategy.fixed-delay.delay: 10 s
>>>>
>>>> So after 3 retries, the job will be marked as FAILED, and hence the pods
>>>> stop running. However, Kubernetes will then restart the job again
>>>> because the available replicas do not match the desired count.
>>>>
>>>> I wonder what the suggestions are for such a scenario? How should I
>>>> configure the Flink job running on K8s?
>>>>
>>>> Thanks a lot!
>>>> Eleanore
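For reference, Yang Wang's suggestion above (run the jobmanager as a K8s Job with `restartPolicy: "Never"` and `backoffLimit: 0`) could be sketched roughly as the manifest below. The image name, container args, and job classname are placeholders, not taken from this thread:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: flink-jobmanager
spec:
  backoffLimit: 0              # do not retry the pod when it fails
  template:
    spec:
      restartPolicy: Never     # let the Job be marked failed instead of restarting the pod
      containers:
        - name: jobmanager
          # placeholder: custom image with the job jar and dependencies under /flink/lib
          image: my-flink-job:latest
          # placeholder entrypoint/class for a job-cluster style deployment
          args: ["job-cluster", "--job-classname", "com.example.MyJob"]
```

With this setup, a jobmanager that exits non-zero causes the Job to be marked Failed once, rather than being restarted indefinitely by a Deployment's replica reconciliation.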