Hi Eleanore, how exactly are you deploying Flink? Are you using application mode with native K8s support to deploy a cluster [1], or are you manually deploying a per-job cluster [2]?
I believe the problem might be that we terminate the Flink process with a non-zero exit code if the job reaches ApplicationStatus.FAILED [3]. cc Yang Wang: have you observed similar behavior when running Flink in per-job mode on K8s?

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/deployment/native_kubernetes.html#flink-kubernetes-application
[2] https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/deployment/kubernetes.html#job-cluster-resource-definitions
[3] https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/clusterframework/ApplicationStatus.java#L32

On Fri, Jul 31, 2020 at 6:26 PM Eleanore Jin <eleanore....@gmail.com> wrote:
> Hi Experts,
>
> I have a Flink cluster (per-job mode) running on Kubernetes. The job is
> configured with the restart strategy
>
> restart-strategy.fixed-delay.attempts: 3
> restart-strategy.fixed-delay.delay: 10 s
>
> So after 3 retries, the job will be marked as FAILED, and the pods stop
> running. However, Kubernetes will then restart the job again, because the
> available replicas do not match the desired count.
>
> I wonder what the suggestions are for such a scenario? How should I
> configure a Flink job running on K8s?
>
> Thanks a lot!
> Eleanore
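As a possible workaround until we confirm the cause: if the JobManager is currently run under a Deployment, Kubernetes will always recreate the pod to match the desired replica count, regardless of why Flink exited. Running the JobManager under a batch/v1 Job instead lets a terminal non-zero exit stay terminal. This is only a sketch against the standalone per-job setup in [2] (not a definitive fix); the image tag, job name, and `--job-classname` value below are placeholders you would replace with your own.

```yaml
# flink-conf.yaml fragment: Flink itself stops retrying after 3 attempts
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 3
restart-strategy.fixed-delay.delay: 10 s
---
# Hypothetical JobManager manifest: a batch/v1 Job instead of a Deployment,
# so Kubernetes does not recreate the pod once the Flink process exits.
apiVersion: batch/v1
kind: Job
metadata:
  name: flink-jobmanager        # placeholder name
spec:
  backoffLimit: 0               # do not retry the pod after a non-zero exit
  template:
    spec:
      restartPolicy: Never      # a FAILED job stays failed
      containers:
        - name: jobmanager
          image: flink:1.11     # placeholder image tag
          args: ["standalone-job", "--job-classname", "com.example.MyJob"]
```

The trade-off is that transient infrastructure failures (e.g. a node dying) are then no longer healed by Kubernetes either, so you would need some external supervision or alerting for those cases.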