Hi Eleanore,

How are you deploying Flink exactly? Are you using application mode
with native K8s support to deploy a cluster [1], or are you manually
deploying a job cluster in per-job mode [2]?
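
For reference, a rough sketch of how the two modes are usually started
(the cluster id, image name and jar path below are just placeholders,
not taken from your setup):

  # application mode with native K8s support [1]
  ./bin/flink run-application \
    -t kubernetes-application \
    -Dkubernetes.cluster-id=my-job-cluster \
    -Dkubernetes.container.image=my-flink-image \
    local:///opt/flink/usrlib/my-job.jar

  # standalone per-job cluster [2]: apply the job cluster resource
  # definitions from the linked page with kubectl, e.g.
  kubectl create -f <job-cluster-resource-definition>.yaml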

I believe the problem might be that we terminate the Flink process with a
non-zero exit code if the job reaches ApplicationStatus.FAILED [3].
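
If that is the case, the last terminated JobManager container should show a
non-zero exit code. One way to check (the pod name is a placeholder):

  kubectl get pod <jobmanager-pod> \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'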

cc Yang Wang: have you observed similar behavior when running Flink in
per-job mode on K8s?

[1]
https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/deployment/native_kubernetes.html#flink-kubernetes-application
[2]
https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/deployment/kubernetes.html#job-cluster-resource-definitions
[3]
https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/clusterframework/ApplicationStatus.java#L32

On Fri, Jul 31, 2020 at 6:26 PM Eleanore Jin <eleanore....@gmail.com> wrote:

> Hi Experts,
>
> I have a Flink cluster (per-job mode) running on Kubernetes. The job is
> configured with the following restart strategy:
>
> restart-strategy.fixed-delay.attempts: 3
> restart-strategy.fixed-delay.delay: 10 s
>
>
> So after 3 retries, the job will be marked as FAILED, and hence the pods
> are no longer running. However, Kubernetes will then restart the job again
> because the available replicas do not match the desired count.
>
> I wonder what the suggestions are for such a scenario? How should I
> configure the Flink job running on K8s?
>
> Thanks a lot!
> Eleanore
>
