Hi Tianyi,

This really depends on your Kubernetes setup (e.g. whether autoscaling is
enabled, or whether you're using spot / preemptible instances). In general,
applications that run on Kubernetes need to be resilient to these kinds of
failures, and Flink is no exception.

In case of a failure, Flink needs to restart the job from the latest
checkpoint to ensure consistency. In this kind of environment you should
be OK with replaying roughly one checkpoint interval's worth of data (and
you can adjust the checkpointing interval to bound that).
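
For reference, here is a minimal sketch of tuning that interval via the
DataStream API; the class name and the 30s / 10s values are just
illustrative assumptions, so pick whatever matches your replay tolerance
and checkpointing overhead budget:

  import org.apache.flink.streaming.api.CheckpointingMode;
  import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

  public class CheckpointTuning {
      public static void main(String[] args) throws Exception {
          StreamExecutionEnvironment env =
                  StreamExecutionEnvironment.getExecutionEnvironment();
          // Take an exactly-once checkpoint every 30 seconds (illustrative value;
          // a shorter interval means less data to replay after a restart, at the
          // cost of more checkpointing overhead).
          env.enableCheckpointing(30_000L, CheckpointingMode.EXACTLY_ONCE);
          // Keep at least 10 seconds between the end of one checkpoint and the
          // start of the next, so the job isn't checkpointing back-to-back.
          env.getCheckpointConfig().setMinPauseBetweenCheckpoints(10_000L);
          // ... build and execute the job as usual.
      }
  }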

Still, it would be worth looking into why these disruptions happen and
fixing the cause. Even though you should be able to recover from these
types of failures, that doesn't mean it's a good idea to do so more often
than necessary :) If you describe the pod / StatefulSet, you should see the
K8s events that resulted in the container being terminated.
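
For example (the namespace and resource names below are placeholders for
whatever your operator created):

  kubectl -n <namespace> describe pod <taskmanager-pod>
  kubectl -n <namespace> describe statefulset <taskmanager-statefulset>
  kubectl -n <namespace> get events --sort-by=.lastTimestamp

The Events section at the bottom of the describe output usually tells you
whether it was a node drain, an eviction, the descheduler, or something
else.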

Also, we're currently working on several efforts to make the restart
experience smoother and the checkpointing interval shorter (e.g. FLIP-198
[1], FLINK-25277 [2], FLIP-158 [3], ...).

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-198%3A+Working+directory+for+Flink+processes
[2] https://issues.apache.org/jira/browse/FLINK-25277
[3]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-158%3A+Generalized+incremental+checkpoints

Best,
D.

On Tue, Jan 4, 2022 at 7:23 PM Tianyi Deng <td...@blizzard.com> wrote:

> Hello Flink community,
>
>
>
> We have a Flink cluster deployed to AWS EKS along with many other
> applications. This cluster is managed by Spotify’s Flink operator. After
> deployment I noticed that the StatefulSet pods of the job manager and task
> managers intermittently received *SIGTERM* to terminate themselves. I
> assume this has something to do with voluntary pod disruption from K8s’s
> descheduler, perhaps because of node draining as other applications’ pods
> scale up and down, or for other reasons. It seems like this is inevitable,
> as K8s usually moves pods here and there, but it causes the Flink job to
> restart every time. I feel this is quite unstable.
>
>
>
> Has anyone also seen this kind of voluntary pod disruption in a Flink
> cluster on K8s? Is there any best practice or recommendation for operating
> Flink on K8s?
>
>
>
> Thanks,
>
> Tianyi
>
