Hello Flink community,

We have a Flink cluster deployed to AWS EKS alongside many other applications, managed by Spotify's Flink operator. After deployment I noticed that the StatefulSet pods of the JobManager and TaskManagers intermittently receive SIGTERM and terminate. I assume this is caused by voluntary pod disruption, e.g. from the K8s descheduler or node draining as other applications' pods scale up and down. This seems hard to avoid, since K8s routinely moves pods between nodes, but it causes the Flink job to restart every time, which feels quite unstable.
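For context, one mitigation I am considering is a PodDisruptionBudget to block voluntary evictions of the JobManager. The sketch below is only an assumption on my part: the label selector is hypothetical and would need to match whatever labels the operator actually puts on the pods.

```yaml
# Hypothetical PDB: disallow voluntary evictions of JobManager pods.
# NOTE: the matchLabels values below are assumptions, not the labels
# the Spotify Flink operator necessarily applies.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: flink-jobmanager-pdb
spec:
  maxUnavailable: 0
  selector:
    matchLabels:
      app: flink
      component: jobmanager
```

I am not sure whether this is the recommended approach for Flink, or whether it just delays node drains.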
Has anyone else seen this kind of voluntary pod disruption in a Flink cluster on K8s? Are there any best practices or recommendations for operating Flink on K8s?

Thanks,
Tianyi