Hello Flink community,

We have a Flink cluster deployed on AWS EKS alongside many other applications. 
The cluster is managed by Spotify's Flink operator. After deployment I noticed 
that the stateful JobManager and TaskManager pods intermittently receive 
SIGTERM and terminate. I assume this is caused by voluntary pod disruption, 
e.g. the K8s descheduler or node draining triggered by other applications 
scaling up and down. Some amount of pod movement seems unavoidable on K8s, 
but every disruption causes the Flink job to restart, which makes the setup 
feel quite unstable.

Has anyone else seen this kind of voluntary pod disruption in a Flink cluster 
on K8s? Are there any best practices or recommendations for operating Flink on 
K8s?

Thanks,
Tianyi
