Canbin Zheng created FLINK-17176:
------------------------------------
Summary: Slow down Pod recreation in
KubernetesResourceManager#PodCallbackHandler
Key: FLINK-17176
URL: https://issues.apache.org/jira/browse/FLINK-17176
Project: Flink
Issue Type: Improvement
Components: Deployment / Kubernetes
Affects Versions: 1.10.0
Reporter: Canbin Zheng
Fix For: 1.11.0
In native K8s setups, there are cases where we do not control the rate of Pod
re-creation, so the {{PodCallbackHandler}} implementation in
{{KubernetesResourceManager}} risks flooding the K8s API Server.
Here are the steps to reproduce this kind of problem:
# Mount {{/opt/flink/log}} in the TaskManager Container to a path on the K8s
nodes via HostPath; make sure that the path exists but the TaskManager process
has no write permission. We can achieve this via the user-specified pod
template support, or just hardcode it for testing only.
# Launch a session cluster.
# Submit a new job to the session cluster. As expected, the Pod repeatedly
fails shortly after it starts launching the main Container;
{{KubernetesResourceManager#onModified}} is then invoked and creates a
replacement Pod immediately, without any rate control.
To sum up, once the {{KubernetesResourceManager}} receives the Pod *ADD* event
and that Pod terminates before successfully registering with the
{{KubernetesResourceManager}}, the {{KubernetesResourceManager}} immediately
sends another creation request to the K8s API Server.
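
One possible way to slow this down, shown only as a rough sketch: keep a count of consecutive Pods that terminated before registration and delay the next creation request with a capped exponential backoff. The class {{PodRecreationBackoff}}, its method names, and the delay values are hypothetical and not part of Flink; how the returned delay would be scheduled inside {{KubernetesResourceManager}} is left open.

```java
import java.time.Duration;

/**
 * Hypothetical helper (not part of Flink): computes how long to wait before
 * sending the next Pod creation request after a Pod failed before registering.
 */
final class PodRecreationBackoff {
    private final Duration initialDelay;
    private final Duration maxDelay;
    private int consecutiveFailures = 0;

    PodRecreationBackoff(Duration initialDelay, Duration maxDelay) {
        if (initialDelay.isZero() || initialDelay.isNegative()) {
            throw new IllegalArgumentException("initialDelay must be positive");
        }
        if (maxDelay.compareTo(initialDelay) < 0) {
            throw new IllegalArgumentException("maxDelay must be >= initialDelay");
        }
        this.initialDelay = initialDelay;
        this.maxDelay = maxDelay;
    }

    /**
     * To be called when a Pod terminates before its TaskManager registered.
     * Returns the delay to wait before the next creation request; the delay
     * doubles with each consecutive failure until it reaches maxDelay.
     */
    Duration onPodFailedBeforeRegistration() {
        long delayMillis = initialDelay.toMillis();
        long maxMillis = maxDelay.toMillis();
        for (int i = 0; i < consecutiveFailures && delayMillis < maxMillis; i++) {
            // Cap before doubling so the multiplication cannot overflow.
            delayMillis = delayMillis > maxMillis / 2 ? maxMillis : delayMillis * 2;
        }
        consecutiveFailures++;
        return Duration.ofMillis(delayMillis);
    }

    /** To be called when a TaskManager registers successfully; resets the backoff. */
    void onTaskManagerRegistered() {
        consecutiveFailures = 0;
    }
}
```

With an initial delay of 1 second and a cap of 5 seconds, consecutive failures would be spaced 1s, 2s, 4s, 5s, 5s apart, and a successful registration would bring the delay back to 1s. A fixed delay or a token-bucket limit on creation requests would be alternative designs; the backoff is used here only because it stays fast for a single transient failure.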
--
This message was sent by Atlassian Jira
(v8.3.4#803005)