Yuan Huang created FLINK-33880:
-----------------------------------
Summary: Introducing Retry Mechanism for Listing TaskManager Pods
to Prevent API Server Connection Failures
Key: FLINK-33880
URL: https://issues.apache.org/jira/browse/FLINK-33880
Project: Flink
Issue Type: Improvement
Components: Deployment / Kubernetes
Affects Versions: 1.17.2
Reporter: Yuan Huang
Attachments: image-2023-12-19-18-41-41-308.png,
image-2023-12-19-18-44-13-623.png
In Kubernetes mode, when the JobManager restarts, it connects to the API server
to list all of its TaskManager Pods so that it can recover the previous
TaskManagers.
In a large Kubernetes cluster with potentially thousands of concurrently running
jobs, all JobManagers may restart at the same time and then connect to the API
server (e.g., during disaster recovery). This burst of requests can push the API
server to its maximum capacity, causing it to refuse some JobManager requests.
The affected JobManagers may then fail and attempt to reconnect to the API
server.
!image-2023-12-19-18-44-13-623.png|width=505,height=206!
To improve this, I propose adding a retry mechanism. When a connection attempt
to the API server fails, Flink would wait for a period before retrying, which
reduces the risk of overwhelming the server and improves the overall resilience
of the system.
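
A minimal sketch of what such a retry could look like, to make the proposal
concrete. This is not existing Flink code: the class RetryWithBackoff, its
methods, and all parameters are hypothetical names chosen for illustration. The
issue only asks for a waiting period between attempts; the exponential growth
and the random jitter are assumptions added here so that many JobManagers
restarting together do not retry at the same instant.

```java
import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical helper, not part of Flink: retries an action with capped exponential backoff. */
final class RetryWithBackoff {

    private RetryWithBackoff() {}

    /**
     * Upper bound on the wait after failed attempt number {@code attempt} (1-based):
     * the initial delay doubled (attempt - 1) times, capped at {@code max}.
     */
    static long backoffCeilingMillis(int attempt, Duration initial, Duration max) {
        long cap = max.toMillis();
        long delay = initial.toMillis();
        for (int i = 1; i < attempt && delay < cap; i++) {
            delay *= 2;
        }
        return Math.min(delay, cap);
    }

    /**
     * Runs {@code action} up to {@code maxAttempts} times. After each failure except
     * the last, sleeps for a random time between 0 and the backoff ceiling.
     * If every attempt fails, the last exception is rethrown.
     */
    static <T> T callWithRetry(
            Callable<T> action, int maxAttempts, Duration initial, Duration max)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (InterruptedException e) {
                // Do not retry on interruption; restore the flag and give up.
                Thread.currentThread().interrupt();
                throw e;
            } catch (Exception e) {
                last = e;
                if (attempt == maxAttempts) {
                    break;
                }
                long ceiling = backoffCeilingMillis(attempt, initial, max);
                // Random jitter spreads out retries from many clients.
                long sleepMillis = ThreadLocalRandom.current().nextLong(ceiling + 1);
                Thread.sleep(sleepMillis);
            }
        }
        throw last;
    }
}
```

The call that lists the TaskManager Pods would be passed as the action, for
example callWithRetry(() -> listTaskManagerPods(), 5, Duration.ofSeconds(1),
Duration.ofSeconds(30)), where listTaskManagerPods is a placeholder for the
existing listing logic and the numbers are arbitrary. In practice the attempt
count and delays would presumably be exposed as configuration options.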