Maximilian Michels created FLINK-36717:
------------------------------------------
Summary: Add health check to detect tasks stuck in DEPLOYING state
Key: FLINK-36717
URL: https://issues.apache.org/jira/browse/FLINK-36717
Project: Flink
Issue Type: New Feature
Components: Kubernetes Operator
Reporter: Maximilian Michels
The operator has an opt-in feature for monitoring Flink cluster health.
To enable it, set kubernetes.operator.cluster.health-check.enabled: true.
If enabled, the ClusterHealthObserver, triggered by the ApplicationReconciler,
collects various health-related metrics from the Flink cluster, such as the
number of restarts, the last restart timestamp, the number of completed
checkpoints, and the last completed checkpoint timestamp.
The ClusterHealthEvaluator then analyzes this information to determine whether
the Flink cluster is healthy or not.
Recently, users have reported an issue where tasks on some TaskManagers get
stuck in the DEPLOYING state due to a faulty network connection, which causes
extremely slow TCP reads while fetching the user jar from S3. Restarting the
TaskManager pods resolves this issue.
The goal of this ticket is to add a feature to the operator that automatically
restarts TaskManagers that have tasks stuck in the DEPLOYING state. To achieve
this, we can monitor how long tasks remain in the DEPLOYING state and restart
the TaskManagers after a configurable timeout. We must be careful not to
include jobs with large state restores, which can take a long time.
Fortunately, the task state is INITIALIZING during state restoration, which
makes it easy to distinguish from DEPLOYING, where the task is still being set
up.
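
The timeout check described above could be sketched roughly as follows. This
is only an illustration, not operator code: the class DeployingTimeoutTracker,
its methods, and the reduced TaskState enum are hypothetical names invented for
this sketch, and it assumes the observer can supply a task ID and state on each
reconciliation pass. A task's timer starts when it is first seen in DEPLOYING
and is cleared as soon as it reaches any other state, so tasks in INITIALIZING
(state restore) are never flagged.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch; not part of the operator's actual code base. */
class DeployingTimeoutTracker {

    /** Reduced set of task states, named as in the ticket (assumption). */
    enum TaskState { DEPLOYING, INITIALIZING, RUNNING }

    private final Duration timeout;

    // Task ID -> first time the task was observed in DEPLOYING.
    private final Map<String, Instant> deployingSince = new HashMap<>();

    DeployingTimeoutTracker(Duration timeout) {
        this.timeout = timeout;
    }

    /** Records one observation of a task's state at the given time. */
    void observe(String taskId, TaskState state, Instant now) {
        if (state == TaskState.DEPLOYING) {
            // Keep the earliest DEPLOYING timestamp across observations.
            deployingSince.putIfAbsent(taskId, now);
        } else {
            // Any other state, including INITIALIZING, resets the timer.
            deployingSince.remove(taskId);
        }
    }

    /** Returns the IDs of tasks that have been DEPLOYING longer than the timeout. */
    Set<String> stuckTasks(Instant now) {
        Set<String> stuck = new HashSet<>();
        for (Map.Entry<String, Instant> entry : deployingSince.entrySet()) {
            Duration elapsed = Duration.between(entry.getValue(), now);
            if (elapsed.compareTo(timeout) > 0) {
                stuck.add(entry.getKey());
            }
        }
        return stuck;
    }
}
```

Mapping the stuck task IDs back to TaskManager pods and triggering the restart
is left out here; how that would be wired into the ClusterHealthObserver and
ClusterHealthEvaluator is an open design question for this ticket.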
--
This message was sent by Atlassian Jira
(v8.20.10#820010)