[
https://issues.apache.org/jira/browse/FLINK-38821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated FLINK-38821:
-----------------------------------
Labels: pull-request-available (was: )
> Add TaskManager creation/deletion metrics to Active Resource Manager
> --------------------------------------------------------------------
>
> Key: FLINK-38821
> URL: https://issues.apache.org/jira/browse/FLINK-38821
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Metrics
> Environment: flink 1.15
> java 8
> k8s 1.30.12
> Reporter: moonyoung
> Priority: Minor
> Labels: pull-request-available
>
> h2. Background / Problem Description
> We observed an issue in a Kubernetes-based Flink deployment where
> {*}TaskManagers are repeatedly created and deleted in a short period of
> time{*}.
> This behavior appears as a rapid loop of TaskManager pod creation and
> termination. Based on our investigation, this is likely caused by a
> Kubernetes-related issue rather than Flink application logic.
>
> During this incident, TaskManagers failed to register with the JobManager and
> exited early with the following error:
> {{The environment variable _POD_NODE_ID is not set, which is used to identify the node where the task manager is located.}}
> This error indicates that the TaskManager could not obtain the node name from
> the Kubernetes Downward API.
> As a result:
> * The TaskManager process terminates during initialization
> * The TaskManager never registers with the JobManager
> * The Active Resource Manager considers the worker as “pending but
> unregistered”
> This strongly suggests a transient or intermittent Kubernetes issue (e.g.,
> Downward API failure, pod startup race, or node-related instability).
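> For context, a node name is normally injected into a pod through a Downward API environment entry in the pod spec; if that entry is missing or the API misbehaves during pod startup, the check above fails. A generic sketch of such an entry (standard Kubernetes, not Flink's exact pod template):

```
env:
  - name: _POD_NODE_ID
    valueFrom:
      fieldRef:
        fieldPath: spec.nodeName
```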
>
> h2. Suggestion
> Currently, the ActiveResourceManager exposes only {{pendingWorkerCounter}} and
> {{totalWorkerCounter}}. These are {_}gauge-like state metrics{_}; they do not
> reflect:
> * How frequently TaskManagers are being created
> * How frequently TaskManagers are being removed or failing before
> registration
> In high-churn scenarios, this makes it difficult to:
> * Detect instability caused by Kubernetes
> * Distinguish between “healthy but busy” and “unhealthy and flapping”
> clusters
> * Perform root cause analysis based on metrics alone
> So I propose adding *counter-based metrics* to the Active Resource Manager,
> such as:
> * *TaskManager creation counter*
> ** Incremented whenever a TaskManager is requested / launched
> * *TaskManager removal (or failure) counter*
> ** Incremented whenever a pending or running TaskManager is removed before
> normal shutdown
> These metrics would complement existing counters and provide *temporal
> visibility* into TaskManager churn.
> If adding these metrics sounds reasonable, I'd be happy to implement them. Thanks.
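> To illustrate, a minimal self-contained sketch of the proposed counters. The metric names ({{numTaskManagersCreated}} / {{numTaskManagersRemoved}}) and the callback names are hypothetical; the inner class simply mirrors the monotonic behavior of Flink's {{org.apache.flink.metrics.SimpleCounter}} so the example runs without Flink on the classpath:

```java
// Sketch of counter-based churn metrics for the Active Resource Manager.
// Metric and method names below are hypothetical illustrations, not Flink API.
public class WorkerChurnMetrics {

    // Mirrors org.apache.flink.metrics.SimpleCounter: a monotonically
    // increasing count, as opposed to the existing gauge-like state metrics.
    static final class Counter {
        private long count;
        void inc() { count++; }
        long getCount() { return count; }
    }

    final Counter numTaskManagersCreated = new Counter();
    final Counter numTaskManagersRemoved = new Counter();

    // Would be called whenever a TaskManager is requested / launched.
    void onWorkerRequested() { numTaskManagersCreated.inc(); }

    // Would be called whenever a pending or running TaskManager is removed
    // before normal shutdown (e.g. pod deleted before registration).
    void onWorkerTerminatedBeforeShutdown() { numTaskManagersRemoved.inc(); }

    public static void main(String[] args) {
        WorkerChurnMetrics m = new WorkerChurnMetrics();
        // Simulate a churn loop: three pods launched, two die before registering.
        for (int i = 0; i < 3; i++) m.onWorkerRequested();
        for (int i = 0; i < 2; i++) m.onWorkerTerminatedBeforeShutdown();
        System.out.println(m.numTaskManagersCreated.getCount() + " created, "
                + m.numTaskManagersRemoved.getCount() + " removed");
    }
}
```

> Because these counters only ever increase, their rate over time (e.g. via a Prometheus {{rate()}} query) directly exposes churn that the existing gauges cannot show.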
--
This message was sent by Atlassian Jira
(v8.20.10#820010)