[ 
https://issues.apache.org/jira/browse/FLINK-38821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated FLINK-38821:
-----------------------------------
    Labels: pull-request-available  (was: )

> Add TaskManager creation/deletion metrics to Active Resource Manager
> --------------------------------------------------------------------
>
>                 Key: FLINK-38821
>                 URL: https://issues.apache.org/jira/browse/FLINK-38821
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Metrics
>         Environment: flink 1.15
> java 8
> k8s 1.30.12
>            Reporter: moonyoung
>            Priority: Minor
>              Labels: pull-request-available
>
> h2. Background / Problem Description
> We observed an issue in a Kubernetes-based Flink deployment where 
> {*}TaskManagers are repeatedly created and deleted in a short period of 
> time{*}.
> This behavior appears as a rapid loop of TaskManager pod creation and 
> termination. Based on our investigation, this is likely caused by a 
> Kubernetes-related issue rather than Flink application logic.
>  
> During this incident, TaskManagers failed to register with the JobManager and 
> exited early with the following error:
> {quote}The environment variable _POD_NODE_ID is not set, which is used to identify 
> the node where the task manager is located.{quote}
> This error indicates that the TaskManager could not obtain the node name from 
> the Kubernetes Downward API.
> As a result:
>  * The TaskManager process terminates during initialization
>  * The TaskManager never registers with the JobManager
>  * The Active Resource Manager considers the worker as “pending but 
> unregistered”
> This strongly suggests a transient or intermittent Kubernetes issue (e.g., 
> Downward API failure, pod startup race, or node-related instability).
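
For reference, the node name reaching that environment variable is typically injected through the Kubernetes Downward API along these lines (a sketch for illustration; the pod spec Flink actually generates may differ):

```yaml
# Illustrative sketch: injecting the node name via the Downward API.
# If this fieldRef resolution fails or races at pod startup, the
# env var is empty and the TaskManager exits during initialization.
env:
  - name: _POD_NODE_ID
    valueFrom:
      fieldRef:
        fieldPath: spec.nodeName
```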
>  
> h2. Suggestion
> Currently, the Active Resource Manager only exposes {{pendingWorkerCounter}} and 
> {{totalWorkerCounter}}. Both are {_}gauge-like state metrics{_}, so they do not 
> reflect:
>  * How frequently TaskManagers are being created
>  * How frequently TaskManagers are being removed or failing before 
> registration
> In high-churn scenarios, this makes it difficult to:
>  * Detect instability caused by Kubernetes
>  * Distinguish between “healthy but busy” and “unhealthy and flapping” 
> clusters
>  * Perform root cause analysis based on metrics alone
> So I propose adding *counter-based metrics* to the Active Resource Manager, 
> such as:
>  * *TaskManager creation counter*
>  ** Incremented whenever a TaskManager is requested / launched
>  * *TaskManager removal (or failure) counter*
>  ** Incremented whenever a pending or running TaskManager is removed before 
> normal shutdown
> These metrics would complement existing counters and provide *temporal 
> visibility* into TaskManager churn.
> If adding these metrics sounds reasonable, I'd be happy to implement them. Thanks.
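
A minimal sketch of the proposed counters. The class, metric names, and the `Counter` stand-in below are illustrative assumptions, not Flink's actual `org.apache.flink.metrics` API or the real `ActiveResourceManager` code:

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of counter-based churn metrics for an ActiveResourceManager-style
// component. Counter is a minimal stand-in for Flink's Counter interface;
// all names here are hypothetical.
public class WorkerChurnMetrics {

    static final class Counter {
        private final AtomicLong count = new AtomicLong();
        void inc() { count.incrementAndGet(); }
        long getCount() { return count.get(); }
    }

    // Monotonic counters: unlike the existing gauges, they preserve
    // how often workers were created and lost over time.
    final Counter numWorkersRequested = new Counter();
    final Counter numWorkersTerminatedBeforeRegistration = new Counter();

    void onWorkerRequested() {
        numWorkersRequested.inc();
    }

    void onWorkerTerminated(boolean wasRegistered) {
        if (!wasRegistered) {
            // Pod died before registering with the JobManager, e.g. the
            // _POD_NODE_ID Downward API failure described in this issue.
            numWorkersTerminatedBeforeRegistration.inc();
        }
    }

    public static void main(String[] args) {
        WorkerChurnMetrics m = new WorkerChurnMetrics();
        // Simulate a flapping loop: three pods requested, two die unregistered.
        m.onWorkerRequested(); m.onWorkerTerminated(false);
        m.onWorkerRequested(); m.onWorkerTerminated(false);
        m.onWorkerRequested(); m.onWorkerTerminated(true);
        System.out.println("requested=" + m.numWorkersRequested.getCount()
            + " diedUnregistered="
            + m.numWorkersTerminatedBeforeRegistration.getCount());
    }
}
```

A monitoring system scraping these counters could then alert on their rate, distinguishing a flapping cluster from a healthy but busy one.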



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
