[jira] [Updated] (FLINK-38375) Flink Resouce leak on Flink k8s

xiaolong3817 (Jira) Wed, 17 Sep 2025 22:58:06 -0700


     [ 
https://issues.apache.org/jira/browse/FLINK-38375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


xiaolong3817 updated FLINK-38375:
---------------------------------
    Description: 
Hello, when using Flink on Kubernetes (K8s), we encountered an issue where 
*TaskManager (TM) resources failed to be released* following an abnormal 
shutdown of the JobManager (JM). The abnormal JM shutdown scenarios include 
Out-of-Memory (OOM) errors, forced termination via the {{kill -9}} command, or 
physical machine failures.

Currently, our Flink cluster operates in *Application Mode* on K8s. This 
resource leakage issue (failure to release TM resources) can be *100% 
reproduced* whenever the JM shuts down abnormally.
 * Flink version in use: *1.20.2*

Our current relevant configurations are as follows:
 * {{jobmanager.scheduler: Adaptive}}
 * {{resourcemanager.previous-worker.recovery.timeout: 0}} 
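
For reference, a minimal sketch of how these two settings might look in the Flink configuration file. The file name and the flat-key layout are assumptions about our deployment (the same keys can also be passed as {{-D}} dynamic properties at submission time); only the two keys and their values are taken from the list above:

```yaml
# Assumed location: conf/config.yaml (or the legacy flink-conf.yaml).
# Enables the Adaptive Scheduler for the JobManager.
jobmanager.scheduler: Adaptive
# Timeout for recovering previously running workers (TMs); 0 means
# immediate timeout, i.e. the ResourceManager does not wait for them.
resourcemanager.previous-worker.recovery.timeout: 0
```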

This is my analysis document:

[https://docs.google.com/document/d/1-8Iz-oOkW5YozebjqVOvmmHvVVLVnUA7qcu8GVeNnv4/edit?usp=sharing]

  was:
Hello, when using Flink on Kubernetes (K8s), we encountered an issue where 
*TaskManager (TM) resources failed to be released* following an abnormal 
shutdown of the JobManager (JM). The abnormal JM shutdown scenarios include 
*Out-of-Memory (OOM) errors, forced termination via the {{kill -9}} command, 
or physical machine failures*.

Currently, our Flink cluster operates in *Application Mode* on K8s. This 
resource leakage issue (failure to release TM resources) can be *100% 
reproduced* whenever the JM shuts down abnormally.
 * Flink version in use: *1.20.2*

Our current relevant configurations are as follows:
 * {{jobmanager.scheduler: Adaptive}} (Adaptive Scheduler is enabled for the 
JobManager)
 * {{resourcemanager.previous-worker.recovery.timeout: 0}} (The timeout for 
recovering previously running workers (TMs) by the ResourceManager is set to 0, 
meaning immediate timeout/no recovery)

This is my analysis document:

https://docs.google.com/document/d/1-8Iz-oOkW5YozebjqVOvmmHvVVLVnUA7qcu8GVeNnv4/edit?usp=sharing


> Flink Resouce leak on Flink k8s
> -------------------------------
>
>                 Key: FLINK-38375
>                 URL: https://issues.apache.org/jira/browse/FLINK-38375
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Task
>    Affects Versions: 1.20.2
>            Reporter: xiaolong3817
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
