[
https://issues.apache.org/jira/browse/FLINK-38375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
xiaolong3817 updated FLINK-38375:
---------------------------------
Description:
Hello, when using Flink on Kubernetes (K8s), we encountered an issue where
*TaskManager (TM) resources failed to be released* following an abnormal
shutdown of the JobManager (JM). The abnormal JM shutdown scenarios include
Out-of-Memory (OOM) errors, forced termination via the {{kill -9}} command, or
physical machine failures.
Currently, our Flink cluster operates in *Application Mode* on K8s. This
resource leakage issue (failure to release TM resources) can be *100%
reproduced* whenever the JM shuts down abnormally.
* Flink version in use: *1.20.2*
Our current relevant configurations are as follows:
* {{jobmanager.scheduler: Adaptive}}
* {{resourcemanager.previous-worker.recovery.timeout: 0}}
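For anyone trying to reproduce this, a small check like the following can confirm that a cluster's configuration matches the two settings listed above. This is only a sketch: it assumes the configuration is available as flat "key: value" lines, it does not handle nested YAML or trailing inline comments, and the helper names parse_flat_config and matches_reported_setup are hypothetical and not part of Flink.

```python
from typing import Dict

# The two settings listed in this report. Values are compared
# case-insensitively, so "Adaptive" and "adaptive" both match.
REPORTED_SETTINGS = {
    "jobmanager.scheduler": "adaptive",
    "resourcemanager.previous-worker.recovery.timeout": "0",
}


def parse_flat_config(text: str) -> Dict[str, str]:
    """Parse flat 'key: value' lines, skipping blank lines and '#' comments."""
    conf: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        # Split on the first colon only, so values may contain colons.
        key, value = line.split(":", 1)
        conf[key.strip()] = value.strip()
    return conf


def matches_reported_setup(text: str) -> bool:
    """Return True if both reported settings are present with the reported values."""
    conf = parse_flat_config(text)
    return all(
        conf.get(key, "").lower() == expected
        for key, expected in REPORTED_SETTINGS.items()
    )
```

Calling matches_reported_setup on the contents of the cluster's configuration file returns True only when both options are set as in this report, which helps rule out configuration drift before attempting the abnormal JM shutdown.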
This is my analysis document:
[https://docs.google.com/document/d/1-8Iz-oOkW5YozebjqVOvmmHvVVLVnUA7qcu8GVeNnv4/edit?usp=sharing]
was:
Hello, when using Flink on Kubernetes (K8s), we encountered an issue where
*TaskManager (TM) resources failed to be released* following an abnormal
shutdown of the JobManager (JM). The abnormal JM shutdown scenarios include
*Out-of-Memory (OOM) errors, forced termination via the {{kill -9}} command,
or physical machine failures*.
Currently, our Flink cluster operates in *Application Mode* on K8s. This
resource leakage issue (failure to release TM resources) can be *100%
reproduced* whenever the JM shuts down abnormally.
* Flink version in use: *1.20.2*
Our current relevant configurations are as follows:
* {{jobmanager.scheduler: Adaptive}} (Adaptive Scheduler is enabled for the
JobManager)
* {{resourcemanager.previous-worker.recovery.timeout: 0}} (The timeout for
recovering previously running workers (TMs) by the ResourceManager is set to 0,
meaning immediate timeout/no recovery)
This is my analysis document:
https://docs.google.com/document/d/1-8Iz-oOkW5YozebjqVOvmmHvVVLVnUA7qcu8GVeNnv4/edit?usp=sharing
> Flink Resource leak on Flink k8s
> --------------------------------
>
> Key: FLINK-38375
> URL: https://issues.apache.org/jira/browse/FLINK-38375
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Task
> Affects Versions: 1.20.2
> Reporter: xiaolong3817
> Priority: Major
>
> Hello, when using Flink on Kubernetes (K8s), we encountered an issue where
> *TaskManager (TM) resources failed to be released* following an abnormal
> shutdown of the JobManager (JM). The abnormal JM shutdown scenarios include
> Out-of-Memory (OOM) errors, forced termination via the {{kill -9}} command,
> or physical machine failures.
> Currently, our Flink cluster operates in *Application Mode* on K8s. This
> resource leakage issue (failure to release TM resources) can be *100%
> reproduced* whenever the JM shuts down abnormally.
> * Flink version in use: *1.20.2*
> Our current relevant configurations are as follows:
> * {{jobmanager.scheduler: Adaptive}}
> * {{resourcemanager.previous-worker.recovery.timeout: 0}}
> This is my analysis document:
> [https://docs.google.com/document/d/1-8Iz-oOkW5YozebjqVOvmmHvVVLVnUA7qcu8GVeNnv4/edit?usp=sharing]
--
This message was sent by Atlassian Jira
(v8.20.10#820010)