[
https://issues.apache.org/jira/browse/FLINK-38375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
xiaolong3817 updated FLINK-38375:
---------------------------------
Affects Version/s: 1.20.2
Description:
Hello, when using Flink on Kubernetes (K8s), we encountered an issue where
*TaskManager (TM) resources are not released* after an abnormal shutdown of
the JobManager (JM). The abnormal JM shutdown scenarios include
{*}Out-of-Memory (OOM) errors, forced termination via the {{kill -9}} command,
or physical machine failures{*}.

Currently, our Flink cluster operates in *Application Mode* on K8s. This
resource leak (failure to release TM resources) is *100% reproducible*
whenever the JM shuts down abnormally.
* Flink version in use: *1.20.2*

Our current relevant configurations are as follows:
* {{jobmanager.scheduler: Adaptive}} (Adaptive Scheduler is enabled for the
JobManager)
* {{resourcemanager.previous-worker.recovery.timeout: 0}} (The timeout for
recovering previously running workers (TMs) by the ResourceManager is set to 0,
meaning immediate timeout/no recovery)
> Flink Resource leak on Flink k8s
> --------------------------------
>
> Key: FLINK-38375
> URL: https://issues.apache.org/jira/browse/FLINK-38375
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Task
> Affects Versions: 1.20.2
> Reporter: xiaolong3817
> Priority: Major
>
> Hello, when using Flink on Kubernetes (K8s), we encountered an issue where
> *TaskManager (TM) resources are not released* after an abnormal shutdown of
> the JobManager (JM). The abnormal JM shutdown scenarios include
> {*}Out-of-Memory (OOM) errors, forced termination via the {{kill -9}}
> command, or physical machine failures{*}.
>
> Currently, our Flink cluster operates in *Application Mode* on K8s. This
> resource leak (failure to release TM resources) is *100% reproducible*
> whenever the JM shuts down abnormally.
> * Flink version in use: *1.20.2*
>
> Our current relevant configurations are as follows:
> * {{jobmanager.scheduler: Adaptive}} (Adaptive Scheduler is enabled for the
> JobManager)
> * {{resourcemanager.previous-worker.recovery.timeout: 0}} (The timeout for
> recovering previously running workers (TMs) by the ResourceManager is set to
> 0, meaning immediate timeout/no recovery)
--
This message was sent by Atlassian Jira
(v8.20.10#820010)