[ 
https://issues.apache.org/jira/browse/FLINK-38252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Baozhu Zhao updated FLINK-38252:
--------------------------------
    Description: 
Our Flink job is deployed on Kubernetes (k8s).

The SRE of the k8s cluster periodically cleans up pending pods, but Flink does 
not handle the deletion event for a pending pod. As a result, the Flink job 
never requests replacement pods and ultimately fails due to insufficient 
resources.

This problem can be reproduced on a small k8s cluster.
For example, suppose the cluster has only 10 CPU cores in total and the Flink 
job configuration requests four 5-core pods. If the pending pods are deleted 
before the job's resource request times out, the ResourceManager will not 
request new pods.

  was:
Our Flink job is deployed on k8s.

The SRE of the k8s cluster periodically cleans up pending pods, but Flink does 
not handle the delete pending pod event, resulting in Flink jobs never applying 
for new pods and ultimately failing due to insufficient resources.

    Environment:     (was: flink version : 1.17

This problem can be replicated using a small k8s cluster.
For example, if the k8s cluster only has a total of 10 core CPUs, Flink job 
configuration requests four 5-core pods, and actively deletes the pending pods 
before the job resource request timeout, the ResourceManager will not apply for 
new pods.)

> ResourceManager will not apply for a new pod when pending pod is deleted
> ------------------------------------------------------------------------
>
>                 Key: FLINK-38252
>                 URL: https://issues.apache.org/jira/browse/FLINK-38252
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / Kubernetes
>    Affects Versions: 1.17.2, 1.19.3, 2.1.0
>            Reporter: Baozhu Zhao
>            Priority: Minor
>
> Our Flink job is deployed on Kubernetes (k8s).
>
> The SRE of the k8s cluster periodically cleans up pending pods, but Flink 
> does not handle the deletion event for a pending pod. As a result, the Flink 
> job never requests replacement pods and ultimately fails due to insufficient 
> resources.
>
> This problem can be reproduced on a small k8s cluster.
> For example, suppose the cluster has only 10 CPU cores in total and the 
> Flink job configuration requests four 5-core pods. If the pending pods are 
> deleted before the job's resource request times out, the ResourceManager 
> will not request new pods.
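
The arithmetic of the reproduction scenario above can be sketched with a toy 
model. This is only an illustration of the reported behavior, not Flink's 
ResourceManager or the Kubernetes scheduler: ToyCluster, simulate, and the 
rerequest_on_delete flag are hypothetical names introduced here, and the 
model assumes pods are scheduled strictly in request order.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCluster:
    """Hypothetical stand-in for a small k8s cluster (CPU cores only)."""
    total_cores: int
    running: list = field(default_factory=list)   # cores of scheduled pods
    pending: list = field(default_factory=list)   # cores of unschedulable pods

    def request_pod(self, cores: int) -> None:
        # A pod runs if it fits in the remaining cores, otherwise it stays pending.
        if sum(self.running) + cores <= self.total_cores:
            self.running.append(cores)
        else:
            self.pending.append(cores)

    def delete_pending(self) -> int:
        # Models the periodic cleanup of pending pods; returns how many were deleted.
        deleted = len(self.pending)
        self.pending.clear()
        return deleted


def simulate(total_cores=10, pods_wanted=4, cores_per_pod=5,
             rerequest_on_delete=False):
    """Request pods, delete the pending ones, optionally request replacements."""
    cluster = ToyCluster(total_cores)
    for _ in range(pods_wanted):
        cluster.request_pod(cores_per_pod)

    deleted = cluster.delete_pending()

    # Assumption: a resource manager that reacts to the deletion would
    # issue one replacement request per deleted pending pod.
    if rerequest_on_delete:
        for _ in range(deleted):
            cluster.request_pod(cores_per_pod)

    running, pending = len(cluster.running), len(cluster.pending)
    return {"running": running, "pending": pending,
            "missing": pods_wanted - running - pending}


if __name__ == "__main__":
    print(simulate())                          # behavior as reported
    print(simulate(rerequest_on_delete=True))  # hypothetical reaction to deletion
```

With the numbers from the report, only two of the four 5-core pods fit into 
10 cores, so two stay pending. Once those are deleted and nothing requests 
replacements, the model is left with two pods that are neither running nor 
pending, which corresponds to the job eventually failing for lack of 
resources.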



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
