[ 
https://issues.apache.org/jira/browse/FLINK-37773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nihar Rao updated FLINK-37773:
------------------------------
    Description: 
Hi,

We are running into a weird issue with Apache Flink Kubernetes Operator 1.10.0 
and Apache Flink 1.19.1. We run jobs in native Kubernetes application mode using 
the FlinkDeployment CRD. The affected job runs with 24 TaskManagers and 1 
JobManager replica, with HA enabled.
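
For reference, below is a minimal sketch of the kind of FlinkDeployment spec we 
use (image, resource values, and storage paths are illustrative placeholders, 
not our exact manifest):

apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: ioi-quality
spec:
  image: <our-job-image>                    # placeholder
  flinkVersion: v1_19
  serviceAccount: flink
  flinkConfiguration:
    taskmanager.numberOfTaskSlots: "1"
    high-availability.type: kubernetes
    high-availability.storageDir: s3://<ha-bucket>/ha    # placeholder
  jobManager:
    replicas: 1
    resource:
      memory: "2048m"
      cpu: 1
  taskManager:
    resource:
      memory: "4096m"
      cpu: 2
  job:
    jarURI: local:///opt/flink/usrlib/job.jar             # placeholder
    parallelism: 24    # 1 slot per TM -> 24 TaskManagers in application mode
    upgradeMode: last-state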

Below is a chronological summary of the events:

1. The job was initially started with 24 TaskManagers.

2. The JM pod was OOMKilled. This is confirmed by our KSM metrics, and {{kubectl 
describe pod <JM pod>}} also shows the pod was restarted due to OOM (see the 
commands after this list).

3. After the JM was OOMKilled, it was restarted and 24 new TaskManager pods were 
started, as confirmed by the available task slots section of the Flink UI.

4. There was no impact on the job (it restarted successfully), but there are now 
48 TaskManagers running, of which 24 are standby. The expected behaviour after a 
JM OOMKill with HA enabled is that no new TaskManagers are started, since the 
restarted JM should reconnect to the existing ones.

5. I have attached a screenshot of the Flink UI showing the 24 extra TMs (48 
task slots) and included the kubectl output below.
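
For reference, these are roughly the checks behind points 2 and 4 above (the JM 
pod name is the one from the kubectl listing further below):

# Restart reason of the JM pod; lastState.terminated.reason shows OOMKilled
kubectl describe pod ioi-quality-667f575877-btfkv
kubectl get pod ioi-quality-667f575877-btfkv \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'

# Number of running TaskManager pods; expected 24, observed 48
kubectl get pods | grep -c 'ioi-quality-taskmanager'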

I also checked the Kubernetes operator pod logs and did not find anything that 
could explain this behaviour. This has happened a few times now with different 
jobs. We have tried purposely OOMKilling the JobManager of one of our test jobs 
many times, but we have not been able to reproduce the behaviour. It looks to be 
an edge case that is difficult to reproduce.

Can you please help us with how to debug this, since the Kubernetes operator 
logs do not show any relevant information about why this happened? Thanks, and 
let me know if you need further information.
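
If it helps, we can collect and attach any of the following (the operator 
namespace and deployment name are illustrative for our setup):

# Operator logs around the JM restart window
kubectl -n flink-operator logs deploy/flink-kubernetes-operator \
  --since=48h | grep ioi-quality

# JM logs from before the OOMKill and after the restart
kubectl logs ioi-quality-667f575877-btfkv --previous
kubectl logs ioi-quality-667f575877-btfkv \
  | grep -iE 'recover|ResourceManager|Registering TaskExecutor'

# Kubernetes HA metadata (leader/checkpoint ConfigMaps) for the job
kubectl get configmaps | grep ioi-quality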

kubectl get pod output showing 24 extra TMs:
NAME                                      READY   STATUS    RESTARTS      AGE
ioi-quality-667f575877-btfkv              1/1     Running   1 (39h ago)   4d16h
ioi-quality-taskmanager-1-1               1/1     Running   0             4d16h
ioi-quality-taskmanager-1-10              1/1     Running   0             4d16h
ioi-quality-taskmanager-1-11              1/1     Running   0             4d16h
ioi-quality-taskmanager-1-12              1/1     Running   0             4d16h
ioi-quality-taskmanager-1-13              1/1     Running   0             4d16h
ioi-quality-taskmanager-1-14              1/1     Running   0             4d16h
ioi-quality-taskmanager-1-15              1/1     Running   0             4d16h
ioi-quality-taskmanager-1-16              1/1     Running   0             4d16h
ioi-quality-taskmanager-1-17              1/1     Running   0             4d16h
ioi-quality-taskmanager-1-18              1/1     Running   0             4d16h
ioi-quality-taskmanager-1-19              1/1     Running   0             4d16h
ioi-quality-taskmanager-1-2               1/1     Running   0             4d16h
ioi-quality-taskmanager-1-20              1/1     Running   0             4d16h
ioi-quality-taskmanager-1-21              1/1     Running   0             4d16h
ioi-quality-taskmanager-1-22              1/1     Running   0             4d16h
ioi-quality-taskmanager-1-23              1/1     Running   0             4d16h
ioi-quality-taskmanager-1-24              1/1     Running   0             4d16h
ioi-quality-taskmanager-1-3               1/1     Running   0             4d16h
ioi-quality-taskmanager-1-4               1/1     Running   0             4d16h
ioi-quality-taskmanager-1-5               1/1     Running   0             4d16h
ioi-quality-taskmanager-1-6               1/1     Running   0             4d16h
ioi-quality-taskmanager-1-7               1/1     Running   0             4d16h
ioi-quality-taskmanager-1-8               1/1     Running   0             4d16h
ioi-quality-taskmanager-1-9               1/1     Running   0             4d16h
ioi-quality-taskmanager-2-1               1/1     Running   0             39h
ioi-quality-taskmanager-2-10              1/1     Running   0             39h
ioi-quality-taskmanager-2-11              1/1     Running   0             39h
ioi-quality-taskmanager-2-12              1/1     Running   0             39h
ioi-quality-taskmanager-2-13              1/1     Running   0             39h
ioi-quality-taskmanager-2-14              1/1     Running   0             39h
ioi-quality-taskmanager-2-15              1/1     Running   0             39h
ioi-quality-taskmanager-2-16              1/1     Running   0             39h
ioi-quality-taskmanager-2-17              1/1     Running   0             39h
ioi-quality-taskmanager-2-18              1/1     Running   0             39h
ioi-quality-taskmanager-2-19              1/1     Running   0             39h
ioi-quality-taskmanager-2-2               1/1     Running   0             39h
ioi-quality-taskmanager-2-20              1/1     Running   0             39h
ioi-quality-taskmanager-2-21              1/1     Running   0             39h
ioi-quality-taskmanager-2-22              1/1     Running   0             39h
ioi-quality-taskmanager-2-23              1/1     Running   0             39h
ioi-quality-taskmanager-2-24              1/1     Running   0             39h
ioi-quality-taskmanager-2-3               1/1     Running   0             39h
ioi-quality-taskmanager-2-4               1/1     Running   0             39h
ioi-quality-taskmanager-2-5               1/1     Running   0             39h
ioi-quality-taskmanager-2-6               1/1     Running   0             39h
ioi-quality-taskmanager-2-7               1/1     Running   0             39h
ioi-quality-taskmanager-2-8               1/1     Running   0             39h
ioi-quality-taskmanager-2-9               1/1     Running   0             39h
 
 

> Extra TMs are started when JobManager is OOM killed in some FlinkDeployment 
> runs
> -------------------------------------------------------------------------------
>
>                 Key: FLINK-37773
>                 URL: https://issues.apache.org/jira/browse/FLINK-37773
>             Project: Flink
>          Issue Type: Bug
>          Components: Kubernetes Operator
>    Affects Versions: 1.10.0
>            Reporter: Nihar Rao
>            Priority: Major
>         Attachments: Screenshot 2025-04-24 at 3.29.47 PM.png
>
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
