[
https://issues.apache.org/jira/browse/FLINK-37773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Nishant More updated FLINK-37773:
---------------------------------
Attachment: Screenshot 2025-06-02 at 11.53.28 AM.png
> Extra TMs are started when Jobmanager is OOM killed in some FlinkDeployment runs
> -------------------------------------------------------------------------------
>
> Key: FLINK-37773
> URL: https://issues.apache.org/jira/browse/FLINK-37773
> Project: Flink
> Issue Type: Bug
> Components: Kubernetes Operator
> Affects Versions: 1.10.0
> Reporter: Nihar Rao
> Priority: Major
> Attachments: Screenshot 2025-04-24 at 3.29.47 PM.png, Screenshot
> 2025-06-02 at 11.53.28 AM.png
>
>
> Hi,
> We are running into an odd issue with Apache Flink Kubernetes Operator 1.10.0 and Apache Flink 1.19.1. We run our jobs in native Kubernetes application mode via the FlinkDeployment CRD. The affected job runs with 24 TaskManagers and 1 JobManager replica, with HA enabled.
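> For reference, here is a rough, minimal sketch of the FlinkDeployment spec we use (the image name, HA storage path, job jar, service account, and resource sizes below are placeholders rather than our real values; 48 TMs showing 48 task slots implies 1 slot per TM):
>
> # Hypothetical, trimmed-down equivalent of our spec; real values differ.
> kubectl apply -f - <<'EOF'
> apiVersion: flink.apache.org/v1beta1
> kind: FlinkDeployment
> metadata:
>   name: ioi-quality
> spec:
>   image: <our-flink-1.19.1-job-image>
>   flinkVersion: v1_19
>   serviceAccount: flink
>   flinkConfiguration:
>     taskmanager.numberOfTaskSlots: "1"
>     high-availability.type: kubernetes
>     high-availability.storageDir: <ha-storage-uri>
>   jobManager:
>     replicas: 1
>     resource:
>       memory: 2048m
>       cpu: 1
>   taskManager:
>     resource:
>       memory: 2048m
>       cpu: 1
>   job:
>     jarURI: local:///opt/flink/usrlib/<job>.jar
>     parallelism: 24
>     upgradeMode: last-state
> EOF
>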
> Below is the chronological summary of events:
> 1. The job was initially started with 24 TaskManagers.
> 2. The JM pod was OOMkilled; this is confirmed by our KSM metrics, and {{kubectl describe pod <JM pod>}} also shows that the pod restarted due to OOM (see the sketch after this list).
> 3. After the OOM kill, the JM restarted and 24 new TaskManager pods were started, which is confirmed by the available task slots shown in the Flink UI.
> 4. There was no impact on the job (it restarted successfully), but 48 TaskManagers are now running, 24 of which are standby. The expected behaviour after a JM OOM with HA enabled is that no new TaskManagers are started.
> 5. I have attached a screenshot of the Flink UI showing the 24 extra TMs (48 task slots) and included the kubectl output below.
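> This is roughly how we confirmed the OOM kill mentioned in step 2 (the pod name is the JM pod from the kubectl output further below):
>
> # Check the JM container's last termination state and reason.
> kubectl describe pod ioi-quality-667f575877-btfkv | grep -A 5 'Last State'
> kubectl get pod ioi-quality-667f575877-btfkv \
>   -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}'
> # For this pod the reason comes back as OOMKilled.
>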
> I also checked the Kubernetes operator pod logs and could not find anything that explains this behaviour. This has happened a few times now with different jobs. We have tried to reproduce it by purposely OOMkilling the JobManager of one of our test jobs many times, but we have not been able to trigger it again, so it looks like an edge case that is difficult to reproduce.
> Can you please help us figure out how to debug this, since the Kubernetes operator does not show any relevant information about why it happened? Thanks, and let me know if you need further information.
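> In case it helps, these are the commands we can run to gather more information. The label selectors below are based on the labels that Flink's native Kubernetes integration appears to set on its TM pods and HA ConfigMaps, and the operator deployment name/namespace are placeholders from our install, so please correct me if different selectors or names should be used:
>
> # Count TM pods per ResourceManager attempt (the "-1-" / "-2-" part of the pod name).
> kubectl get pods -l app=ioi-quality,component=taskmanager --no-headers \
>   | awk -F'taskmanager-' '{split($2, a, "-"); print a[1]}' | sort | uniq -c
>
> # List the HA metadata ConfigMaps created by the Kubernetes HA services.
> kubectl get configmap -l app=ioi-quality,configmap-type=high-availability
>
> # JM logs from before the restart (restart count is 1, so --previous still has them),
> # plus the operator logs for the same time window.
> kubectl logs ioi-quality-667f575877-btfkv --previous > jm-before-oom.log
> kubectl logs -n <operator-namespace> deploy/flink-kubernetes-operator > operator.log
>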
> {{kubectl get pod}} output showing the 24 extra TMs:
> NAME                           READY   STATUS    RESTARTS      AGE
> ioi-quality-667f575877-btfkv   1/1     Running   1 (39h ago)   4d16h
> ioi-quality-taskmanager-1-1    1/1     Running   0             4d16h
> ioi-quality-taskmanager-1-10   1/1     Running   0             4d16h
> ioi-quality-taskmanager-1-11   1/1     Running   0             4d16h
> ioi-quality-taskmanager-1-12   1/1     Running   0             4d16h
> ioi-quality-taskmanager-1-13   1/1     Running   0             4d16h
> ioi-quality-taskmanager-1-14   1/1     Running   0             4d16h
> ioi-quality-taskmanager-1-15   1/1     Running   0             4d16h
> ioi-quality-taskmanager-1-16   1/1     Running   0             4d16h
> ioi-quality-taskmanager-1-17   1/1     Running   0             4d16h
> ioi-quality-taskmanager-1-18   1/1     Running   0             4d16h
> ioi-quality-taskmanager-1-19   1/1     Running   0             4d16h
> ioi-quality-taskmanager-1-2    1/1     Running   0             4d16h
> ioi-quality-taskmanager-1-20   1/1     Running   0             4d16h
> ioi-quality-taskmanager-1-21   1/1     Running   0             4d16h
> ioi-quality-taskmanager-1-22   1/1     Running   0             4d16h
> ioi-quality-taskmanager-1-23   1/1     Running   0             4d16h
> ioi-quality-taskmanager-1-24   1/1     Running   0             4d16h
> ioi-quality-taskmanager-1-3    1/1     Running   0             4d16h
> ioi-quality-taskmanager-1-4    1/1     Running   0             4d16h
> ioi-quality-taskmanager-1-5    1/1     Running   0             4d16h
> ioi-quality-taskmanager-1-6    1/1     Running   0             4d16h
> ioi-quality-taskmanager-1-7    1/1     Running   0             4d16h
> ioi-quality-taskmanager-1-8    1/1     Running   0             4d16h
> ioi-quality-taskmanager-1-9    1/1     Running   0             4d16h
> ioi-quality-taskmanager-2-1    1/1     Running   0             39h
> ioi-quality-taskmanager-2-10   1/1     Running   0             39h
> ioi-quality-taskmanager-2-11   1/1     Running   0             39h
> ioi-quality-taskmanager-2-12   1/1     Running   0             39h
> ioi-quality-taskmanager-2-13   1/1     Running   0             39h
> ioi-quality-taskmanager-2-14   1/1     Running   0             39h
> ioi-quality-taskmanager-2-15   1/1     Running   0             39h
> ioi-quality-taskmanager-2-16   1/1     Running   0             39h
> ioi-quality-taskmanager-2-17   1/1     Running   0             39h
> ioi-quality-taskmanager-2-18   1/1     Running   0             39h
> ioi-quality-taskmanager-2-19   1/1     Running   0             39h
> ioi-quality-taskmanager-2-2    1/1     Running   0             39h
> ioi-quality-taskmanager-2-20   1/1     Running   0             39h
> ioi-quality-taskmanager-2-21   1/1     Running   0             39h
> ioi-quality-taskmanager-2-22   1/1     Running   0             39h
> ioi-quality-taskmanager-2-23   1/1     Running   0             39h
> ioi-quality-taskmanager-2-24   1/1     Running   0             39h
> ioi-quality-taskmanager-2-3    1/1     Running   0             39h
> ioi-quality-taskmanager-2-4    1/1     Running   0             39h
> ioi-quality-taskmanager-2-5    1/1     Running   0             39h
> ioi-quality-taskmanager-2-6    1/1     Running   0             39h
> ioi-quality-taskmanager-2-7    1/1     Running   0             39h
> ioi-quality-taskmanager-2-8    1/1     Running   0             39h
> ioi-quality-taskmanager-2-9    1/1     Running   0             39h
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)