[
https://issues.apache.org/jira/browse/FLINK-37773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Nihar Rao updated FLINK-37773:
------------------------------
Description:
Hi,
We are running into a weird issue with Apache Flink Kubernetes operator 1.10.0
and Apache Flink 1.19.1. We run jobs in native Kubernetes application mode via
the FlinkDeployment CRD. The affected job has 24 TaskManagers and 1 JobManager
replica with HA enabled.
Below is a chronological summary of events:
1. The job was initially started with 24 TaskManagers.
2. The JM pod was OOMKilled; this is confirmed by our KSM metrics, and
{{kubectl describe pod <JM pod>}} also shows the pod restarted due to OOM.
3. After the JM was OOMKilled, it restarted and 24 new TaskManager pods were
started, as confirmed by the available task slots section of the Flink UI.
4. There was no impact on the job (it restarted successfully), but there are
now 48 TaskManagers running, of which 24 are standby. The expected behaviour
after a JM OOM with HA enabled is that no new TaskManagers are started.
5. I have attached a screenshot of the Flink UI showing the 24 extra TMs (48
task slots) and the kubectl output below.
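For reference, the OOM confirmation from step 2 can be read straight from the pod's last terminated container state. A minimal sketch (the pod name is illustrative, and the check is run here against a saved status fragment so it does not need a cluster):

```shell
# On a live cluster the reason can be read directly (hypothetical pod name):
#   kubectl get pod ioi-quality-667f575877-btfkv \
#     -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
# Below, the same check against a saved status fragment:
cat > /tmp/jm-last-state.json <<'EOF'
{"lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}}
EOF
# Exit code 137 = 128 + SIGKILL(9), consistent with the kernel OOM killer.
grep -o '"reason": "[^"]*"' /tmp/jm-last-state.json
```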
I also checked the Kubernetes operator pod logs and could not find anything
that explains this behaviour. This has happened a few times now with different
jobs. We have tried deliberately OOMKilling the JobManager of one of our test
jobs many times, but we have not been able to reproduce the behaviour; it
looks like an edge case that is difficult to reproduce.
Can you please advise on how to debug this, as the Kubernetes operator does
not show any relevant information about why this happened? Thanks, and let me
know if you need further information.
kubectl get pod output showing 24 extra TMs:
NAME READY STATUS RESTARTS AGE
ioi-quality-667f575877-btfkv 1/1 Running 1 (39h ago) 4d16h
ioi-quality-taskmanager-1-1 1/1 Running 0 4d16h
ioi-quality-taskmanager-1-10 1/1 Running 0 4d16h
ioi-quality-taskmanager-1-11 1/1 Running 0 4d16h
ioi-quality-taskmanager-1-12 1/1 Running 0 4d16h
ioi-quality-taskmanager-1-13 1/1 Running 0 4d16h
ioi-quality-taskmanager-1-14 1/1 Running 0 4d16h
ioi-quality-taskmanager-1-15 1/1 Running 0 4d16h
ioi-quality-taskmanager-1-16 1/1 Running 0 4d16h
ioi-quality-taskmanager-1-17 1/1 Running 0 4d16h
ioi-quality-taskmanager-1-18 1/1 Running 0 4d16h
ioi-quality-taskmanager-1-19 1/1 Running 0 4d16h
ioi-quality-taskmanager-1-2 1/1 Running 0 4d16h
ioi-quality-taskmanager-1-20 1/1 Running 0 4d16h
ioi-quality-taskmanager-1-21 1/1 Running 0 4d16h
ioi-quality-taskmanager-1-22 1/1 Running 0 4d16h
ioi-quality-taskmanager-1-23 1/1 Running 0 4d16h
ioi-quality-taskmanager-1-24 1/1 Running 0 4d16h
ioi-quality-taskmanager-1-3 1/1 Running 0 4d16h
ioi-quality-taskmanager-1-4 1/1 Running 0 4d16h
ioi-quality-taskmanager-1-5 1/1 Running 0 4d16h
ioi-quality-taskmanager-1-6 1/1 Running 0 4d16h
ioi-quality-taskmanager-1-7 1/1 Running 0 4d16h
ioi-quality-taskmanager-1-8 1/1 Running 0 4d16h
ioi-quality-taskmanager-1-9 1/1 Running 0 4d16h
ioi-quality-taskmanager-2-1 1/1 Running 0 39h
ioi-quality-taskmanager-2-10 1/1 Running 0 39h
ioi-quality-taskmanager-2-11 1/1 Running 0 39h
ioi-quality-taskmanager-2-12 1/1 Running 0 39h
ioi-quality-taskmanager-2-13 1/1 Running 0 39h
ioi-quality-taskmanager-2-14 1/1 Running 0 39h
ioi-quality-taskmanager-2-15 1/1 Running 0 39h
ioi-quality-taskmanager-2-16 1/1 Running 0 39h
ioi-quality-taskmanager-2-17 1/1 Running 0 39h
ioi-quality-taskmanager-2-18 1/1 Running 0 39h
ioi-quality-taskmanager-2-19 1/1 Running 0 39h
ioi-quality-taskmanager-2-2 1/1 Running 0 39h
ioi-quality-taskmanager-2-20 1/1 Running 0 39h
ioi-quality-taskmanager-2-21 1/1 Running 0 39h
ioi-quality-taskmanager-2-22 1/1 Running 0 39h
ioi-quality-taskmanager-2-23 1/1 Running 0 39h
ioi-quality-taskmanager-2-24 1/1 Running 0 39h
ioi-quality-taskmanager-2-3 1/1 Running 0 39h
ioi-quality-taskmanager-2-4 1/1 Running 0 39h
ioi-quality-taskmanager-2-5 1/1 Running 0 39h
ioi-quality-taskmanager-2-6 1/1 Running 0 39h
ioi-quality-taskmanager-2-7 1/1 Running 0 39h
ioi-quality-taskmanager-2-8 1/1 Running 0 39h
ioi-quality-taskmanager-2-9 1/1 Running 0 39h
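The duplication is visible in the pod names themselves: native-Kubernetes TM pods appear to be named <cluster>-taskmanager-<attempt>-<n>, with the attempt counter incrementing when the JM/ResourceManager restarts, so two attempt prefixes with 24 pods each suggests the pre-restart TMs were never released. A small sketch counting TM pods per attempt prefix from output like the above (a trimmed sample is inlined so it runs standalone; pipe the real `kubectl get pod` output instead):

```shell
# Count TaskManager pods per resource-manager attempt prefix.
pods='ioi-quality-taskmanager-1-1 1/1 Running 0 4d16h
ioi-quality-taskmanager-1-2 1/1 Running 0 4d16h
ioi-quality-taskmanager-2-1 1/1 Running 0 39h
ioi-quality-taskmanager-2-2 1/1 Running 0 39h'
# Split on "-"; the second-to-last field of the pod name is the attempt id.
printf '%s\n' "$pods" |
  awk -F'-' '/-taskmanager-/ {count[$(NF-1)]++}
             END {for (a in count) print "attempt " a ": " count[a] " TMs"}' |
  sort
```

With the full listing above, both attempts report 24 TMs, i.e. 24 leaked standby pods.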
> Extra TMs are started when JobManager is OOM killed in some FlinkDeployment
> runs
> -------------------------------------------------------------------------------
>
> Key: FLINK-37773
> URL: https://issues.apache.org/jira/browse/FLINK-37773
> Project: Flink
> Issue Type: Bug
> Components: Kubernetes Operator
> Affects Versions: 1.10.0
> Reporter: Nihar Rao
> Priority: Major
> Attachments: Screenshot 2025-04-24 at 3.29.47 PM.png
>
>
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)