Ruibin Xing created FLINK-34347:
-----------------------------------
Summary: Kubernetes native resource manager request wrong spec.
Key: FLINK-34347
URL: https://issues.apache.org/jira/browse/FLINK-34347
Project: Flink
Issue Type: Bug
Components: Deployment / Kubernetes, Kubernetes Operator
Affects Versions: kubernetes-operator-1.6.1, 1.18.0
Reporter: Ruibin Xing
Attachments: jobmanager.csv,
taskmanager_octopus-16-323-octopus-engine-write-proxy-taskmanager-3-326.csv
We had a Flink spec in which the TM CPU was set to 0.5, and we then upgraded it
to 4.0. We found the job manager requesting TMs with both 0.5 CPU and 4 CPU.
Most of the TMs with 0.5 CPU were released soon afterwards, but one TM with
0.5 CPU remained and caused lag in the job.
Logs for mixed TM requests:
{code:java}
2024-02-03 10:10:41,414 INFO
org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] -
Requested worker octopus-16-323-octopus-engine-write-proxy-taskmanager-3-244
with resource spec WorkerResourceSpec {cpuCores=4.0, taskHeapSize=5.637gb
(6053219520 bytes), taskOffHeapSize=1024.000mb (1073741824 bytes),
networkMemSize=64.000mb (67108864 bytes), managedMemSize=0 bytes,
numSlots=4}.

2024-02-03 10:10:44,844 INFO
org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] -
Requesting new worker with resource spec WorkerResourceSpec {cpuCores=0.5,
taskHeapSize=1.137gb (1221381320 bytes), taskOffHeapSize=1024.000mb (1073741824
bytes), networkMemSize=64.000mb (67108864 bytes), managedMemSize=0 bytes,
numSlots=4}, current pending count: 1.

2024-02-03 10:10:44,920 INFO
org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] -
Requesting new worker with resource spec WorkerResourceSpec {cpuCores=0.5,
taskHeapSize=1.137gb (1221381320 bytes), taskOffHeapSize=1024.000mb (1073741824
bytes), networkMemSize=64.000mb (67108864 bytes), managedMemSize=0 bytes,
numSlots=4}, current pending count: 2.
{code}
The name of the wrong TM:
octopus-16-323-octopus-engine-write-proxy-taskmanager-3-326.
Relevant logs are attached.
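To check a job manager log for the same symptom, the worker requests can be tallied by their cpuCores value. The following is only a rough sketch, not part of Flink: the names SPEC_RE and cpu_request_counts are hypothetical, and the pattern assumes the ActiveResourceManager message wording shown in the excerpt above, which may differ in other Flink versions.

```python
import re
from collections import Counter

# Assumption: request messages look like the excerpt in this report, i.e.
# "Requested worker <name> with resource spec WorkerResourceSpec {cpuCores=...}"
# or "Requesting new worker with resource spec WorkerResourceSpec {cpuCores=...}".
SPEC_RE = re.compile(
    r"Request(?:ed worker \S+|ing new worker) with resource spec "
    r"WorkerResourceSpec\s*\{cpuCores=([0-9.]+)"
)


def cpu_request_counts(log_text):
    """Count worker requests per cpuCores value found in log_text."""
    # Log records may be wrapped across lines, so collapse all whitespace first.
    flat = " ".join(log_text.split())
    return Counter(float(m.group(1)) for m in SPEC_RE.finditer(flat))


def has_mixed_specs(log_text):
    """Return True if requests with more than one cpuCores value appear."""
    return len(cpu_request_counts(log_text)) > 1
```

More than one key in the resulting counter after a spec upgrade would indicate that workers with the old and the new CPU setting were both requested.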
--
This message was sent by Atlassian Jira
(v8.20.10#820010)