Baozhu Zhao created FLINK-37813:
-----------------------------------
Summary: JobManager failover during slot allocation causes
ResourceManager to fail to release unwanted TaskManagers
Key: FLINK-37813
URL: https://issues.apache.org/jira/browse/FLINK-37813
Project: Flink
Issue Type: Bug
Components: Runtime / Coordination
Affects Versions: 1.17.2
Environment: Environment description:
Flink running on k8s
Flink version 1.17
The job requires 3 TaskManagers with 10 slots each. The option
`cluster.fine-grained-resource-management.enabled=true` is enabled.
Reporter: Baozhu Zhao
Attachments: new-tm.log, old-tm.log, 注册的tm.png
Environment description:
Flink running on k8s
The job requires 3 TaskManagers with 10 slots each. The option
`cluster.fine-grained-resource-management.enabled=true` is enabled.
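Problem description:
After a JobManager failover, the slot report that a TaskManager registers is not as expected, so idle TaskManagers cannot be released.
Steps to reproduce:
1. Kill one TaskManager so that the job fails over and the slot manager re-allocates slots
to the surviving TaskManagers. Before slot allocation completes, kill the JobManager; the job enters the suspending state. [^old-tm.log]
2. After the new JM starts, the surviving TaskManagers register with it. The slotReport registered by a surviving TaskManager
contains more slots than that of a normal TaskManager. As a result, when the ResourceManager periodically checks for and releases idle TaskManagers,
it cannot compute `releaseOrRequestWorkerNumber` correctly, and the idle TaskManager is released. [^new-tm.log]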
!注册的tm.png|width=372,height=190!
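The symptom in step 2, a slot report listing more slots than a TaskManager is configured with, can be checked mechanically when triaging similar cases. The following is only a minimal sketch under stated assumptions: `SlotReportCheck` and `findOverReporting` are hypothetical names, not part of Flink, and the input map of TaskManager id to reported slot count is assumed to have been extracted by hand from logs such as new-tm.log.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Hypothetical diagnostic helper for illustration; not part of Flink. */
final class SlotReportCheck {

    /**
     * Returns the ids of TaskManagers whose reported slot count exceeds the
     * configured number of slots per TaskManager, sorted alphabetically.
     *
     * @param reportedSlots       TaskManager id -> slot count taken from its slot report (assumed input)
     * @param slotsPerTaskManager configured slots per TaskManager (10 in this report's setup)
     */
    static List<String> findOverReporting(Map<String, Integer> reportedSlots, int slotsPerTaskManager) {
        List<String> suspicious = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : reportedSlots.entrySet()) {
            // A TaskManager should never report more slots than it was configured with.
            if (entry.getValue() > slotsPerTaskManager) {
                suspicious.add(entry.getKey());
            }
        }
        Collections.sort(suspicious);
        return suspicious;
    }
}
```

For example, with made-up counts, `SlotReportCheck.findOverReporting(Map.of("tm-1", 10, "tm-2", 13, "tm-3", 10), 10)` returns a list containing only `tm-2`. The sketch only flags the anomaly; it does not model how the ResourceManager derives `releaseOrRequestWorkerNumber`.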
--
This message was sent by Atlassian Jira
(v8.20.10#820010)