wgcn created FLINK-20138:
----------------------------
Summary: Flink Job can not recover due to timeout of requiring
slots when flink jobmanager restarted
Key: FLINK-20138
URL: https://issues.apache.org/jira/browse/FLINK-20138
Project: Flink
Issue Type: Bug
Components: Deployment / YARN, Table SQL / Runtime
Environment: flink: 1.9.2
hadoop: 2.7.2
jdk: 1.8
Reporter: wgcn
Attachments: 2820F7EE-85F9-441D-95D5-8163FB6267DF.png
Our Flink jobs run on YARN in per-job mode. We stopped some NodeManager
machines, and the ApplicationMasters (AMs) on those machines were restarted on
other NodeManagers. We found that some jobs could not recover because their
slot requests timed out.
SlotPoolImpl never connected to the ResourceManager:
```
2020-11-09 16:31:31,794 INFO flink-akka.actor.default-dispatcher-16 (org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.stashRequestWaitingForResourceManager:369) - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{456c9daa6670a4490810f8e51f495174}]
```
1. We did not find any log entry of YarnResourceManager requesting containers
in the JobManager log in the attachment.
2. The ZooKeeper node is also shown in the attachment.
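
To check whether other JobManager logs show the same symptom, a small script along these lines could help. It is only a sketch: it assumes the log contains the SlotPoolImpl message exactly as quoted above (possibly wrapped across lines), and the function name and the log path passed on the command line are hypothetical.

```python
import re
import sys

# Matches the SlotPoolImpl message quoted in this report. Whitespace is
# matched loosely so the pattern also works when the log line is wrapped.
PENDING_RE = re.compile(
    r"Cannot\s+serve\s+slot\s+request,\s+no\s+ResourceManager\s+connected\.\s+"
    r"Adding\s+as\s+pending\s+request\s+\[SlotRequestId\{([0-9a-f]+)\}\]"
)


def find_stashed_slot_requests(log_text):
    """Return the SlotRequestIds stashed while no ResourceManager was connected."""
    return PENDING_RE.findall(log_text)


if __name__ == "__main__":
    # Hypothetical usage: python check_slots.py /path/to/jobmanager.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as f:
        ids = find_stashed_slot_requests(f.read())
    print(f"{len(ids)} slot request(s) stashed without a ResourceManager connection")
    for slot_request_id in ids:
        print(slot_request_id)
```

A count that keeps growing after the AM restart, with no matching container requests from YarnResourceManager, would be consistent with the behaviour described here.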
--
This message was sent by Atlassian Jira
(v8.3.4#803005)