[ https://issues.apache.org/jira/browse/FLINK-9567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Shimin Yang updated FLINK-9567:
-------------------------------
    Attachment: fulllog.txt

> Flink does not release resource in Yarn Cluster mode
> ----------------------------------------------------
>
>                 Key: FLINK-9567
>                 URL: https://issues.apache.org/jira/browse/FLINK-9567
>             Project: Flink
>          Issue Type: Bug
>          Components: Cluster Management, YARN
>    Affects Versions: 1.5.0
>            Reporter: Shimin Yang
>            Priority: Major
>         Attachments: FlinkYarnProblem, fulllog.txt
>
>
> After restarting the Job Manager in Yarn Cluster mode, Flink does not release
> task manager containers in some specific cases.
> In the first log I posted, the container with id 24 is the reason why Yarn
> did not release resources, although the Task Manager in the container with id
> 24 was released before the restart.
> But in line 347,
> 2018-06-14 22:50:47,846 WARN akka.remote.ReliableDeliverySupervisor -
> Association with remote system [akka.tcp://flink@bd-r1hdp69:30609] has
> failed, address is now gated for [50] ms. Reason: [Disassociated]
> this problem caused Flink to request one more container than needed.
> As the return of excess containers is determined by the
> *numPendingContainerRequests* variable in *YarnResourceManager*, I think
> *onContainersCompleted* in *YarnResourceManager* called the method
> *requestYarnContainer*, which leads to the increase of
> *numPendingContainerRequests*. However, since the restart logic has already
> allocated enough containers for the Task Managers, Flink holds the extra
> container for a long time for nothing. In the worst case, I had a job
> configured with 5 task managers that held more than 100 containers in the
> end.
> PS: Another strange thing I found is that sometimes, when requesting a Yarn
> container, it returns many more than requested. Is this normal behavior
> for AMRMAsyncClient?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
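
The accounting problem suspected in the report can be sketched with a small model. This is not Flink's YarnResourceManager code. The class PendingRequestSketch, its fields other than numPendingContainerRequests, and the simulate helper are hypothetical names used only for illustration. The model assumes that an allocated container is kept while the pending counter is positive and returned otherwise, which is how the report describes the role of numPendingContainerRequests.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical model of the counter logic described in the report; not Flink code. */
public class PendingRequestSketch {
    private int numPendingContainerRequests = 0;
    private final List<String> heldContainers = new ArrayList<>();
    private final List<String> returnedContainers = new ArrayList<>();
    // Assumption: toggles the suspected behavior (a new request on every completion).
    private final boolean requestOnCompletion;

    PendingRequestSketch(boolean requestOnCompletion) {
        this.requestOnCompletion = requestOnCompletion;
    }

    /** Records one outstanding container request. */
    void requestYarnContainer() {
        numPendingContainerRequests++;
    }

    /** Called when YARN reports a container as completed. */
    void onContainersCompleted(String containerId) {
        heldContainers.remove(containerId);
        if (requestOnCompletion) {
            // Suspected cause: a replacement is requested even though the
            // restart logic has already asked for enough containers.
            requestYarnContainer();
        }
    }

    /** Keeps a container while requests are pending, returns the excess. */
    void onContainersAllocated(List<String> containerIds) {
        for (String id : containerIds) {
            if (numPendingContainerRequests > 0) {
                numPendingContainerRequests--;
                heldContainers.add(id);
            } else {
                returnedContainers.add(id);
            }
        }
    }

    int heldCount() {
        return heldContainers.size();
    }

    /** Job needs one Task Manager; YARN happens to allocate two containers. */
    static int simulate(boolean requestOnCompletion) {
        PendingRequestSketch rm = new PendingRequestSketch(requestOnCompletion);
        rm.requestYarnContainer();                 // restart logic asks for 1 container
        rm.onContainersCompleted("container_24");  // late completion of an old container
        rm.onContainersAllocated(Arrays.asList("container_25", "container_26"));
        return rm.heldCount();
    }
}
```

In this model, simulate(true) ends with two held containers for a job that needs one, while simulate(false) keeps one and returns the other. If completions keep arriving across repeated restarts, the surplus would accumulate in the same way.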