Shimin Yang created FLINK-9567:
----------------------------------

Summary: Flink does not release resource in Yarn Cluster mode
Key: FLINK-9567
URL: https://issues.apache.org/jira/browse/FLINK-9567
Project: Flink
Issue Type: Bug
Components: Cluster Management, YARN
Affects Versions: 1.5.0
Reporter: Shimin Yang

After restarting the Job Manager in YARN cluster mode, Flink does not release task manager containers in some cases. From what I have observed, the cause is that the instance variable *numPendingContainerRequests* in the *YarnResourceManager* class is not decremented for requests whose containers have not yet been received. After the Job Manager restarts, *numPendingContainerRequests* is increased again by the number of task executors. Because the callback *onContainersAllocated* returns excess containers immediately only if *numPendingContainerRequests* <= 0, the number of containers keeps growing while only a few of them actually run as task managers.

I think it is important to clear *numPendingContainerRequests* after the Job Manager restarts, but it is not clear to me how to do that. There is currently no way to decrease *numPendingContainerRequests* other than through *onContainersAllocated*. Would it be fine to add a method that operates on *numPendingContainerRequests*? Also, the *ExecutionGraph* restart logic has no handle to the YarnResourceManager.
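
To make the counter behavior described above concrete, here is a minimal toy model. It is not Flink's actual *YarnResourceManager* code: only the names *numPendingContainerRequests* and *onContainersAllocated* come from this report, while the class, the container names, the `clearPendingRequests` method (standing in for the proposed fix) and the `simulate` scenario are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of the counter behavior described in this issue.
// NOT Flink's YarnResourceManager: every name except
// numPendingContainerRequests and onContainersAllocated is hypothetical.
public class PendingRequestModel {
    private int numPendingContainerRequests = 0;
    private final List<String> held = new ArrayList<>();
    private final List<String> returned = new ArrayList<>();

    // One call per task executor the job asks for.
    public void requestContainer() {
        numPendingContainerRequests++;
    }

    // As described in the report: a container is handed back immediately
    // only when the pending counter is <= 0; otherwise it is kept.
    public void onContainersAllocated(List<String> containers) {
        for (String container : containers) {
            if (numPendingContainerRequests > 0) {
                numPendingContainerRequests--;
                held.add(container);
            } else {
                returned.add(container);
            }
        }
    }

    // Hypothetical method of the kind the report proposes adding.
    public void clearPendingRequests() {
        numPendingContainerRequests = 0;
    }

    public int heldCount() { return held.size(); }

    public int returnedCount() { return returned.size(); }

    // Assumed scenario: a job needing 2 task executors restarts before any
    // container arrives, then YARN delivers containers for all 4 requests.
    public static PendingRequestModel simulate(boolean clearOnRestart) {
        PendingRequestModel rm = new PendingRequestModel();
        int taskExecutors = 2;
        for (int i = 0; i < taskExecutors; i++) {
            rm.requestContainer();          // initial requests, never fulfilled
        }
        if (clearOnRestart) {
            rm.clearPendingRequests();      // proposed reset on restart
        }
        for (int i = 0; i < taskExecutors; i++) {
            rm.requestContainer();          // requests issued again after restart
        }
        rm.onContainersAllocated(Arrays.asList("c1", "c2", "c3", "c4"));
        return rm;
    }

    public static void main(String[] args) {
        PendingRequestModel leaky = simulate(false);
        System.out.println("without clear: held=" + leaky.heldCount()
                + " returned=" + leaky.returnedCount());
        PendingRequestModel fixed = simulate(true);
        System.out.println("with clear: held=" + fixed.heldCount()
                + " returned=" + fixed.returnedCount());
    }
}
```

In this model, the run without the reset keeps all four containers although only two task executors are needed, while the run with the reset keeps two and returns two. Whether a plain reset is safe in the real class, for example with respect to requests still outstanding at YARN, is not something this sketch can show.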