Github user mxm commented on the issue:
https://github.com/apache/flink/pull/2571
Great to hear that we're on the same page :)
>I think there's no need to stick to the failed slot when the allocation fails
by RPC. Just put it back into the free pool and give it another shot.
Yes, we can simply trigger processing of pending requests via
`handleFreeSlot`.
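A minimal sketch of that idea, not the actual code in this PR: a failed allocation returns the slot to the free pool, and the pending request gets another shot with a different slot. Only the name `handleFreeSlot` comes from the discussion; `SlotPoolSketch`, `requestSlot`, `processPending`, and the synchronous `allocateRpc` predicate are hypothetical stand-ins (the real RPC to a TaskExecutor would be asynchronous).

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiPredicate;

/** Hypothetical sketch; not Flink's actual slot management code. */
class SlotPoolSketch {
    private final Deque<String> freeSlots = new ArrayDeque<>();
    private final Deque<String> pendingRequests = new ArrayDeque<>();
    private final Map<String, String> allocated = new HashMap<>();
    // Stand-in for the allocation RPC: (slotId, requestId) -> accepted?
    private final BiPredicate<String, String> allocateRpc;

    SlotPoolSketch(BiPredicate<String, String> allocateRpc) {
        this.allocateRpc = allocateRpc;
    }

    /** Queues a new request and tries to serve it from the free pool. */
    void requestSlot(String requestId) {
        pendingRequests.add(requestId);
        processPending();
    }

    /** A slot became free: add it to the pool and process pending requests. */
    void handleFreeSlot(String slotId) {
        freeSlots.add(slotId);
        processPending();
    }

    private void processPending() {
        // Try each currently free slot at most once per pass, so an RPC
        // that keeps failing cannot make this loop spin forever.
        int attempts = freeSlots.size();
        while (attempts-- > 0 && !pendingRequests.isEmpty()) {
            String slot = freeSlots.poll();
            String request = pendingRequests.peek();
            if (allocateRpc.test(slot, request)) {
                pendingRequests.poll();
                allocated.put(request, slot);
            } else {
                // Allocation failed: do not stick to this slot. Put it back
                // at the end of the free pool; the request stays pending and
                // is retried with the next free slot.
                freeSlots.add(slot);
            }
        }
    }

    String allocatedSlot(String requestId) {
        return allocated.get(requestId);
    }

    int freeSlotCount() {
        return freeSlots.size();
    }
}
```

With a predicate that rejects one particular slot, a request ends up on the next free slot while the rejected slot stays in the free pool.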
>Actually, I think the pending requests act like your extra list of
unconfirmed requests. (And as you pointed out at the end, we don't actually need
this list, as the TaskManager will correct our faults by rejecting the allocation.)
I think PendingRequests is not the same, because it is a list of outstanding
requests that have not yet been issued to TaskExecutors, rather than requests
that have been issued but not confirmed. But as we found out, we don't need a
special list for the latter on the ResourceManager side.
>Yes, I also thought this might be a solution. And I think this can work
with the Heartbeat manager, since if you cannot send the free message to the RM,
you will not be able to send a heartbeat either. After some timeout, the RM will
treat the TaskManager as dead, and some garbage collection logic in the RM will
take care of all the allocations and slots that belong to this TaskManager.
Are you saying you would rather let the HeartbeatManager send out the
removal of slots? That would work but depending on the heartbeat interval this
could take slightly longer. Semantically, it doesn't make much difference.
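To make the timeout-based variant concrete, here is a rough sketch under my own assumptions, not the HeartbeatManager from this PR: the RM records the last heartbeat per TaskManager and, once a TaskManager has been silent for longer than the timeout, drops all of its slots in one go. The class `HeartbeatCleanupSketch` and all of its method names are hypothetical, and timestamps are passed in explicitly to keep the example deterministic.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch; not Flink's actual heartbeat or cleanup code. */
class HeartbeatCleanupSketch {
    private final long timeoutMillis;
    private final Map<String, Long> lastHeartbeat = new HashMap<>();
    private final Map<String, Set<String>> slotsByTaskManager = new HashMap<>();

    HeartbeatCleanupSketch(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Registers a slot for a TaskManager; registration counts as a heartbeat. */
    void registerSlot(String taskManagerId, String slotId, long nowMillis) {
        slotsByTaskManager.computeIfAbsent(taskManagerId, k -> new HashSet<>()).add(slotId);
        lastHeartbeat.put(taskManagerId, nowMillis);
    }

    /** Records a heartbeat from a known TaskManager; unknown senders are ignored. */
    void heartbeat(String taskManagerId, long nowMillis) {
        if (lastHeartbeat.containsKey(taskManagerId)) {
            lastHeartbeat.put(taskManagerId, nowMillis);
        }
    }

    /** Removes every TaskManager whose last heartbeat is older than the timeout, with its slots. */
    List<String> removeExpired(long nowMillis) {
        List<String> dead = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastHeartbeat.entrySet()) {
            if (nowMillis - entry.getValue() > timeoutMillis) {
                dead.add(entry.getKey());
            }
        }
        for (String taskManagerId : dead) {
            lastHeartbeat.remove(taskManagerId);
            slotsByTaskManager.remove(taskManagerId);
        }
        return dead;
    }

    int slotCount() {
        int count = 0;
        for (Set<String> slots : slotsByTaskManager.values()) {
            count += slots.size();
        }
        return count;
    }
}
```

The delay mentioned above shows up directly here: a slot is only cleaned up once `removeExpired` runs after the timeout has elapsed, rather than as soon as the free message is lost.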