Github user KurtYoung commented on the issue:
https://github.com/apache/flink/pull/2571
@mxm Thanks for your thoughts, I really like the discussion. :smirk:
You pointed out another issue that I missed in my previous reply.
Actually, I noticed it right after I posted that reply and, as it turns
out, my solution was exactly the same as the one you proposed!
Some comments inline:
> It is guaranteed that the ResourceManager will receive an RPC response of
> some sort. Either a reply from the TaskExecutor, or a timeout/error which is
> returned by the future. If the request is then retried, the TaskExecutor
> receives the same request twice but will simply acknowledge it again
I don't think we need to stick to the failed slot when the allocation fails
at the RPC level. We can just put it back into the free pool and try again.
Actually, I think the pending requests act like your extra list of unconfirmed
requests. (And, as you pointed out at the end, we don't actually need this list,
since the TaskManager will correct our mistake by rejecting the allocation.)
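A minimal sketch of the "put it back into the free pool" idea, to make the retry behaviour concrete. All names here (`SlotRetrySketch`, `tryAllocateOnce`, the `rpc` predicate) are hypothetical and are not Flink's actual SlotManager API; the predicate only stands in for the ResourceManager to TaskExecutor call and returns `false` on a timeout or error.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiPredicate;

// Hypothetical sketch, not Flink's actual SlotManager.
class SlotRetrySketch {
    private final Deque<String> freeSlots = new ArrayDeque<>();
    private final Deque<String> pendingRequests = new ArrayDeque<>();
    private final Map<String, String> allocations = new HashMap<>();

    void addFreeSlot(String slotId) { freeSlots.addLast(slotId); }

    void addRequest(String requestId) { pendingRequests.addLast(requestId); }

    // Makes one allocation attempt for the oldest pending request.
    // The rpc predicate is an assumed stand-in for the real RPC call.
    boolean tryAllocateOnce(BiPredicate<String, String> rpc) {
        if (freeSlots.isEmpty() || pendingRequests.isEmpty()) {
            return false;
        }
        String slot = freeSlots.pollFirst();
        String request = pendingRequests.peekFirst();
        if (rpc.test(request, slot)) {
            pendingRequests.pollFirst();
            allocations.put(request, slot);
            return true;
        }
        // RPC failed: do not pin the request to this slot. The slot goes to
        // the back of the free pool and the request simply stays pending.
        freeSlots.addLast(slot);
        return false;
    }

    String slotFor(String requestId) { return allocations.get(requestId); }

    int freeSlotCount() { return freeSlots.size(); }

    int pendingCount() { return pendingRequests.size(); }

    public static void main(String[] args) {
        SlotRetrySketch pool = new SlotRetrySketch();
        pool.addFreeSlot("slot-1");
        pool.addFreeSlot("slot-2");
        pool.addRequest("req-1");
        // Simulate an RPC that always fails for slot-1.
        BiPredicate<String, String> rpc = (req, slot) -> !slot.equals("slot-1");
        while (pool.pendingCount() > 0) {
            pool.tryAllocateOnce(rpc);
        }
        System.out.println("req-1 -> " + pool.slotFor("req-1"));
    }
}
```

In this sketch the pending request queue is the only bookkeeping: no separate list of unconfirmed requests is kept, which is the simplification argued for above.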
>There is one more problem thought. How to prevent a false request from the
> ResourceManager to the TaskExecutor in case the ResourceManager hasn't received
> a reply from the TaskExecutor but the TaskExecutor has already removed the slot
> again
When the allocation fails at the RPC level and we only have one free slot, it's
true that we will keep retrying the same slot and keep failing. Suppose the task
then finishes and the slot becomes free again, and only then does our request
reach the TaskManager. It's OK for the TaskManager to accept the request: in the
end, the JobManager will reject this allocation, and the slot will become free
again.
>Again, the only solution for this problem seems to be to keep a list of
> unconfirmed slot allocation removal requests at the TaskExecutor.
Yes, I also thought this might be a solution. And I think this can work
with the Heartbeat manager, since if you cannot send the free message to the RM,
you will not be able to send heartbeats either. After some timeout, the RM will
treat the TaskManager as dead, and some garbage collection logic in the RM will
take care of all the allocations and slots which belong to this TaskManager.
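The heartbeat-based cleanup could look roughly like the following sketch. It is only an illustration under assumptions: the class `HeartbeatGcSketch`, its methods, and the explicit `now` timestamps are hypothetical and do not reflect the actual Heartbeat manager or ResourceManager code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of RM-side cleanup driven by heartbeat timeouts.
class HeartbeatGcSketch {
    private final long timeoutMillis;
    private final Map<String, Long> lastHeartbeat = new HashMap<>();
    private final Map<String, Set<String>> slotsByTaskManager = new HashMap<>();

    HeartbeatGcSketch(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Records a slot for a TaskManager; first registration counts as a heartbeat.
    void registerSlot(String taskManagerId, String slotId, long now) {
        lastHeartbeat.putIfAbsent(taskManagerId, now);
        slotsByTaskManager
            .computeIfAbsent(taskManagerId, k -> new HashSet<>())
            .add(slotId);
    }

    // Heartbeats from unknown (already removed) TaskManagers are ignored.
    void heartbeat(String taskManagerId, long now) {
        if (lastHeartbeat.containsKey(taskManagerId)) {
            lastHeartbeat.put(taskManagerId, now);
        }
    }

    // Treats every TaskManager whose last heartbeat is older than the timeout
    // as dead and drops all slots that belong to it.
    List<String> sweep(long now) {
        List<String> dead = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = lastHeartbeat.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (now - entry.getValue() > timeoutMillis) {
                dead.add(entry.getKey());
                slotsByTaskManager.remove(entry.getKey());
                it.remove();
            }
        }
        return dead;
    }

    int slotCount() {
        int count = 0;
        for (Set<String> slots : slotsByTaskManager.values()) {
            count += slots.size();
        }
        return count;
    }
}
```

The point of the sketch is that a TaskManager which cannot deliver its free message also stops refreshing its heartbeat, so one sweep removes its stale slots without any per-message confirmation list.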
All in all, I think this version is still much simpler than the first one.
:smirk: