Github user KurtYoung commented on the issue:
https://github.com/apache/flink/pull/2571
@mxm Thanks for the simplification, I like the idea. When I wrote the first
version of the SlotManager, I noticed that I had probably made things too
complicated, but I couldn't figure out how to simplify them.
As it turns out, your modification covers the two main problems I faced:
1. What information to exchange during heartbeats, and what actions we
should take
2. What action we should take when an allocation fails at the TaskManager
But what I really want to find out is whether there is a simple paradigm we
can follow here to make the whole thing clear and robust. What I previously
chose was: take actions based on my newest runtime information. But as you
can see, that led me to a very complex solution: each time I decide what
action to take, I have to check all related information and consider all
possibilities (even ones where it is hard to understand why they would happen).
Your modification gives me some hints. Maybe we can simplify it in the
following ways:
1. RM and TM only exchange information when needed (so heartbeats don't sync
status)
2. TM only reports information which it can change by itself (like a slot
becoming free again)
Here are some thoughts about the modification:
1. We can remove the update status part entirely. Since all it does now is
register new slots, we can just move that to the task executor's first
registration.
2. Once a slot becomes free in the TM, notify the RM
3. The TM should attach the slot usage when rejecting an allocation from the RM
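A rough sketch of what I mean by 1 and 2, just to make the idea concrete. All names here (`SlotRegistrySketch`, `registerTaskExecutor`, `notifySlotFree`, the `tmId_index` slot ids) are made up for illustration and are not the actual SlotManager API:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified RM-side bookkeeping; not the real Flink classes.
class SlotRegistrySketch {
    enum SlotState { FREE, ALLOCATED }

    // slotId -> state as the RM currently believes it to be
    private final Map<String, SlotState> slots = new HashMap<>();

    // Point 1: slots are reported once, at the task executor's first
    // registration, instead of through a periodic status update.
    void registerTaskExecutor(String tmId, int numberOfSlots) {
        for (int i = 0; i < numberOfSlots; i++) {
            slots.put(tmId + "_" + i, SlotState.FREE);
        }
    }

    // The RM changes state only through its own decisions...
    boolean allocate(String slotId) {
        if (slots.get(slotId) != SlotState.FREE) {
            return false; // unknown slot or already in use
        }
        slots.put(slotId, SlotState.ALLOCATED);
        return true;
    }

    // ...or, point 2, when the TM tells it that a slot became free again.
    void notifySlotFree(String slotId) {
        if (slots.containsKey(slotId)) {
            slots.put(slotId, SlotState.FREE);
        }
    }

    long freeSlotCount() {
        return slots.values().stream().filter(s -> s == SlotState.FREE).count();
    }
}
```

With this shape the heartbeat carries no slot status at all; the only TM-to-RM slot messages are the initial registration and the "slot is free" notification.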
Here are some minor problems I found in this modification:
a. As beyond1920 mentioned, we don't have a way to find out that a slot has
become free (this can be done by 2)
b. When we handleSlotRequestFailedAtTaskManager, we make the slot free
again. If the slot is occupied by some other task now, every allocation on
this slot will keep failing (this can be fixed by 3)
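To illustrate how 3 could fix b, here is a minimal sketch. The method name is borrowed from the PR, but the signature, the `actualOccupant` parameter and the rest of the class are assumptions of mine, not the existing code:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Hypothetical RM-side sketch; only illustrates the rejection handling.
class SlotRejectionSketch {
    private final Set<String> freeSlots = new HashSet<>();
    // slotId -> allocationId the RM believes occupies the slot
    private final Map<String, String> allocations = new HashMap<>();

    void addFreeSlot(String slotId) {
        freeSlots.add(slotId);
    }

    // Picks any slot the RM believes is free; empty if there is none.
    Optional<String> requestSlot(String allocationId) {
        Iterator<String> it = freeSlots.iterator();
        if (!it.hasNext()) {
            return Optional.empty();
        }
        String slotId = it.next();
        it.remove();
        allocations.put(slotId, allocationId);
        return Optional.of(slotId);
    }

    // The TM rejected the allocation and attached the slot usage:
    // actualOccupant is the allocation it reports in the slot, or null
    // if the slot is really free.
    void handleSlotRequestFailedAtTaskManager(
            String slotId, String failedAllocationId, String actualOccupant) {
        if (!failedAllocationId.equals(allocations.get(slotId))) {
            return; // stale failure for an allocation we no longer track
        }
        allocations.remove(slotId);
        if (actualOccupant == null) {
            freeSlots.add(slotId); // safe to hand out again
        } else {
            allocations.put(slotId, actualOccupant); // adopt the TM's view
        }
    }

    boolean isFree(String slotId) {
        return freeSlots.contains(slotId);
    }
}
```

Because the RM adopts the occupant reported by the TM instead of blindly freeing the slot, the next request is not sent to the same occupied slot again.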