Github user mxm commented on the issue:
https://github.com/apache/flink/pull/2571
Merged
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature
Github user mxm commented on the issue:
https://github.com/apache/flink/pull/2571
Rebased to latest `flip-6`.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes
Github user mxm commented on the issue:
https://github.com/apache/flink/pull/2571
Thanks for the comments @KurtYoung. I've updated the PR. No worries, four
eyes always see more than two and it is natural that ideas progress over time :)
---
If your project is set up for it, you can
Github user mxm commented on the issue:
https://github.com/apache/flink/pull/2571
As discussed in this PR, I've pushed the following changes:
1. Move the slot registration and allocation report to the registration
of the TaskExecutor
2. Let the TaskExecutor
Github user mxm commented on the issue:
https://github.com/apache/flink/pull/2571
@KurtYoung I see, that makes sense. When the TM looses connection to the
RM, it will eventually be detected by the heartbeating and the list of pending
slot allocation removal requests will be cleared.
Github user KurtYoung commented on the issue:
https://github.com/apache/flink/pull/2571
@mxm Regarding the last heartbeat thing, what i really trying to say is,
the failing free request will not stuck in the unconfirmed list forever, since
we have the heartbeat manager to monitoring
Github user mxm commented on the issue:
https://github.com/apache/flink/pull/2571
Great to hear that we're on the same page :)
>I think it's no need to stick to the failed slot when the allocation fails
by rpc. Just put it back to the free pool, and give us another shot.
Github user KurtYoung commented on the issue:
https://github.com/apache/flink/pull/2571
@mxm Thanks for your thoughts, i really like the discussion.:smirk:
You just pointed out another point which i missed in the previous reply.
Actually, i noticed that right after i posted
Github user mxm commented on the issue:
https://github.com/apache/flink/pull/2571
@KurtYoung It is guaranteed that the ResourceManager will receive an RPC
response of some sort. Either a reply from the TaskExecutor, or a timeout/error
which is returned by the future. If the request
Github user KurtYoung commented on the issue:
https://github.com/apache/flink/pull/2571
@mxm I think that situation will happen when RM request slot from TM, TM
accept the request, but the response was somehow missed. Typically, when an rpc
error occurred, the send could not know
Github user mxm commented on the issue:
https://github.com/apache/flink/pull/2571
Thanks for the feedback, @beyond1920 and @KurtYoung.
You're right, the changes don't allow slots to be released by a
TaskExecutor. We can change that by explicitly reporting to the RM if a slot
Github user KurtYoung commented on the issue:
https://github.com/apache/flink/pull/2571
@mxm Thanks for the simplification, i like the idea. When i wrote the first
version of the SlotManager, i have noticed maybe i made things too complicated,
but i didn't figure out how to make
Github user beyond1920 commented on the issue:
https://github.com/apache/flink/pull/2571
@mxm, What would happen under following cases: taskExecutor releases a
registered slot, then taskExecutor reports its latest slotReport to
ResourceManager, ResourceManager should remove the old
Github user mxm commented on the issue:
https://github.com/apache/flink/pull/2571
CC @beyond1920 @KurtYoung
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes
14 matches
Mail list logo