[
https://issues.apache.org/jira/browse/FLINK-4348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430870#comment-15430870
]
ASF GitHub Bot commented on FLINK-4348:
---------------------------------------
Github user tillrohrmann commented on the issue:
https://github.com/apache/flink/pull/2389
Thanks for your contribution @beyond1920 :-) I've reviewed the PR and I
think it would be good if we split it up into several parts. The first part
could be the heartbeat logic a.k.a. `HeartbeatManager`. Here we could try to
implement a generic sending and receiving end. I think the implementation can
be almost entirely independent of the JM, RM and TE implementations (similar to
`RetryingRegistration`). This will allow us to easily test this component.
The next step would be the integration of this component into the RM, JM
and TE.
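To illustrate what a component-independent receiving end could look like, here is a minimal sketch. All names (`HeartbeatListener`, `HeartbeatMonitor`, `monitorTarget`, `checkTimeouts`) are hypothetical and not taken from the PR or from Flink; the sending side is left out. The caller passes in the current time, which is an assumption made so that the logic can be unit-tested without a real clock or RPC layer.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical callback; the owner (RM, JM or TE) decides how to react to a timeout. */
interface HeartbeatListener<ID> {
    void notifyHeartbeatTimeout(ID target);
}

/** Hypothetical receiving end: remembers the last heartbeat per target and reports timeouts. */
final class HeartbeatMonitor<ID> {
    private final long timeoutMillis;
    private final HeartbeatListener<ID> listener;
    private final Map<ID, Long> lastSeen = new HashMap<>();

    HeartbeatMonitor(long timeoutMillis, HeartbeatListener<ID> listener) {
        this.timeoutMillis = timeoutMillis;
        this.listener = listener;
    }

    /** Starts monitoring a target; the registration time counts as its first heartbeat. */
    void monitorTarget(ID target, long nowMillis) {
        lastSeen.put(target, nowMillis);
    }

    /** Records a heartbeat; heartbeats from unmonitored targets are ignored. */
    void receiveHeartbeat(ID target, long nowMillis) {
        if (lastSeen.containsKey(target)) {
            lastSeen.put(target, nowMillis);
        }
    }

    /** Removes every target whose last heartbeat is older than the timeout and notifies the listener. */
    List<ID> checkTimeouts(long nowMillis) {
        List<ID> timedOut = new ArrayList<>();
        for (Map.Entry<ID, Long> entry : lastSeen.entrySet()) {
            if (nowMillis - entry.getValue() > timeoutMillis) {
                timedOut.add(entry.getKey());
            }
        }
        for (ID target : timedOut) {
            lastSeen.remove(target);
            listener.notifyHeartbeatTimeout(target);
        }
        return timedOut;
    }
}
```

Because nothing in this sketch refers to the RM, JM or TE, it could be tested in isolation and wired into each component in a later step.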
Concerning the slot request logic, I think we should wait a little bit for
the `SlotManager` implementation. It could be the case that the `SlotManager`
will make the RPCs to the `TaskExecutor` and not the RM. But for the moment the
interface is, afaik, not well enough specified to program against it.
The failure notification should also be treated in a separate PR imo. The
notification can have multiple origins (e.g. `HeartbeatManager` or the resource
management framework) and its design should accommodate all of them.
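As a rough sketch of an origin-agnostic design: the two origins below are the ones named above, while every type and method name (`FailureOrigin`, `FailureNotification`, `FailureHandler`, `FailureNotifier`) is hypothetical and only meant to show the shape, not an existing Flink API.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical origins, mirroring the two examples named above. */
enum FailureOrigin { HEARTBEAT_TIMEOUT, RESOURCE_FRAMEWORK }

/** Hypothetical message that carries its origin, so handlers need not know who detected the failure. */
final class FailureNotification {
    final String taskExecutorId;
    final FailureOrigin origin;
    final String reason;

    FailureNotification(String taskExecutorId, FailureOrigin origin, String reason) {
        this.taskExecutorId = taskExecutorId;
        this.origin = origin;
        this.reason = reason;
    }
}

/** Hypothetical consumer of failure notifications. */
interface FailureHandler {
    void handleFailure(FailureNotification notification);
}

/** Single entry point that any detector (heartbeats, cluster framework, ...) can report to. */
final class FailureNotifier {
    private final List<FailureHandler> handlers = new ArrayList<>();

    void registerHandler(FailureHandler handler) {
        handlers.add(handler);
    }

    /** Wraps the failure in a notification and forwards it to every registered handler. */
    void reportFailure(String taskExecutorId, FailureOrigin origin, String reason) {
        FailureNotification notification = new FailureNotification(taskExecutorId, origin, reason);
        for (FailureHandler handler : handlers) {
            handler.handleFailure(notification);
        }
    }
}
```

With this shape, adding a further origin would only mean a new enum constant and a new caller of `reportFailure`; the handlers would stay unchanged.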
In general, I think the components should be more thoroughly tested with
more fine-grained unit tests. Furthermore, I think it would be good if we could
revise the code documentation a little bit.
> Implement communication from ResourceManager to TaskManager
> -----------------------------------------------------------
>
> Key: FLINK-4348
> URL: https://issues.apache.org/jira/browse/FLINK-4348
> Project: Flink
> Issue Type: Sub-task
> Components: Cluster Management
> Reporter: Kurt Young
> Assignee: zhangjing
>
> There are three main interactions initiated from the RM to the TM:
> * Heartbeat: the RM uses heartbeats to sync with the TM's slot status.
> * SlotRequest: when the RM decides to assign a slot to a JM, it should first
> send a slot request to the TM. The TM can either accept or reject this request.
> * FailureNotify: in some corner cases, a TM is marked as invalid by the
> cluster manager master (e.g. the YARN master) without the TM itself noticing.
> The RM should send a failure notification to the TM so that the TM can terminate itself.