[ 
https://issues.apache.org/jira/browse/FLINK-4449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15434945#comment-15434945
 ] 

ASF GitHub Bot commented on FLINK-4449:
---------------------------------------

Github user tillrohrmann commented on the issue:

    https://github.com/apache/flink/pull/2410
  
    Thanks for the contribution @beyond1920. The implementation of the 
`HeartbeatScheduler` goes in the right direction so that it is reusable :-) The 
testing is also better. However, we're still mixing different things in this PR 
(e.g. parts of the slot requesting logic).
    
    I think we can further generalize the heartbeating since the heartbeat 
manager is another component which should be reusable across components (e.g. 
for the JobManager to heartbeat the TMs). Furthermore, the receiving end of the 
heartbeating is not properly defined. 
    
    I think it would be best if we first properly define what this should look 
like. For example, I'm not sure whether the exponential backoff strategy is the 
right way to go since it can happen that you wait twice as long as you've 
defined until you're notified about a heartbeat failure. Another question is 
whether every heartbeat connection should be responsible for triggering itself 
or whether the heartbeat manager should be responsible for that. Then we have 
to define the receiving end. Is the heartbeat receiving end an independent 
`RpcEndpoint`? How does the payload delivery work? Does the sender side ask 
for the result (future) or does the receiving side answer via a tell message 
to the heartbeat manager?
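    
    To make the backoff concern concrete, here is a minimal sketch. It is not 
the code from this PR: the class `BackoffDetectionDelay`, the method 
`detectionTime` and the numbers are hypothetical. It assumes that the wait 
doubles after every unanswered heartbeat and that the timeout is only checked 
once a wait has ended.

```java
public class BackoffDetectionDelay {

    /**
     * Simulates a target that never answers. Returns the time (in ms) at which
     * the failure is noticed, assuming the timeout is only checked after each
     * wait and the wait doubles every round (hypothetical model, not the PR's code).
     */
    static long detectionTime(long initialDelayMs, long timeoutMs) {
        long elapsed = 0;
        long delay = initialDelayMs;
        while (elapsed < timeoutMs) {
            elapsed += delay; // wait one full interval without receiving a reply
            delay *= 2;       // exponential backoff: the next wait is twice as long
        }
        return elapsed;
    }

    public static void main(String[] args) {
        long timeoutMs = 7_001;
        long detectedMs = detectionTime(1_000, timeoutMs);
        System.out.println("configured timeout: " + timeoutMs + " ms, "
                + "failure noticed after: " + detectedMs + " ms");
    }
}
```

    With a 1,000 ms initial delay and a 7,001 ms timeout, the check at 7,000 ms 
still passes, the next wait is 8,000 ms, and the failure is only noticed at 
15,000 ms. Under these assumptions the detection time stays below twice the 
timeout plus the initial delay, which is the "twice as long" effect. A fixed 
interval, or capping the last wait at the remaining time, would avoid it.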
    
    I've created an issue where we should continue the discussion: 
https://issues.apache.org/jira/browse/FLINK-4478


> Heartbeat Manager between ResourceManager and TaskExecutor
> ----------------------------------------------------------
>
>                 Key: FLINK-4449
>                 URL: https://issues.apache.org/jira/browse/FLINK-4449
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Cluster Management
>            Reporter: zhangjing
>            Assignee: zhangjing
>
> HeartbeatManager is responsible for the heartbeat between the ResourceManager 
> and the TaskExecutors.
> 1. Register TaskExecutors
> Register heartbeat targets. If the heartbeat response for a target is 
> not reported in time, mark the target as failed and notify the ResourceManager.
> 2. Trigger heartbeat
> Trigger the heartbeat from the ResourceManager to the TaskExecutors periodically.
> Each TaskExecutor reports its slot allocation in the heartbeat response.
> The ResourceManager syncs its own slot allocation with the heartbeat response.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
