Botong Huang created YARN-8696:
----------------------------------

             Summary: FederationInterceptor upgrade: home sub-cluster heartbeat async
                 Key: YARN-8696
                 URL: https://issues.apache.org/jira/browse/YARN-8696
             Project: Hadoop YARN
          Issue Type: Task
            Reporter: Botong Huang
            Assignee: Botong Huang


Today in _FederationInterceptor_, the heartbeat to the home sub-cluster is
synchronous. After the heartbeat is sent to the home sub-cluster, the
interceptor waits for the home response to come back before merging and
returning the (merged) heartbeat result to the AM. If the home sub-cluster is
suffering from connection issues, or is down during a YARN RM master-slave
switch, all heartbeat threads in _FederationInterceptor_ are blocked waiting
for the home response. As a result, the successful UAM heartbeats from
secondary sub-clusters are not returned to the AM at all. Additionally, because
we keep the same heartbeat responseId between the AM and the home RM, a lot of
tricky handling is needed for responseId resync when it comes to
_FederationInterceptor_ (part of AMRMProxy, in the NM) work-preserving restart
(YARN-6127, YARN-1336), home RM master-slave switch, etc.

In this patch, we make the heartbeat to the home sub-cluster asynchronous, the
same way we already handle UAM heartbeats to the secondaries, so that a
sub-cluster being down or having connection issues does not prevent the AM from
getting responses from the other sub-clusters. The responseId is also managed
separately for the home sub-cluster and for the AM, and the two increment
independently. The resync logic becomes much cleaner.
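
The idea can be sketched roughly as follows. This is only a minimal
illustration, not the actual patch: the class AsyncHeartbeatSketch, the
SubCluster interface, and the String responses are all hypothetical stand-ins
for the real FederationInterceptor, RM client, and AllocateResponse types. It
assumes one in-flight home heartbeat at a time and two separate counters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch only; not the real FederationInterceptor. */
class AsyncHeartbeatSketch {
  /** Hypothetical stand-in for a client talking to the home RM. */
  interface SubCluster {
    String allocate(int responseId) throws Exception;
  }

  private final SubCluster home;
  // Daemon thread so a hung home RM call never keeps the JVM alive.
  private final ExecutorService homeThread =
      Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "home-heartbeat");
        t.setDaemon(true);
        return t;
      });
  // Responses that have arrived since the last AM heartbeat.
  private final Queue<String> arrived = new ConcurrentLinkedQueue<>();
  private final AtomicBoolean inFlight = new AtomicBoolean(false);
  // Two independent counters: interceptor<->home RM and AM<->interceptor.
  private final AtomicInteger homeResponseId = new AtomicInteger(0);
  private int amResponseId = 0; // guarded by "this"

  AsyncHeartbeatSketch(SubCluster home) {
    this.home = home;
  }

  /** Called on every AM heartbeat; never waits for the home RM. */
  synchronized List<String> amHeartbeat() {
    // Start a home heartbeat unless the previous one is still outstanding.
    if (inFlight.compareAndSet(false, true)) {
      homeThread.submit(() -> {
        try {
          arrived.add(home.allocate(homeResponseId.get()));
          homeResponseId.incrementAndGet(); // advance only on success
        } catch (Exception e) {
          // Home RM unreachable: keep the old id and retry on a later heartbeat.
        } finally {
          inFlight.set(false);
        }
      });
    }
    // Return whatever has arrived so far (possibly nothing).
    List<String> merged = new ArrayList<>();
    for (String r = arrived.poll(); r != null; r = arrived.poll()) {
      merged.add(r);
    }
    amResponseId++; // the AM-side id advances regardless of home RM state
    return merged;
  }

  synchronized int getAmResponseId() {
    return amResponseId;
  }

  int getHomeResponseId() {
    return homeResponseId.get();
  }

  public static void main(String[] args) {
    AsyncHeartbeatSketch s = new AsyncHeartbeatSketch(id -> "home-" + id);
    System.out.println(s.amHeartbeat());
    System.out.println("AM responseId: " + s.getAmResponseId());
  }
}
```

In this sketch, a home RM that hangs only stalls the background thread. The AM
still gets a (possibly empty) response on every heartbeat and its responseId
keeps advancing, while the home-side responseId moves only when the home RM
actually answers.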



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
