[ https://issues.apache.org/jira/browse/YARN-5677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15525172#comment-15525172 ]
Jian He commented on YARN-5677:
-------------------------------

bq. when the active RM loses contact with the ZK node

Definitely, the leader elector in RM should not keep retrying in this case. I don't remember whether this issue is fixed or not (cc [~xgong]). IIUC, the leader elector in this case should notice that the connection is lost, and signal RM to transition to standby. Are you testing with the latest code in trunk?

> RM can be in active-active state for an extended period
> -------------------------------------------------------
>
>                 Key: YARN-5677
>                 URL: https://issues.apache.org/jira/browse/YARN-5677
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>   Affects Versions: 2.7.3, 3.0.0-alpha1
>            Reporter: Daniel Templeton
>            Assignee: Daniel Templeton
>            Priority: Critical
>
> Both branch-2.8/trunk and branch-2.7 have issues when the active RM loses
> contact with the ZK node(s).
> In branch-2.7, the RM will retry the connection 1000 times by default.
> Attempting to contact a node which cannot be reached is slow, which means the
> active can take over an hour to realize it is no longer active. I clocked it
> at about an hour and a half in my tests. The solution appears to be to add
> some time awareness into the retry loop.
> In branch-2.8/trunk, there is no maximum number of retries that I see. It
> appears the connection will be retried forever, with the active never
> figuring out it's no longer active. In my testing, the active-active state
> lasted almost 2 hours with no sign of stopping before I killed it. The
> solution appears to be to cap the number of retries or amount of time spent
> retrying.
> This issue is significant because of the asynchronous nature of job
> submission. If the active doesn't know it's not active, it will buffer up
> job submissions until it finally realizes it has become the standby. Then it
> will fail all the job submissions in bulk. In high-volume workflows, that
> behavior can create huge mass job failures.
> This issue is also important because the node managers will not fail over to
> the new active until the old active realizes it's the standby. Workloads
> submitted after the old active loses contact with ZK will therefore fail to
> be executed regardless of which RM the clients contact.
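
The issue description quoted above proposes two remedies: adding time awareness to the retry loop (branch-2.7) and capping the number of retries or the time spent retrying (branch-2.8/trunk). As a rough illustration only, and not the actual RM or leader-elector code, a loop bounded by both limits might look like the sketch below. The class `BoundedRetry`, the method `retryUntil`, and all parameter names are hypothetical and do not correspond to any Hadoop API.

```java
import java.util.concurrent.Callable;

// Hypothetical sketch: none of these names exist in Hadoop.
public class BoundedRetry {
    /** Outcome of the retry loop. */
    public enum Result { CONNECTED, GAVE_UP }

    /**
     * Calls attempt until it returns true, the retry cap is reached,
     * or the time budget is spent, whichever comes first.
     * Assumes maxRetries, maxElapsedMs and sleepMs are non-negative.
     */
    public static Result retryUntil(Callable<Boolean> attempt, int maxRetries,
                                    long maxElapsedMs, long sleepMs)
            throws InterruptedException {
        // Monotonic deadline, so wall-clock adjustments do not affect it.
        final long deadline = System.nanoTime() + maxElapsedMs * 1_000_000L;
        for (int i = 0; i < maxRetries; i++) {
            try {
                if (Boolean.TRUE.equals(attempt.call())) {
                    return Result.CONNECTED;
                }
            } catch (InterruptedException e) {
                throw e; // do not swallow interruption
            } catch (Exception e) {
                // Any other failure counts as one failed attempt.
            }
            long remainingMs = (deadline - System.nanoTime()) / 1_000_000L;
            if (remainingMs <= 0) {
                break; // time budget spent, stop even if retries remain
            }
            if (i + 1 < maxRetries) {
                // Never sleep past the deadline.
                Thread.sleep(Math.min(sleepMs, remainingMs));
            }
        }
        // Caller would treat this as the signal to transition to standby.
        return Result.GAVE_UP;
    }
}
```

Note that a deadline checked between attempts cannot cut short a single slow attempt, so each connection attempt would still need its own timeout for the overall bound to hold.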