[ https://issues.apache.org/jira/browse/YARN-5694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15706745#comment-15706745 ]
Daniel Templeton commented on YARN-5694:
----------------------------------------

Thanks, [~jianhe]. Yeah, it probably should go into 2.6 as well.

> ZKRMStateStore can prevent the transition to standby in branch-2.7 if the ZK node is unreachable
> ------------------------------------------------------------------------------------------------
>
> Key: YARN-5694
> URL: https://issues.apache.org/jira/browse/YARN-5694
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.7.3
> Reporter: Daniel Templeton
> Assignee: Daniel Templeton
> Priority: Critical
> Labels: oct16-medium
> Attachments: YARN-5694.001.patch, YARN-5694.002.patch,
> YARN-5694.003.patch, YARN-5694.004.patch, YARN-5694.004.patch,
> YARN-5694.005.patch, YARN-5694.006.patch, YARN-5694.007.patch,
> YARN-5694.008.patch, YARN-5694.branch-2.7.001.patch,
> YARN-5694.branch-2.7.002.patch, YARN-5694.branch-2.7.004.patch,
> YARN-5694.branch-2.7.005.patch
>
>
> {{ZKRMStateStore.doStoreMultiWithRetries()}} holds the lock while trying to
> talk to ZK. If the connection fails, it will retry while still holding the
> lock. The retries are intended to be strictly time limited, but in the case
> that the ZK node is unreachable, the time limit fails, resulting in the
> thread holding the lock for over an hour. Transitioning the RM to standby
> requires that same lock, so in exactly the case that the RM should be
> transitioning to standby, the {{VerifyActiveStatusThread}} blocks it from
> happening.
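
The lock contention described in the issue can be illustrated with a small standalone sketch. This is not the ZKRMStateStore code: the class LockHeldDuringRetryDemo, its methods, the ReentrantLock, and the sleep-based "retry" are all hypothetical stand-ins, chosen only to show how a retry loop that keeps the lock starves another thread that needs the same lock.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class LockHeldDuringRetryDemo {
    // Hypothetical stand-in for the lock shared by the store and the
    // transition-to-standby path; the real code may use a different mechanism.
    private final ReentrantLock storeLock = new ReentrantLock();
    // Signals that the writer thread has acquired the lock.
    private final CountDownLatch retryLoopStarted = new CountDownLatch(1);

    /** Simulates a store operation that retries without releasing the lock. */
    void storeWithRetries(int attempts, long retryIntervalMs) throws InterruptedException {
        storeLock.lock();
        try {
            retryLoopStarted.countDown();
            for (int i = 0; i < attempts; i++) {
                // Each simulated attempt "fails" (ZK unreachable) and waits
                // before the next one, still holding the lock.
                Thread.sleep(retryIntervalMs);
            }
        } finally {
            storeLock.unlock();
        }
    }

    /** Simulates a transition to standby that needs the same lock. */
    boolean tryTransitionToStandby(long waitMs) throws InterruptedException {
        if (!storeLock.tryLock(waitMs, TimeUnit.MILLISECONDS)) {
            return false; // lock still held by the retrying thread
        }
        try {
            return true;
        } finally {
            storeLock.unlock();
        }
    }

    /** Returns {transition succeeded during retries, transition succeeded afterwards}. */
    static boolean[] runDemo() {
        LockHeldDuringRetryDemo demo = new LockHeldDuringRetryDemo();
        Thread writer = new Thread(() -> {
            try {
                demo.storeWithRetries(5, 100); // holds the lock for about 500 ms
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        try {
            writer.start();
            demo.retryLoopStarted.await();
            boolean during = demo.tryTransitionToStandby(50);
            writer.join();
            boolean after = demo.tryTransitionToStandby(50);
            return new boolean[] {during, after};
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] result = runDemo();
        System.out.println("transition while retrying: " + result[0]); // false
        System.out.println("transition after retries: " + result[1]);  // true
    }
}
```

In this sketch the transition gives up after 50 ms only because it uses a timed tryLock. A caller that waits for the lock unconditionally would instead be blocked for as long as the retry loop runs, which is the behavior the issue reports.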