chenwencan created YARN-11292:
---------------------------------

             Summary: resourcemanager no longer reconnects to zk
                 Key: YARN-11292
                 URL: https://issues.apache.org/jira/browse/YARN-11292
             Project: Hadoop YARN
          Issue Type: Bug
          Components: resourcemanager
    Affects Versions: 3.3.3
            Reporter: chenwencan

This problem has occurred in our environment. The sequence of events is as follows:
# A network exception occurs between the ResourceManager and ZooKeeper.
# The ResourceManager reconnects to ZooKeeper successfully.
# The ZooKeeper session expires.
# The ResourceManager creates a new ZooKeeper client and tries to reconnect.
# If the reconnect fails, an RMFatalEvent is triggered.
# A new thread is then started to keep reconnecting and to rejoin the election. However, the variable hasAlreadyRun ensures that the runnable runs only once, so if the reconnect fails again there is no further chance to reconnect.

{code:java}
private class StandByTransitionRunnable implements Runnable {
  // The atomic variable to make sure multiple threads with the same runnable
  // run only once.
  private final AtomicBoolean hasAlreadyRun = new AtomicBoolean(false);

  @Override
  public void run() {
    // Run this only once, even if multiple threads end up triggering
    // this simultaneously.
    if (hasAlreadyRun.getAndSet(true)) {
      return;
    }

    if (rmContext.isHAEnabled()) {
      try {
        // Transition to standby and reinit active services
        LOG.info("Transitioning RM to Standby mode");
        transitionToStandby(true);
        EmbeddedElector elector = rmContext.getLeaderElectorService();
        if (elector != null) {
          elector.rejoinElection();
        }
      } catch (Exception e) {
        LOG.error(FATAL, "Failed to transition RM to Standby mode.", e);
        ExitUtil.terminate(1, e);
      }
    }
  }
}
{code}

So I think using a lock here would be better.
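
A minimal sketch of the lock idea, not a patch against Hadoop: the class LockGuardedTransition and its Action callback are hypothetical stand-ins for StandByTransitionRunnable and for the transitionToStandby/rejoinElection calls. It assumes the intent is that only one thread performs the transition at a time, while a later trigger may try again once the earlier attempt has finished.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical stand-in for StandByTransitionRunnable; not Hadoop code. */
class LockGuardedTransition implements Runnable {

  /** Hypothetical callback standing in for transitionToStandby + rejoinElection. */
  interface Action {
    void run() throws Exception;
  }

  private final ReentrantLock lock = new ReentrantLock();
  private final AtomicInteger attempts = new AtomicInteger();
  private final Action action;

  LockGuardedTransition(Action action) {
    this.action = action;
  }

  @Override
  public void run() {
    // If another thread is already performing the transition, skip this call.
    // Unlike a one-shot AtomicBoolean, the guard is released afterwards,
    // so a later trigger can attempt the transition again.
    if (!lock.tryLock()) {
      return;
    }
    try {
      attempts.incrementAndGet();
      action.run();
    } catch (Exception e) {
      // The quoted Hadoop code terminates the process here; this sketch only reports.
      System.err.println("Transition attempt failed: " + e.getMessage());
    } finally {
      lock.unlock();
    }
  }

  /** Number of times the guarded action was actually started. */
  int attempts() {
    return attempts.get();
  }

  public static void main(String[] args) {
    // Simulate one failed reconnect followed by a successful one.
    final int[] failuresLeft = {1};
    LockGuardedTransition transition = new LockGuardedTransition(() -> {
      if (failuresLeft[0]-- > 0) {
        throw new IllegalStateException("zk unreachable");
      }
    });
    transition.run(); // first trigger: the action fails
    transition.run(); // second trigger: allowed to run again
    System.out.println("attempts = " + transition.attempts());
  }
}
```

With the AtomicBoolean guard the second trigger would return immediately. Here it runs because the lock was released when the first attempt finished. Whether a retry is appropriate after a failed transition, rather than terminating, is a separate decision that this sketch does not settle.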