Wilfred Spiegelenburg created YARN-3742: -------------------------------------------
Summary: YARN RM will shut down if ZKClient creation times out Key: YARN-3742 URL: https://issues.apache.org/jira/browse/YARN-3742 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Reporter: Wilfred Spiegelenburg The RM goes down showing the following stacktrace if the ZK client connection fails to be created. We should not exit but transition to StandBy and stop doing things and let the other RM take over. {code} 2015-04-19 01:22:20,513 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause: java.io.IOException: Wait for ZKClient creation timed out at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1066) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1090) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.existsWithRetries(ZKRMStateStore.java:996) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationStateInternal(ZKRMStateStore.java:643) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppTransition.transition(RMStateStore.java:162) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppTransition.transition(RMStateStore.java:147) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:806) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:879) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:874) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:745) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)