[ https://issues.apache.org/jira/browse/YARN-6590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karthik Kambatla resolved YARN-6590.
------------------------------------
    Resolution: Invalid

[~wuchang1989], in the future, please reach out on our user mailing list for questions like this.

Indeed, you should configure {{yarn.resourcemanager.recovery.enabled}} for the ResourceManager to recover jobs.

Recovery is documented here: https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRestart.html
HA (Failover) is documented here: https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerHA.html

> ResourceManager Master/Slave transition make all applications killed
> --------------------------------------------------------------------
>
>                 Key: YARN-6590
>                 URL: https://issues.apache.org/jira/browse/YARN-6590
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.7.3
>         Environment: Linux
>            Reporter: wuchang
>            Priority: Critical
>
> My YARN is configured for HA. It seems that, because of a ZK connection
> timeout, the active ResourceManager became standby and the standby one
> became active, i.e. a ResourceManager active/standby transition. But the
> processes of both RMs are OK.
> Below is the ResourceManager error log:
> {noformat}
> 2017-05-12 12:47:40,150 INFO org.apache.hadoop.yarn.server.resourcemanager.security.NMTokenSecretManagerInRM: Sending NMToken for nodeId : 10.120.117.100:37900 for container : container_1494505293131_4378_01_000007
> 2017-05-12 12:47:40,150 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1494505293131_4378_01_000007 Container Transitioned from ALLOCATED to ACQUIRED
> 2017-05-12 12:47:40,150 INFO org.apache.hadoop.yarn.server.resourcemanager.security.NMTokenSecretManagerInRM: Sending NMToken for nodeId : 10.120.117.108:46066 for container : container_1494505293131_4378_01_000008
> 2017-05-12 12:47:40,150 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1494505293131_4378_01_000008 Container Transitioned from ALLOCATED to ACQUIRED
> 2017-05-12 12:47:40,166 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server 10.120.117.104/10.120.117.104:2181. Will not attempt to authenticate using SASL (unknown error)
> 2017-05-12 12:47:40,168 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to 10.120.117.104/10.120.117.104:2181, initiating session
> 2017-05-12 12:47:40,170 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session expired. Entering neutral mode and rejoining...
> 2017-05-12 12:47:40,170 INFO org.apache.zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x685bcd9343dfc3f8 has expired, closing socket connection
> 2017-05-12 12:47:40,170 INFO org.apache.hadoop.ha.ActiveStandbyElector: Trying to re-establish ZK session
> {noformat}
> In my opinion, this active/standby transition *should not* kill my running
> applications, but in fact, when the transition happened, all the running
> YARN-based MR and Spark jobs were killed. Below is part of my YARN
> configuration.
> {code}
> <property>
>   <name>yarn.resourcemanager.zk-address</name>
>   <value>zkServer1:2181,zkServer2:2181,zkServer3:2181,zkServer4:2181</value>
> </property>
> <property>
>   <name>yarn.resourcemanager.zk-timeout-ms</name>
>   <value>30000</value>
> </property>
> <property>
>   <name>yarn.resourcemanager.store.class</name>
>   <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
> </property>
> <property>
>   <name>yarn.resourcemanager.ha.automatic-failover.enabled</name>
>   <value>true</value>
> </property>
> {code}
> So, is any configuration missing? I notice that I didn't set
> {noformat}yarn.resourcemanager.recovery.enabled{noformat} to true, and the
> default value is false. But according to the official documentation, this
> setting is used for ResourceManager restart, not for ResourceManager
> active/standby transition.
> Any suggestions?
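
For illustration only, the resolution above can be sketched as one extra property in the reporter's yarn-site.xml. The property name comes from this thread. The value true is an assumption that follows from the reporter's remark that the default is false. The state store and ZooKeeper settings quoted above are assumed to stay unchanged, and the rest of the HA setup is assumed to be in place already.

{code}
<!-- Sketch: enable RM state recovery so that the RM that becomes active
     can reload application state from the configured state store
     (ZKRMStateStore in the configuration quoted above). -->
<property>
  <name>yarn.resourcemanager.recovery.enabled</name>
  <value>true</value>
</property>
{code}

Both ResourceManagers would need the same setting. The ResourceManagerRestart page linked above describes the recovery options in more detail.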