[jira] [Resolved] (YARN-6590) ResourceManager Master/Slave transition make all applications killed

Karthik Kambatla (JIRA) Fri, 12 May 2017 17:31:29 -0700

     [ https://issues.apache.org/jira/browse/YARN-6590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Karthik Kambatla resolved YARN-6590.
------------------------------------
    Resolution: Invalid

[~wuchang1989], in the future, please reach out on our user mailing list for 
questions like this.

Indeed you should configure {{yarn.resourcemanager.recovery.enabled}} for the 
ResourceManager to recover jobs. 
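
As a minimal sketch (not taken from the reporter's actual file), the missing 
setting could sit in yarn-site.xml next to the state store already shown in the 
report. The property names are the documented ones; the layout and the choice 
to keep the ZK-based store are assumptions:

{code}
<!-- Illustrative yarn-site.xml fragment; assumes the ZKRMStateStore from the report is kept -->
<property>
  <name>yarn.resourcemanager.recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.store.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
</property>
{code}

With recovery enabled, the RM that becomes active should load application 
state from the store rather than start with none.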

Recovery is documented here: 
https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRestart.html

HA (Failover) is documented here: 
https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerHA.html

> ResourceManager Master/Slave transition make all applications killed
> --------------------------------------------------------------------
>
>                 Key: YARN-6590
>                 URL: https://issues.apache.org/jira/browse/YARN-6590
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.7.3
>         Environment: Linux
>            Reporter: wuchang
>            Priority: Critical
>
> My YARN cluster is configured for HA. It seems that, because of a ZK 
> connection timeout, the active ResourceManager became standby and the standby 
> one became active, i.e. a ResourceManager active/standby transition occurred. 
> Both RM processes themselves are fine. Below is the ResourceManager error log:
> {noformat}
> 2017-05-12 12:47:40,150 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.security.NMTokenSecretManagerInRM:
>  Sending NMToken for nodeId : 10.120.117.100:37900 for container : 
> container_1494505293131_4378_01_000007
> 2017-05-12 12:47:40,150 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
> container_1494505293131_4378_01_000007 Container Transitioned from ALLOCATED 
> to ACQUIRED
> 2017-05-12 12:47:40,150 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.security.NMTokenSecretManagerInRM:
>  Sending NMToken for nodeId : 10.120.117.108:46066 for container : 
> container_1494505293131_4378_01_000008
> 2017-05-12 12:47:40,150 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
> container_1494505293131_4378_01_000008 Container Transitioned from ALLOCATED 
> to ACQUIRED
> 2017-05-12 12:47:40,166 INFO org.apache.zookeeper.ClientCnxn: Opening socket 
> connection to server 10.120.117.104/10.120.117.104:2181. Will not attempt to 
> authenticate using SASL (unknown error)
> 2017-05-12 12:47:40,168 INFO org.apache.zookeeper.ClientCnxn: Socket 
> connection established to 10.120.117.104/10.120.117.104:2181, initiating 
> session
> 2017-05-12 12:47:40,170 INFO org.apache.hadoop.ha.ActiveStandbyElector: 
> Session expired. Entering neutral mode and rejoining...
> 2017-05-12 12:47:40,170 INFO org.apache.zookeeper.ClientCnxn: Unable to 
> reconnect to ZooKeeper service, session 0x685bcd9343dfc3f8 has expired, 
> closing socket connection
> 2017-05-12 12:47:40,170 INFO org.apache.hadoop.ha.ActiveStandbyElector: 
> Trying to re-establish ZK session
> {noformat}
> In my opinion, this active/standby transition *should not* kill my running 
> applications, but in fact, when the transition happened, all the running 
> YARN-based MR and Spark jobs were killed. Below is part of my YARN 
> configuration.
> {code}
>         <property>
>                 <name>yarn.resourcemanager.zk-address</name>
>                 <value>zkServer1:2181,zkServer2:2181,zkServer3:2181,zkServer4:2181</value>
>         </property>
>         <property>
>                 <name>yarn.resourcemanager.zk-timeout-ms</name>
>                 <value>30000</value>
>         </property>
>         <property>
>                 <name>yarn.resourcemanager.store.class</name>
>                 <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
>         </property>
>         <property>
>                 <name>yarn.resourcemanager.ha.automatic-failover.enabled</name>
>                 <value>true</value>
>         </property>
> {code}
> So, is any configuration missing? I notice that I didn't set 
> {{yarn.resourcemanager.recovery.enabled}} to true, and the default value is 
> false. But according to the official documentation, this setting is used for 
> ResourceManager restart rather than for ResourceManager active/standby 
> transition.
> Any suggestions?



