[ https://issues.apache.org/jira/browse/YARN-4892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Fang Xie updated YARN-4892:
---------------------------
    Description:
Enable ResourceManager recovery by setting the following properties:

<property>
  <description>Enable RM to recover state after starting. If true, then yarn.resourcemanager.store.class must be specified. </description>
  <name>yarn.resourcemanager.recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <description> </description>
  <name>yarn.resourcemanager.store.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore</value>
</property>
<property>
  <description> </description>
  <name>yarn.resourcemanager.fs.state-store.uri</name>
  <value>hdfs://apple02:9000/rmstore</value>
</property>

Run a distributedshell job. While the job is running, kill the ResourceManager and then restart it. The job never finishes and hangs.
Both the Fair Scheduler and the Capacity Scheduler show this issue.

  was:
Enable ResourceManager recovery by setting the following properties:

<property>
  <description>Enable RM to recover state after starting. If true, then yarn.resourcemanager.store.class must be specified. </description>
  <name>yarn.resourcemanager.recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <description> </description>
  <name>yarn.resourcemanager.store.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore</value>
</property>
<property>
  <description> </description>
  <name>yarn.resourcemanager.fs.state-store.uri</name>
  <value>hdfs://apple02:9000/rmstore</value>
</property>

Run a distributedshell job. While the job is running, kill the ResourceManager and then restart it. The job never finishes and hangs.

> Job will be hung and can not be finished after resource manager restart and enable recovery
> -------------------------------------------------------------------------------------------
>
>                 Key: YARN-4892
>                 URL: https://issues.apache.org/jira/browse/YARN-4892
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.7.0
>            Reporter: Fang Xie
>            Priority: Critical
>
> Enable ResourceManager recovery by setting the following properties:
>
> <property>
>   <description>Enable RM to recover state after starting. If true, then yarn.resourcemanager.store.class must be specified. </description>
>   <name>yarn.resourcemanager.recovery.enabled</name>
>   <value>true</value>
> </property>
> <property>
>   <description> </description>
>   <name>yarn.resourcemanager.store.class</name>
>   <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore</value>
> </property>
> <property>
>   <description> </description>
>   <name>yarn.resourcemanager.fs.state-store.uri</name>
>   <value>hdfs://apple02:9000/rmstore</value>
> </property>
>
> Run a distributedshell job. While the job is running, kill the ResourceManager and then restart it. The job never finishes and hangs.
> Both the Fair Scheduler and the Capacity Scheduler show this issue.
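
For anyone reproducing this, it may help to first confirm that the recovery settings are consistent with each other. The first property's description says that a store class must be specified when recovery is enabled. The following is a minimal Python sketch of that one check, assuming the three properties are wrapped in a <configuration> root as in a typical yarn-site.xml. The helper names (load_properties, recovery_config_problems) are hypothetical and not part of Hadoop. The sketch looks only at the given XML text and ignores any defaults that YARN itself applies.

```python
import xml.etree.ElementTree as ET

# Property names taken from the issue description.
RECOVERY_KEY = "yarn.resourcemanager.recovery.enabled"
STORE_CLASS_KEY = "yarn.resourcemanager.store.class"


def load_properties(xml_text):
    """Parse Hadoop-style <property> entries into a name -> value dict."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip()
        if name:
            props[name] = value
    return props


def recovery_config_problems(props):
    """Return a list of problems with the recovery settings (hypothetical helper).

    Only the rule stated in the description is checked: if recovery is
    enabled, a store class must be specified.
    """
    problems = []
    recovery_on = props.get(RECOVERY_KEY, "false").lower() == "true"
    if recovery_on and not props.get(STORE_CLASS_KEY):
        problems.append(
            RECOVERY_KEY + " is true but " + STORE_CLASS_KEY + " is not set"
        )
    return problems


# The properties from the report, wrapped in an assumed <configuration> root.
SAMPLE = """
<configuration>
  <property>
    <name>yarn.resourcemanager.recovery.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.resourcemanager.store.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore</value>
  </property>
  <property>
    <name>yarn.resourcemanager.fs.state-store.uri</name>
    <value>hdfs://apple02:9000/rmstore</value>
  </property>
</configuration>
"""

if __name__ == "__main__":
    for problem in recovery_config_problems(load_properties(SAMPLE)):
        print(problem)
```

The configuration in the report passes this check, so the hang described above does not appear to come from a missing store class.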