[ https://issues.apache.org/jira/browse/YARN-2476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tsuyoshi OZAWA resolved YARN-2476.
----------------------------------
    Resolution: Duplicate

> Apps are scheduled in random order after RM failover
> ----------------------------------------------------
>
>                 Key: YARN-2476
>                 URL: https://issues.apache.org/jira/browse/YARN-2476
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.4.1
>         Environment: Linux
>            Reporter: Santosh Marella
>              Labels: ha, high-availability, resourcemanager
>
> RM HA is configured with two RMs, using FileSystemRMStateStore.
> The FairScheduler allocation file is configured in yarn-site.xml:
>
>   <property>
>     <name>yarn.scheduler.fair.allocation.file</name>
>     <value>/opt/mapr/hadoop/hadoop-2.4.1/etc/hadoop/allocation-pools.xml</value>
>   </property>
>
> FS allocation-pools.xml:
>
>   <?xml version="1.0"?>
>   <allocations>
>     <queue name="dev">
>       <minResources>10000 mb,10vcores</minResources>
>       <maxResources>19000 mb,100vcores</maxResources>
>       <maxRunningApps>5525</maxRunningApps>
>       <weight>4.5</weight>
>       <schedulingPolicy>fair</schedulingPolicy>
>       <fairSharePreemptionTimeout>3600</fairSharePreemptionTimeout>
>     </queue>
>     <queue name="default">
>       <minResources>10000 mb,10vcores</minResources>
>       <maxResources>19000 mb,100vcores</maxResources>
>       <maxRunningApps>5525</maxRunningApps>
>       <weight>1.5</weight>
>       <schedulingPolicy>fair</schedulingPolicy>
>       <fairSharePreemptionTimeout>3600</fairSharePreemptionTimeout>
>     </queue>
>     <defaultMinSharePreemptionTimeout>600</defaultMinSharePreemptionTimeout>
>     <fairSharePreemptionTimeout>600</fairSharePreemptionTimeout>
>   </allocations>
>
> Submitted 10 sleep jobs to a FS queue using the command:
>
>   hadoop jar hadoop-mapreduce-examples-2.4.1-mapr-4.0.1-SNAPSHOT.jar sleep \
>       -Dmapreduce.job.queuename=root.dev -m 10 -r 10 -mt 10000 -rt 10000
>
> All the jobs were submitted by the same user, with the same priority, and
> to the same queue. No other jobs were running in the cluster.
> Jobs started executing in the order in which they were submitted
> (jobs 6 to 10 were active, while 11 to 15 were waiting):
>
>   root@perfnode131:/opt/mapr/hadoop/hadoop-2.4.1/logs# yarn application -list
>   Total number of applications (application-types: [] and states:
>   [SUBMITTED, ACCEPTED, RUNNING]): 10
>   Application-Id                  Application-Name  Application-Type  User   Queue     State     Final-State  Progress  Tracking-URL
>   application_1408572781346_0012  Sleep job         MAPREDUCE         userA  root.dev  ACCEPTED  UNDEFINED    0%        N/A
>   application_1408572781346_0014  Sleep job         MAPREDUCE         userA  root.dev  ACCEPTED  UNDEFINED    0%        N/A
>   application_1408572781346_0011  Sleep job         MAPREDUCE         userA  root.dev  ACCEPTED  UNDEFINED    0%        N/A
>   application_1408572781346_0010  Sleep job         MAPREDUCE         userA  root.dev  RUNNING   UNDEFINED    5%        http://perfnode132:52799
>   application_1408572781346_0008  Sleep job         MAPREDUCE         userA  root.dev  RUNNING   UNDEFINED    5%        http://perfnode131:33766
>   application_1408572781346_0009  Sleep job         MAPREDUCE         userA  root.dev  RUNNING   UNDEFINED    5%        http://perfnode132:50964
>   application_1408572781346_0007  Sleep job         MAPREDUCE         userA  root.dev  RUNNING   UNDEFINED    5%        http://perfnode134:52966
>   application_1408572781346_0015  Sleep job         MAPREDUCE         userA  root.dev  ACCEPTED  UNDEFINED    0%        N/A
>   application_1408572781346_0006  Sleep job         MAPREDUCE         userA  root.dev  RUNNING   UNDEFINED    9.5%      http://perfnode134:34094
>   application_1408572781346_0013  Sleep job         MAPREDUCE         userA  root.dev  ACCEPTED  UNDEFINED    0%        N/A
>
> Stopped RM1. There was a failover and RM2 became active.
> But the jobs seem to have started in a different order:
>
>   root@perfnode131:~/scratch/raw_rm_logs_fs_hang# yarn application -list
>   14/08/21 07:26:13 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
>   Total number of applications (application-types: [] and states:
>   [SUBMITTED, ACCEPTED, RUNNING]): 10
>   Application-Id                  Application-Name  Application-Type  User   Queue     State     Final-State  Progress  Tracking-URL
>   application_1408572781346_0012  Sleep job         MAPREDUCE         userA  root.dev  RUNNING   UNDEFINED    5%        http://perfnode134:59351
>   application_1408572781346_0014  Sleep job         MAPREDUCE         userA  root.dev  RUNNING   UNDEFINED    5%        http://perfnode132:37866
>   application_1408572781346_0011  Sleep job         MAPREDUCE         userA  root.dev  RUNNING   UNDEFINED    5%        http://perfnode131:59744
>   application_1408572781346_0010  Sleep job         MAPREDUCE         userA  root.dev  ACCEPTED  UNDEFINED    0%        N/A
>   application_1408572781346_0008  Sleep job         MAPREDUCE         userA  root.dev  ACCEPTED  UNDEFINED    0%        N/A
>   application_1408572781346_0009  Sleep job         MAPREDUCE         userA  root.dev  ACCEPTED  UNDEFINED    0%        N/A
>   application_1408572781346_0007  Sleep job         MAPREDUCE         userA  root.dev  ACCEPTED  UNDEFINED    0%        N/A
>   application_1408572781346_0015  Sleep job         MAPREDUCE         userA  root.dev  RUNNING   UNDEFINED    5%        http://perfnode134:39754
>   application_1408572781346_0006  Sleep job         MAPREDUCE         userA  root.dev  ACCEPTED  UNDEFINED    0%        N/A
>   application_1408572781346_0013  Sleep job         MAPREDUCE         userA  root.dev  RUNNING   UNDEFINED    5%        http://perfnode132:34714
>
> The problem is this:
> - The jobs that were previously in RUNNING state moved to ACCEPTED after failover.
> - The jobs that were previously in ACCEPTED state moved to RUNNING after failover.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
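[Editor's note] The RUNNING/ACCEPTED swap above is consistent with the recovered apps being re-activated without regard to their original submission order. The following is a minimal illustrative sketch of that failure mode — it is not YARN's actual recovery code, and the app names, timestamps, and the five-slot limit are hypothetical stand-ins for the report's ten jobs:

```python
# Sketch: within a queue, apps should be activated in ascending
# submission-time order. If a restarted scheduler rebuilds its app list
# in whatever order the state store yields entries, and then activates
# the first N it sees, the set of running apps can flip after failover.

def activation_order(apps):
    """Return app ids in ascending submission-time order."""
    return [app_id for app_id, t in sorted(apps, key=lambda a: a[1])]

# Ten apps submitted one after another (app 6 first, app 15 last),
# mirroring the jobs in the report. Five "slots" are available.
submitted = [(f"app_{i:04d}", i) for i in range(6, 16)]
SLOTS = 5

# Before failover: activation respects submission order.
running_before = activation_order(submitted)[:SLOTS]

# After failover: recovery iterates the store in an arbitrary order
# (reversed here, for illustration) and activates the first SLOTS apps
# it encounters, ignoring the original submission times.
recovered = list(reversed(submitted))
running_after = [app_id for app_id, _ in recovered[:SLOTS]]

print("running before failover:", running_before)
print("running after failover: ", running_after)
```

Under this model the two RUNNING sets are disjoint, which matches the observed symptom: every previously RUNNING app ends up waiting and vice versa. The fix direction (tracked in the duplicate issue) is to restore ordering metadata during recovery rather than relying on store iteration order.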