[ 
https://issues.apache.org/jira/browse/YARN-2476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA resolved YARN-2476.
----------------------------------
    Resolution: Duplicate

> Apps are scheduled in random order after RM failover
> ----------------------------------------------------
>
>                 Key: YARN-2476
>                 URL: https://issues.apache.org/jira/browse/YARN-2476
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.4.1
>         Environment: Linux
>            Reporter: Santosh Marella
>              Labels: ha, high-availability, resourcemanager
>
> RM HA is configured with 2 RMs, using FileSystemRMStateStore as the RM state store.
> The FairScheduler allocation file is configured in yarn-site.xml:
> <property>
>   <name>yarn.scheduler.fair.allocation.file</name>
>   <value>/opt/mapr/hadoop/hadoop-2.4.1/etc/hadoop/allocation-pools.xml</value>
> </property>
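> For completeness, the HA/recovery side of the setup (described above but not pasted) would look roughly like the following in yarn-site.xml. These are an assumption, not copied from this cluster; the rm-ids are chosen to match the "rm2" seen in the failover log below:
> <property>
>   <name>yarn.resourcemanager.ha.enabled</name>
>   <value>true</value>
> </property>
> <property>
>   <name>yarn.resourcemanager.ha.rm-ids</name>
>   <value>rm1,rm2</value>
> </property>
> <property>
>   <name>yarn.resourcemanager.recovery.enabled</name>
>   <value>true</value>
> </property>
> <property>
>   <name>yarn.resourcemanager.store.class</name>
>   <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore</value>
> </property>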
> FS allocation-pools.xml:
> <?xml version="1.0"?>
> <allocations>
>    <queue name="dev">
>       <minResources>10000 mb,10vcores</minResources>
>       <maxResources>19000 mb,100vcores</maxResources>
>       <maxRunningApps>5525</maxRunningApps>
>       <weight>4.5</weight>
>       <schedulingPolicy>fair</schedulingPolicy>
>       <fairSharePreemptionTimeout>3600</fairSharePreemptionTimeout>
>    </queue>
>    <queue name="default">
>       <minResources>10000 mb,10vcores</minResources>
>       <maxResources>19000 mb,100vcores</maxResources>
>       <maxRunningApps>5525</maxRunningApps>
>       <weight>1.5</weight>
>       <schedulingPolicy>fair</schedulingPolicy>
>       <fairSharePreemptionTimeout>3600</fairSharePreemptionTimeout>
>    </queue>
>     <defaultMinSharePreemptionTimeout>600</defaultMinSharePreemptionTimeout>
>     <fairSharePreemptionTimeout>600</fairSharePreemptionTimeout>
> </allocations>
> Submitted 10 sleep jobs to a FS queue using the command:
>     hadoop jar hadoop-mapreduce-examples-2.4.1-mapr-4.0.1-SNAPSHOT.jar sleep -Dmapreduce.job.queuename=root.dev -m 10 -r 10 -mt 10000 -rt 10000
> All the jobs were submitted by the same user, with the same priority, and to the same queue. No other jobs were running in the cluster. Jobs started executing in the order in which they were submitted (jobs 0006 to 0010 were active, while 0011 to 0015 were waiting):
> root@perfnode131:/opt/mapr/hadoop/hadoop-2.4.1/logs# yarn application -list
> Total number of applications (application-types: [] and states: [SUBMITTED, ACCEPTED, RUNNING]):10
>                 Application-Id    Application-Name    Application-Type    User     Queue       State       Final-State    Progress    Tracking-URL
> application_1408572781346_0012    Sleep job           MAPREDUCE           userA    root.dev    ACCEPTED    UNDEFINED      0%          N/A
> application_1408572781346_0014    Sleep job           MAPREDUCE           userA    root.dev    ACCEPTED    UNDEFINED      0%          N/A
> application_1408572781346_0011    Sleep job           MAPREDUCE           userA    root.dev    ACCEPTED    UNDEFINED      0%          N/A
> application_1408572781346_0010    Sleep job           MAPREDUCE           userA    root.dev    RUNNING     UNDEFINED      5%          http://perfnode132:52799
> application_1408572781346_0008    Sleep job           MAPREDUCE           userA    root.dev    RUNNING     UNDEFINED      5%          http://perfnode131:33766
> application_1408572781346_0009    Sleep job           MAPREDUCE           userA    root.dev    RUNNING     UNDEFINED      5%          http://perfnode132:50964
> application_1408572781346_0007    Sleep job           MAPREDUCE           userA    root.dev    RUNNING     UNDEFINED      5%          http://perfnode134:52966
> application_1408572781346_0015    Sleep job           MAPREDUCE           userA    root.dev    ACCEPTED    UNDEFINED      0%          N/A
> application_1408572781346_0006    Sleep job           MAPREDUCE           userA    root.dev    RUNNING     UNDEFINED      9.5%        http://perfnode134:34094
> application_1408572781346_0013    Sleep job           MAPREDUCE           userA    root.dev    ACCEPTED    UNDEFINED      0%          N/A
> Stopped RM1. There was a failover and RM2 became active. But the jobs seem to have started in a different order:
> root@perfnode131:~/scratch/raw_rm_logs_fs_hang# yarn application -list
> 14/08/21 07:26:13 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
> Total number of applications (application-types: [] and states: [SUBMITTED, ACCEPTED, RUNNING]):10
>                 Application-Id    Application-Name    Application-Type    User     Queue       State       Final-State    Progress    Tracking-URL
> application_1408572781346_0012    Sleep job           MAPREDUCE           userA    root.dev    RUNNING     UNDEFINED      5%          http://perfnode134:59351
> application_1408572781346_0014    Sleep job           MAPREDUCE           userA    root.dev    RUNNING     UNDEFINED      5%          http://perfnode132:37866
> application_1408572781346_0011    Sleep job           MAPREDUCE           userA    root.dev    RUNNING     UNDEFINED      5%          http://perfnode131:59744
> application_1408572781346_0010    Sleep job           MAPREDUCE           userA    root.dev    ACCEPTED    UNDEFINED      0%          N/A
> application_1408572781346_0008    Sleep job           MAPREDUCE           userA    root.dev    ACCEPTED    UNDEFINED      0%          N/A
> application_1408572781346_0009    Sleep job           MAPREDUCE           userA    root.dev    ACCEPTED    UNDEFINED      0%          N/A
> application_1408572781346_0007    Sleep job           MAPREDUCE           userA    root.dev    ACCEPTED    UNDEFINED      0%          N/A
> application_1408572781346_0015    Sleep job           MAPREDUCE           userA    root.dev    RUNNING     UNDEFINED      5%          http://perfnode134:39754
> application_1408572781346_0006    Sleep job           MAPREDUCE           userA    root.dev    ACCEPTED    UNDEFINED      0%          N/A
> application_1408572781346_0013    Sleep job           MAPREDUCE           userA    root.dev    RUNNING     UNDEFINED      5%          http://perfnode132:34714
> The problem is this:
> - The jobs that were previously in RUNNING state moved to ACCEPTED after failover.
> - The jobs that were previously in ACCEPTED state moved to RUNNING after failover.
> In other words, the RM did not preserve the original submission order when recovering and rescheduling the apps.
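>
> Note that the submission order is recoverable from the application IDs themselves: a YARN ApplicationId of the form application_<clusterTimestamp>_<seq> embeds the RM cluster timestamp and a monotonically increasing sequence number, so sorting recovered apps by ID would reconstruct submission order even if the state store returns them in arbitrary order. A minimal, self-contained sketch (plain Java, IDs copied from the listings above; a hypothetical illustration, not the actual RM recovery code):
>
> import java.util.Arrays;
> import java.util.Comparator;
> import java.util.List;
>
> public class RecoveryOrderSketch {
>     public static void main(String[] args) {
>         // A few IDs from the listing above, in the shuffled order a state
>         // store might hand them back after failover.
>         List<String> recovered = Arrays.asList(
>             "application_1408572781346_0012",
>             "application_1408572781346_0014",
>             "application_1408572781346_0006",
>             "application_1408572781346_0009");
>
>         // Sorting by the trailing sequence number restores submission order.
>         // (Real code would compare the cluster timestamp first, too.)
>         recovered.sort(Comparator.comparingInt(RecoveryOrderSketch::sequenceOf));
>         recovered.forEach(System.out::println); // _0006, _0009, _0012, _0014
>     }
>
>     // Parse the sequence number out of "application_<clusterTimestamp>_<seq>".
>     private static int sequenceOf(String appId) {
>         String[] parts = appId.split("_");
>         return Integer.parseInt(parts[2]);
>     }
> }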



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
