[ https://issues.apache.org/jira/browse/MAPREDUCE-5068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13603239#comment-13603239 ]
Vitaly Kruglikov commented on MAPREDUCE-5068:
---------------------------------------------

[~sandyr]: Confirmed -- the same issue is also present in the latest Hadoop CDH4.2.0 distribution.

> Fair Scheduler preemption fails if the other queue has a mapreduce job with some tasks in excess of cluster capacity
> ---------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-5068
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5068
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2, scheduler
>         Environment: Mac OS X; CDH4.1.2
>            Reporter: Vitaly Kruglikov
>              Labels: hadoop
>
> This is reliably reproduced while running CDH4.1.2 on a single Mac OS X machine.
> # Two queues are configured: cjmQ and slotsQ. Both queues are configured with tiny minResources. The intention is for the task(s) of the job in cjmQ to be able to preempt tasks of the job in slotsQ.
> # yarn.nodemanager.resource.memory-mb = 24576
> # First, a long-running 6-map-task (0 reducers) mapreduce job is started in slotsQ with mapreduce.map.memory.mb=4096. Because MRAppMaster's container consumes some memory, only 5 of its 6 map tasks are able to start; the 6th is pending but will never run.
> # Then, a short-running 1-map-task (0 reducers) mapreduce job is submitted via cjmQ with mapreduce.map.memory.mb=2048.
> Expected behavior:
> At this point, because the minimum share of cjmQ had not been met, I expected Fair Scheduler to preempt one of the executing map tasks from the single slotsQ mapreduce job to make room for the single map task of the cjmQ mapreduce job. However, Fair Scheduler didn't preempt any of the running map tasks of the slotsQ job. Instead, the cjmQ job was starved perpetually.
> Since slotsQ had far more than its minimum share already allocated and running, while cjmQ was far below its minimum share (0, in fact), Fair Scheduler should have started preempting, regardless of the one task container from the slotsQ job (the 6th map container) that was never allocated.
> Additional useful info:
> # If I submit a second 1-map-task mapreduce job via cjmQ, the first cjmQ mapreduce job in that queue gets scheduled and its state changes to RUNNING; once that first job completes, the second job submitted via cjmQ is starved until a third job is submitted into cjmQ, and so on. This happens regardless of the values of maxRunningApps in the queue configurations.
> # If, instead of requesting 6 map tasks for the slotsQ job, I request only 5, so that everything fits nicely within yarn.nodemanager.resource.memory-mb - without that 6th pending-but-never-running task - then preemption works as I would have expected. However, I cannot rely on this arrangement because in a production cluster running at full capacity, if a machine dies, the mapreduce job from slotsQ will request new containers for the failed tasks, and because the cluster was already at capacity, those containers will end up pending and will never run, recreating my original scenario of the starving cjmQ job.
> # I initially wrote this up on https://groups.google.com/a/cloudera.org/forum/?fromgroups=#!topic/cdh-user/0zv62pkN5lM, so it would be good to update that group with the resolution.
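As a rough illustration of the capacity arithmetic described above (the MRAppMaster container size used below is an assumption -- 1536 MB is the stock yarn.app.mapreduce.am.resource.mb default and is not stated in the report, which only says the AM "consumes some memory"):

{code}
# Back-of-the-envelope check of why only 5 of the 6 slotsQ map tasks can run.
# ASSUMPTION: the MRAppMaster container is the stock 1536 MB
# (yarn.app.mapreduce.am.resource.mb); the exact value is not given in the report.

node_capacity_mb = 24576          # yarn.nodemanager.resource.memory-mb
slots_map_mb     = 4096           # mapreduce.map.memory.mb for the slotsQ job
slots_am_mb      = 1536           # assumed MRAppMaster container for the slotsQ job

used_mb = slots_am_mb + 5 * slots_map_mb
print(used_mb)                    # 22016 -> 5 maps + AM fit within 24576
print(used_mb + slots_map_mb)     # 26112 -> the 6th map task cannot fit and stays pending

# Headroom left for the cjmQ job (its own AM plus one 2048 MB map task):
print(node_capacity_mb - used_mb) # 2560 -> not enough for ~1536 + 2048, hence the
                                  # expectation that preemption frees a slotsQ map task
{code}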
> Configuration:
> In yarn-site.xml:
> {code}
> <property>
>   <description>Scheduler plug-in class to use instead of the default
>     scheduler.</description>
>   <name>yarn.resourcemanager.scheduler.class</name>
>   <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
> </property>
> {code}
> fair-scheduler.xml:
> {code}
> <configuration>
>   <!-- Site specific FairScheduler configuration properties -->
>   <property>
>     <description>Absolute path to allocation file. An allocation file is an XML
>       manifest describing queues and their properties, in addition to certain
>       policy defaults. This file must be in XML format as described in
>       http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html.
>     </description>
>     <name>yarn.scheduler.fair.allocation.file</name>
>     <value>[obfuscated]/current/conf/site/default/hadoop/fair-scheduler-allocations.xml</value>
>   </property>
>   <property>
>     <description>Whether to use preemption. Note that preemption is experimental
>       in the current version. Defaults to false.</description>
>     <name>yarn.scheduler.fair.preemption</name>
>     <value>true</value>
>   </property>
>   <property>
>     <description>Whether to allow multiple container assignments in one
>       heartbeat. Defaults to false.</description>
>     <name>yarn.scheduler.fair.assignmultiple</name>
>     <value>true</value>
>   </property>
> </configuration>
> {code}
> My fair-scheduler-allocations.xml:
> {code}
> <allocations>
>   <queue name="cjmQ">
>     <!-- minimum amount of aggregate memory; TODO which units??? -->
>     <minResources>2048</minResources>
>     <!-- limit the number of apps from the queue to run at once -->
>     <maxRunningApps>1</maxRunningApps>
>     <!-- either "fifo" or "fair" depending on the in-queue scheduling policy desired -->
>     <schedulingMode>fifo</schedulingMode>
>     <!-- Number of seconds after which the pool can preempt other pools'
>          tasks to achieve its min share. Requires preemption to be enabled in
>          mapred-site.xml by setting mapred.fairscheduler.preemption to true.
>          Defaults to infinity (no preemption). -->
>     <minSharePreemptionTimeout>5</minSharePreemptionTimeout>
>     <!-- Pool's weight in fair sharing calculations. Default is 1.0. -->
>     <weight>1.0</weight>
>   </queue>
>   <queue name="slotsQ">
>     <!-- minimum amount of aggregate memory; TODO which units??? -->
>     <minResources>1</minResources>
>     <!-- limit the number of apps from the queue to run at once -->
>     <maxRunningApps>1</maxRunningApps>
>     <!-- Number of seconds after which the pool can preempt other pools'
>          tasks to achieve its min share. Requires preemption to be enabled in
>          mapred-site.xml by setting mapred.fairscheduler.preemption to true.
>          Defaults to infinity (no preemption). -->
>     <minSharePreemptionTimeout>5</minSharePreemptionTimeout>
>     <!-- Pool's weight in fair sharing calculations. Default is 1.0. -->
>     <weight>1.0</weight>
>   </queue>
>
>   <!-- number of seconds a queue is under its fair share before it will try to
>        preempt containers to take resources from other queues. -->
>   <fairSharePreemptionTimeout>5</fairSharePreemptionTimeout>
> </allocations>
> {code}
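For illustration only, a simplified sketch of the min-share preemption check that the minSharePreemptionTimeout above is meant to trigger; this paraphrases the documented Fair Scheduler behavior, is not the actual scheduler code, and the function and parameter names are made up:

{code}
# Simplified, illustrative model of min-share preemption (not the actual
# org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair implementation).

def mem_to_preempt_for_min_share(usage_mb, min_share_mb, demand_mb,
                                 secs_below_min_share, min_share_timeout_secs):
    """Memory a starved queue should be allowed to reclaim once its timeout expires."""
    if secs_below_min_share < min_share_timeout_secs:
        return 0
    # A queue is only owed the smaller of its min share and what it actually asks for.
    target_mb = min(min_share_mb, demand_mb)
    return max(0, target_mb - usage_mb)

# cjmQ in the scenario above: minResources=2048, usage=0, demand of at least one
# 2048 MB map task, and it has sat below its min share longer than the 5 s timeout,
# so preemption of roughly 2048 MB from slotsQ would be the expected outcome.
print(mem_to_preempt_for_min_share(usage_mb=0, min_share_mb=2048, demand_mb=2048,
                                   secs_below_min_share=30,
                                   min_share_timeout_secs=5))  # -> 2048
{code}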