[ https://issues.apache.org/jira/browse/MAPREDUCE-5068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13602796#comment-13602796 ]

Vitaly Kruglikov commented on MAPREDUCE-5068:
---------------------------------------------

[~sandyr]
Hi Sandy, I believe I've seen some of your helpful posts on the Cloudera Hadoop 
Google group. Excuse me if you're a different Sandy. I think it would be good 
to have a test in Hadoop for this scenario. I filed the issue here because CDH 
ultimately bundles Hadoop components from Apache, so other Apache Hadoop users 
might be impacted. I will try CDH4.1.2. Should I file a second JIRA in 
https://issues.cloudera.org/browse/DISTRO in case CDH4.1.2 exhibits the same 
issue?

Thank you,
Vitaly
                
> Fair Scheduler preemption fails if the other queue has a mapreduce job with 
> some tasks in excess of cluster capacity
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-5068
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5068
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2, scheduler
>         Environment: Mac OS X; CDH4.1.2
>            Reporter: Vitaly Kruglikov
>              Labels: hadoop
>
> This is reliably reproduced while running CDH4.1.2 on a single Mac OS X 
> machine.
> # Two queues are configured: cjmQ and slotsQ. Both queues are given tiny 
> minResources. The intention is for the task(s) of the job in cjmQ to be able 
> to preempt tasks of the job in slotsQ.
> # yarn.nodemanager.resource.memory-mb = 24576
> # First, a long-running 6-map-task (0 reducers) mapreduce job is started in 
> slotsQ with mapreduce.map.memory.mb=4096. Because the MRAppMaster's container 
> consumes some memory, only 5 of its 6 map tasks can start; the 6th remains 
> pending and will never run.
> # Then, a short-running 1-map-task (0 reducers) mapreduce job is submitted 
> via cjmQ with mapreduce.map.memory.mb=2048.
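The container arithmetic in the steps above can be checked with a quick sketch (the 2048 MB MRAppMaster container size is my assumption, since the report does not state it; any AM size between 1 MB and 4096 MB yields the same count):

```python
# Sketch of the memory math from this report. The AM container size below is
# an assumption (not stated in the report), but any value in (0, 4096] MB
# leaves room for exactly 5 of the 6 requested map containers.
node_capacity_mb = 24576      # yarn.nodemanager.resource.memory-mb
am_container_mb = 2048        # assumed MRAppMaster container size
map_task_mb = 4096            # mapreduce.map.memory.mb for the slotsQ job

runnable_maps = (node_capacity_mb - am_container_mb) // map_task_mb
print(runnable_maps)          # 5 -- so the 6th map task stays pending forever
```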
> Expected behavior:
> At this point, because the minimum share of cjmQ had not been met, I expected 
> the Fair Scheduler to preempt one of the running map tasks of the single 
> slotsQ job to make room for the single map task of the cjmQ job. However, the 
> Fair Scheduler did not preempt any of the running map tasks of the slotsQ 
> job; instead, the cjmQ job was starved perpetually. Since slotsQ had far more 
> than its minimum share allocated and running, while cjmQ was far below its 
> minimum share (0, in fact), the Fair Scheduler should have started 
> preempting, regardless of the one task container from the slotsQ job (the 
> 6th map container) that was never allocated.
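The expectation above amounts to the Fair Scheduler's min-share preemption test. A simplified, hypothetical model of that check (my own sketch, not the actual FairScheduler implementation, whose logic is considerably more involved):

```python
def should_preempt_for_min_share(usage_mb, min_share_mb,
                                 starved_for_s, timeout_s):
    """True once a queue has sat below its min share for at least timeout_s.

    Simplified model: the real scheduler also weighs fair shares, chooses
    which containers to kill, and tracks starvation timestamps per queue.
    """
    return usage_mb < min_share_mb and starved_for_s >= timeout_s

# cjmQ: 0 MB allocated vs. a 2048 MB min share, starved past the 5 s
# minSharePreemptionTimeout -- preemption should have kicked in.
print(should_preempt_for_min_share(0, 2048, starved_for_s=6, timeout_s=5))
# -> True
```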
> Additional useful info:
> # If I submit a second 1-map-task mapreduce job via cjmQ, the first cjmQ 
> mapreduce job gets scheduled and its state changes to RUNNING; once that 
> first job completes, the second job submitted via cjmQ is starved until a 
> third job is submitted into cjmQ, and so on. This happens regardless of the 
> values of maxRunningApps in the queue configurations.
> # If, instead of requesting 6 map tasks for the slotsQ job, I request only 5, 
> so that everything fits within yarn.nodemanager.resource.memory-mb (without 
> that 6th pending-but-never-running task), then preemption works as I would 
> have expected. However, I cannot rely on this arrangement: in a production 
> cluster running at full capacity, if a machine dies, the mapreduce job from 
> slotsQ will request new containers for the failed tasks, and because the 
> cluster was already at capacity, those containers will end up pending and 
> never run, recreating my original scenario of the starving cjmQ job.
> # I initially wrote this up on 
> https://groups.google.com/a/cloudera.org/forum/?fromgroups=#!topic/cdh-user/0zv62pkN5lM,
>  so it would be good to update that group with the resolution.
> Configuration:
> In yarn-site.xml:
> {code}
>   <property>
>     <description>Scheduler plug-in class to use instead of the default 
> scheduler.</description>
>     <name>yarn.resourcemanager.scheduler.class</name>
>     
> <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
>   </property>
> {code}
> fair-scheduler.xml:
> {code}
> <configuration>
> <!-- Site specific FairScheduler configuration properties -->
>   <property>
>     <description>Absolute path to allocation file. An allocation file is an 
> XML
>     manifest describing queues and their properties, in addition to certain
>     policy defaults. This file must be in XML format as described in
>     
> http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html.
>     </description>
>     <name>yarn.scheduler.fair.allocation.file</name>
>     
> <value>[obfuscated]/current/conf/site/default/hadoop/fair-scheduler-allocations.xml</value>
>   </property>
>   <property>
>     <description>Whether to use preemption. Note that preemption is 
> experimental
>     in the current version. Defaults to false.</description>
>     <name>yarn.scheduler.fair.preemption</name>
>     <value>true</value>
>   </property>
>   <property>
>     <description>Whether to allow multiple container assignments in one
>     heartbeat. Defaults to false.</description>
>     <name>yarn.scheduler.fair.assignmultiple</name>
>     <value>true</value>
>   </property>
>   
> </configuration>
> {code}
> My fair-scheduler-allocations.xml:
> {code}
> <allocations>
>   <queue name="cjmQ">
>     <!-- minimum amount of aggregate memory; TODO which units??? -->
>     <minResources>2048</minResources>
>     <!-- limit the number of apps from the queue to run at once -->
>     <maxRunningApps>1</maxRunningApps>
>     
>     <!-- either "fifo" or "fair" depending on the in-queue scheduling policy
>     desired -->
>     <schedulingMode>fifo</schedulingMode>
>     <!-- Number of seconds after which the pool can preempt other pools'
>       tasks to achieve its min share. Requires preemption to be enabled in
>       mapred-site.xml by setting mapred.fairscheduler.preemption to true.
>       Defaults to infinity (no preemption). -->
>     <minSharePreemptionTimeout>5</minSharePreemptionTimeout>
>     <!-- Pool's weight in fair sharing calculations. Default is 1.0. -->
>     <weight>1.0</weight>
>   </queue>
>   <queue name="slotsQ">
>     <!-- minimum amount of aggregate memory; TODO which units??? -->
>     <minResources>1</minResources>
>     <!-- limit the number of apps from the queue to run at once -->
>     <maxRunningApps>1</maxRunningApps>
>     <!-- Number of seconds after which the pool can preempt other pools'
>       tasks to achieve its min share. Requires preemption to be enabled in
>       mapred-site.xml by setting mapred.fairscheduler.preemption to true.
>       Defaults to infinity (no preemption). -->
>     <minSharePreemptionTimeout>5</minSharePreemptionTimeout>
>     <!-- Pool's weight in fair sharing calculations. Default is 1.0. -->
>     <weight>1.0</weight>
>   </queue>
>   
>   <!-- number of seconds a queue is under its fair share before it will try to
>   preempt containers to take resources from other queues. -->
>   <fairSharePreemptionTimeout>5</fairSharePreemptionTimeout>
> </allocations>
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira