[ https://issues.apache.org/jira/browse/MAPREDUCE-5068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vitaly Kruglikov updated MAPREDUCE-5068:
----------------------------------------

    Description: 
This is reliably reproduced while running CDH4.1.2 on a single Mac OS X machine.

# Two queues are configured: cjmQ and slotsQ, both with tiny minResources. The 
intention is for the task(s) of the job in cjmQ to be able to preempt tasks of 
the job in slotsQ.
# yarn.nodemanager.resource.memory-mb = 24576
# First, a long-running 6-map-task (0 reducers) mapreduce job is started in 
slotsQ with mapreduce.map.memory.mb=4096. Because MRAppMaster's container 
consumes some memory, only 5 of its 6 map tasks are able to start, and the 6th 
is pending, but will never run.
# Then, a short-running 1-map-task (0 reducers) mapreduce job is submitted via 
cjmQ with mapreduce.map.memory.mb=2048. (A submission sketch for both jobs 
follows this list.)
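
A minimal sketch of how the two jobs might be submitted with the MRv2 Job API. 
The driver class, input paths, and identity mapper are illustrative placeholders, 
not the code actually used; only the queue name and per-map memory settings 
matter for reproducing the scenario.
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class QueueSubmitSketch {

  public static void main(String[] args) throws Exception {
    submit("slotsQ", 4096, new Path(args[0])); // long-running job, 4096 MB maps
    submit("cjmQ", 2048, new Path(args[1]));   // short-running job, 2048 MB map
  }

  private static void submit(String queue, int mapMemMb, Path input) throws Exception {
    Configuration conf = new Configuration();
    conf.set("mapreduce.job.queuename", queue);        // Fair Scheduler queue
    conf.setInt("mapreduce.map.memory.mb", mapMemMb);  // per-map container size

    Job job = Job.getInstance(conf, "demo-" + queue);
    job.setJarByClass(QueueSubmitSketch.class);
    job.setMapperClass(Mapper.class);                  // identity mapper as placeholder work
    job.setNumReduceTasks(0);                          // 0 reducers, as in the report
    job.setInputFormatClass(TextInputFormat.class);
    FileInputFormat.addInputPath(job, input);
    job.setOutputFormatClass(NullOutputFormat.class);  // no output needed
    job.submit();                                      // return without waiting for completion
  }
}
{code}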

Expected behavior:
At this point, because the minimum share of cjmQ had not been met, I expected 
Fair Scheduler to preempt one of the executing map tasks of the single slotsQ 
mapreduce job to make room for the single map task of the cjmQ mapreduce job. 
However, Fair Scheduler didn't preempt any of the running map tasks of the 
slotsQ job; instead, the cjmQ job was starved perpetually. Since slotsQ had far 
more than its minimum share allocated and running, while cjmQ was far below its 
minimum share (0, in fact), Fair Scheduler should have started preempting 
regardless of the one slotsQ task container (the 6th map container) that was 
never allocated.
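
Back-of-the-envelope accounting for this scenario, assuming the 1536 MB default 
for the MRAppMaster container (yarn.app.mapreduce.am.resource.mb); the actual AM 
size was not recorded, but the shape of the argument is the same:
{code}
// Illustrative arithmetic only, not FairScheduler source.
public class PreemptionAccounting {
  public static void main(String[] args) {
    int nodeMemMb      = 24576; // yarn.nodemanager.resource.memory-mb
    int amMemMb        = 1536;  // assumed MRAppMaster container size
    int slotsMapMemMb  = 4096;  // mapreduce.map.memory.mb for the slotsQ job

    int mapsThatFit  = (nodeMemMb - amMemMb) / slotsMapMemMb; // = 5, so the 6th map stays pending
    int slotsQUsage  = amMemMb + mapsThatFit * slotsMapMemMb; // = 22016 MB, far above slotsQ's 1 MB minResources
    int cjmQUsage    = 0;                                     // nothing from cjmQ is running
    int cjmQMinShare = 2048;                                  // cjmQ minResources

    // cjmQ stays below its min share past the 5 s minSharePreemptionTimeout,
    // so one 4096 MB slotsQ map container should be preempted.
    System.out.printf("slotsQ maps running: %d, usage: %d MB%n", mapsThatFit, slotsQUsage);
    System.out.printf("cjmQ below min share: %b (%d MB < %d MB)%n",
        cjmQUsage < cjmQMinShare, cjmQUsage, cjmQMinShare);
  }
}
{code}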

Additional useful info:
# If I submit a second 1-map-task mapreduce job via cjmQ, the first cjmQ 
mapreduce job in that queue gets scheduled and its state changes to RUNNING; 
once that first job completes, the second job submitted via cjmQ is starved 
until a third job is submitted into cjmQ, and so on. This happens regardless of 
the values of maxRunningApps in the queue configurations.
# If, instead of requesting 6 map tasks for the slotsQ job, I request only 5 so 
that everything fits into yarn.nodemanager.resource.memory-mb without that 6th 
pending-but-never-running task, then preemption works as I would have expected 
(one way to pin the map count is sketched after this list). However, I cannot 
rely on this arrangement: in a production cluster running at full capacity, if 
a machine dies, the mapreduce job in slotsQ will request new containers for the 
failed tasks, and because the cluster was already at capacity, those containers 
will end up pending and never run, recreating my original scenario.
# I initially wrote this up on 
https://groups.google.com/a/cloudera.org/forum/?fromgroups=#!topic/cdh-user/0zv62pkN5lM,
 so it would be good to update that group with the resolution.
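
For completeness, one way to pin the slotsQ job to exactly 5 map tasks (the 
workaround in item 2) is to drive the split count from a 5-line input file with 
NLineInputFormat. This sketch is illustrative and is not the original job:
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class FixedMapCountSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("mapreduce.job.queuename", "slotsQ");
    conf.setInt("mapreduce.map.memory.mb", 4096);

    Job job = Job.getInstance(conf, "slotsQ-5-maps");
    job.setJarByClass(FixedMapCountSketch.class);
    job.setMapperClass(Mapper.class);                      // identity mapper, placeholder work
    job.setNumReduceTasks(0);
    job.setInputFormatClass(NLineInputFormat.class);
    NLineInputFormat.setNumLinesPerSplit(job, 1);          // one map task per input line
    FileInputFormat.addInputPath(job, new Path(args[0]));  // a 5-line file -> 5 map tasks
    job.setOutputFormatClass(NullOutputFormat.class);
    job.waitForCompletion(true);
  }
}
{code}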

Configuration:

In yarn-site.xml:
{code}
  <property>
    <description>Scheduler plug-in class to use instead of the default 
scheduler.</description>
    <name>yarn.resourcemanager.scheduler.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
  </property>
{code}

fair-scheduler.xml:
{code}
<configuration>

<!-- Site specific FairScheduler configuration properties -->
  <property>
    <description>Absolute path to allocation file. An allocation file is an XML
    manifest describing queues and their properties, in addition to certain
    policy defaults. This file must be in XML format as described in
    http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html.
    </description>
    <name>yarn.scheduler.fair.allocation.file</name>
    <value>[obfuscated]/current/conf/site/default/hadoop/fair-scheduler-allocations.xml</value>
  </property>

  <property>
    <description>Whether to use preemption. Note that preemption is experimental
    in the current version. Defaults to false.</description>
    <name>yarn.scheduler.fair.preemption</name>
    <value>true</value>
  </property>

  <property>
    <description>Whether to allow multiple container assignments in one
    heartbeat. Defaults to false.</description>
    <name>yarn.scheduler.fair.assignmultiple</name>
    <value>true</value>
  </property>
  
</configuration>
{code}

My fair-scheduler-allocations.xml:
{code}
<allocations>

  <queue name="cjmQ">
    <!-- minimum amount of aggregate memory; TODO which units??? -->
    <minResources>2048</minResources>

    <!-- limit the number of apps from the queue to run at once -->
    <maxRunningApps>1</maxRunningApps>
    
    <!-- either "fifo" or "fair" depending on the in-queue scheduling policy
    desired -->
    <schedulingMode>fifo</schedulingMode>

    <!-- Number of seconds after which the pool can preempt other pools'
      tasks to achieve its min share. Requires preemption to be enabled in
      mapred-site.xml by setting mapred.fairscheduler.preemption to true.
      Defaults to infinity (no preemption). -->
    <minSharePreemptionTimeout>5</minSharePreemptionTimeout>

    <!-- Pool's weight in fair sharing calculations. Default is 1.0. -->
    <weight>1.0</weight>
  </queue>

  <queue name="slotsQ">
    <!-- minimum amount of aggregate memory; TODO which units??? -->
    <minResources>1</minResources>

    <!-- limit the number of apps from the queue to run at once -->
    <maxRunningApps>1</maxRunningApps>

    <!-- Number of seconds after which the pool can preempt other pools'
      tasks to achieve its min share. Requires preemption to be enabled in
      mapred-site.xml by setting mapred.fairscheduler.preemption to true.
      Defaults to infinity (no preemption). -->
    <minSharePreemptionTimeout>5</minSharePreemptionTimeout>

    <!-- Pool's weight in fair sharing calculations. Default is 1.0. -->
    <weight>1.0</weight>
  </queue>
  
  <!-- number of seconds a queue is under its fair share before it will try to
  preempt containers to take resources from other queues. -->
  <fairSharePreemptionTimeout>5</fairSharePreemptionTimeout>

</allocations>

{code}

> Fair Scheduler preemption fails if the other queue has a mapreduce job with 
> some tasks in excess of cluster capacity
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-5068
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5068
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2, scheduler
>         Environment: Mac OS X; CDH4.1.2
>            Reporter: Vitaly Kruglikov
>              Labels: hadoop
>
