[ 
https://issues.apache.org/jira/browse/HADOOP-4558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12645402#action_12645402
 ] 

Amar Kamat commented on HADOOP-4558:
------------------------------------

Looks like the structure for running tasks is maintained only if speculation is 
_ON_. Hence, with speculation turned _OFF_, we don't see any tasks getting killed. 
We have 3 options here:
1. Maintain a list of running tasks per job in the capacity scheduler and use 
that to kill tasks instead. The drawbacks of this approach are:
   - The scheduler will do the same bookkeeping as is done by JIP 
     (JobInProgress).
   - The scheduler now needs to know about task completions.

2. Maintain the list of running tasks irrespective of speculation. The only 
drawback of this approach is that this will modify the (framework) code path 
for jobs with speculation turned _OFF_ and hence will require benchmarking.

3. For jobs with speculation turned _OFF_, we walk over the map structure, find 
out the least progressed maps and kill them. The benefit of this approach is 
that the framework code remains unchanged and there is no code duplication. The 
drawback is that this approach does a linear scan every time.
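
To make option 3 concrete, here is a rough sketch of the scan. It is not 
framework code: TaskInfo and pickTasksToKill are hypothetical stand-ins for the 
per-job map structure and the scheduler's kill logic, and the progress and 
running fields are assumed. It walks all maps once, keeps the running ones, 
and picks the least progressed first (the sort over the running subset is an 
extra cost on top of the linear walk).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class LeastProgressedScan {
    /** Hypothetical stand-in for one map task of a job; not a Hadoop class. */
    static final class TaskInfo {
        final String id;
        final double progress;   // assumed range: 0.0 to 1.0
        final boolean running;

        TaskInfo(String id, double progress, boolean running) {
            this.id = id;
            this.progress = progress;
            this.running = running;
        }
    }

    /**
     * Walks every map of the job (the linear scan), keeps only the running
     * ones, and returns the ids of up to 'count' least progressed tasks.
     */
    static List<String> pickTasksToKill(List<TaskInfo> maps, int count) {
        List<TaskInfo> running = new ArrayList<>();
        for (TaskInfo t : maps) {
            if (t.running) {
                running.add(t);
            }
        }
        // Least progressed first, so the least work is thrown away.
        running.sort(Comparator.comparingDouble((TaskInfo t) -> t.progress));

        List<String> victims = new ArrayList<>();
        for (int i = 0; i < Math.min(count, running.size()); i++) {
            victims.add(running.get(i).id);
        }
        return victims;
    }

    public static void main(String[] args) {
        List<TaskInfo> maps = new ArrayList<>();
        maps.add(new TaskInfo("m0", 0.9, true));
        maps.add(new TaskInfo("m1", 0.4, true));
        maps.add(new TaskInfo("m2", 0.1, true));
        maps.add(new TaskInfo("m3", 0.05, false)); // not running, skipped
        System.out.println(pickTasksToKill(maps, 2)); // prints [m2, m1]
    }
}
```

The scan cost grows with the total number of maps in the job, not with the 
number of running ones, which is the drawback noted above.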

Thoughts?

I am still investigating why the reclaim didn't happen as expected.

> Scheduler fails to reclaim capacity if Jobs are submitted to queue one after 
> the other
> --------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4558
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4558
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/capacity-sched
>    Affects Versions: 0.19.0
>         Environment: Cluster Capacity Maps=Reduces =210 each
> Two Queues: 
> Q1:  default, GC (%) =40, GC=84 (Maps and Reduces each). Reclaim time = 3 
> mins.
> Q2: test_q1, GC (%) =60, GC=126 (Maps and Reduces each) Reclaim time = 2 mins
>            Reporter: Karam Singh
>         Attachments: 4558.1.patch
>
>
> Scheduler fails to reclaim capacity if Jobs are submitted to queue one after 
> the other.
> First job submitted with tasks equal to cluster's M/R Capacity
> Second is submitted to different queue when all tasks of First Job are 
> running, scheduler fails to reclaim capacity for second job.
