[ https://issues.apache.org/jira/browse/HADOOP-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694524#action_12694524 ]

Vinod K V commented on HADOOP-4665:
-----------------------------------

Here's a more comprehensive review:

Major points:
 - As it's a new feature, I think preemption (mapred.fairscheduler.preemption) 
should be disabled (false) by default; see the config snippet after this list.
 - Can we do something like the above-stated proposal: preempt a task when a 
job's fair share/min share is not met within PREEMPTION_TIMEOUT, leaving 2-3 
heartbeats for the task to actually get killed (see the sketch after this 
list)?
 - DUMP_INTERVAL and PREEMPTION_INTERVAL should be configurable. The variables 
themselves can be package-private instead of public.
 - The class FairSchedulerEventLog can just be package-private, as can all the 
methods inside - init, log, shutdown and isEnabled; they don't need to be 
public as of now.
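
To illustrate the first point, the default could look like this in 
mapred-site.xml (the property name is the one the patch uses; false is the 
suggested default):

    <property>
      <name>mapred.fairscheduler.preemption</name>
      <value>false</value>
    </property>

And a rough sketch of the heartbeat-timing idea from the second point; all 
the names below are illustrative, not taken from the patch:

    // Hypothetical helper: trigger preemption early enough that 2-3
    // heartbeats remain before the timeout expires, so the chosen tasks
    // are actually killed by the deadline.
    static boolean shouldTriggerPreemption(long now, long starvationStart,
        long preemptionTimeout, long heartbeatInterval) {
      long deadline = starvationStart + preemptionTimeout;
      return now >= deadline - 3 * heartbeatInterval;
    }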

 - FairScheduler.preemptTasksIfNecessary() method:
   -- This method does a Collections.reverse(jobs) after sorting the jobs. We 
can just traverse the sorted list in reverse order to get the same effect 
(see the sketch after these points).
   -- The check for whether FairScheduler will do preemption 
(preemptionEnabled && !useFifo) is done deep inside: all the stats are 
calculated first, and only then is preemption skipped if not needed. Can we 
move this check, maybe, to the beginning of the preemptTasksIfNecessary() 
method or into the update() method itself?
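
A rough sketch of both points, assuming the sorted list is called jobs and 
the two flags are fields (the surrounding names are illustrative):

    protected void preemptTasksIfNecessary() {
      // Cheap flag check first, before any stats are computed.
      if (!preemptionEnabled || useFifo) {
        return;
      }
      // ... compute stats and sort jobs ...
      // Walk the sorted list backwards instead of reversing it in place.
      for (int i = jobs.size() - 1; i >= 0; i--) {
        JobInProgress job = jobs.get(i);
        // ... consider job for preemption ...
      }
    }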

 - FairScheduler.tasksToPreempt() method:
   -- The count tasksDueToFairShare seems to be calculated against the full 
fair share of slots rather than the advertised half of the fair share. I 
think this is a mistake, since isStarvedForFairShare() checks for half of the 
fair share. Or am I missing something? (See the sketch after these points.)
   -- The EventLogs in this method are a bit confusing when a job is short on 
both min share and fair share: they roughly give the impression that we are 
preempting twice. I think it would be clearer to log how many slots the job 
is short w.r.t. min share and w.r.t. fair share, and then, just before 
returning, log the exact number of slots we are going to preempt.
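
On the first point, a hedged sketch of what I would expect the calculation to 
be, so that it agrees with the half-share threshold in 
isStarvedForFairShare() (variable names are illustrative, not the patch's 
code):

    // Preempt only up to half the fair share, matching the threshold
    // used by isStarvedForFairShare().
    int halfFairShare = (int) Math.floor(fairShare / 2);
    int tasksDueToFairShare = Math.max(0, halfFairShare - runningTasks);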

 - FairScheduler.start() method:
   -- I think the call loadMgr.setEventLog(eventLog) should come after 
eventLog is initialized with a new FairSchedulerEventLog object (see the 
sketch below).
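
Something along these lines; the init() call and its arguments here are an 
assumption for illustration, not the patch's actual code:

    eventLog = new FairSchedulerEventLog();
    eventLog.init(conf, hostname);     // assumed initialization step
    loadMgr.setEventLog(eventLog);     // safe now: eventLog is non-null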

 - TestFairScheduler.java:
  -- You have added setup and cleanup task creation code in the initTasks() 
method. Is there any specific reason for doing this? In any case, much of it 
duplicates code from JobInProgress.initTasks(); if we really want it, we can 
refactor this code into a new method, say JobInProgress.createSpecialTasks().
  -- There are no new test cases related to preemption. I think we should 
have at least one.

Minor nits:
 - The Javadoc for getFairSharePreemptionTimeout() is incomplete. Also, if I 
am understanding correctly, FairSharePreemptionTimeout is the same for all 
pools/jobs.
 - The XML tag for minSharePreemptionTimeouts is currently preemptionTimeout; 
minSharePreemptionTimeout would be a better name (see the snippet after this 
list).
 - update thread: the log message "Failed to update fair share calculations" 
would be better as "Exception in update thread", because the update thread 
now does more than just update calculations.
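
For the tag rename, a pool entry in the allocations file would then read as 
below; the pool name and timeout value are made up for illustration:

    <pool name="pool_a">
      <minSharePreemptionTimeout>60</minSharePreemptionTimeout>
    </pool>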

> Add preemption to the fair scheduler
> ------------------------------------
>
>                 Key: HADOOP-4665
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4665
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>            Assignee: Matei Zaharia
>             Fix For: 0.21.0
>
>         Attachments: fs-preemption-v0.patch, hadoop-4665-v1.patch, 
> hadoop-4665-v1b.patch, hadoop-4665-v2.patch, hadoop-4665-v3.patch, 
> hadoop-4665-v4.patch
>
>
> Task preemption is necessary in a multi-user Hadoop cluster for two reasons: 
> users might submit long-running tasks by mistake (e.g. an infinite loop in a 
> map program), or tasks may be long due to having to process large amounts of 
> data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity 
> for certain queues, as well as a goal of providing good performance for 
> interactive jobs on average through fair sharing. Therefore, it will support 
> preempting under two conditions:
> 1) A job isn't getting its _guaranteed_ share of the cluster for at least T1 
> seconds.
> 2) A job is getting significantly less than its _fair_ share for T2 seconds 
> (e.g. less than half its share).
> T1 will be chosen smaller than T2 (and will be configurable per queue) to 
> meet guarantees quickly. T2 is meant as a last resort in case non-critical 
> jobs in queues with no guaranteed capacity are being starved.
> When deciding which tasks to kill to make room for the job, we will use the 
> following heuristics:
> - Look for tasks to kill only in jobs that have more than their fair share, 
> ordering these by deficit (most overscheduled jobs first).
> - For maps: kill tasks that have run for the least amount of time (limiting 
> wasted time).
> - For reduces: similar to maps, but give extra preference for reduces in the 
> copy phase where there is not much map output per task (at Facebook, we have 
> observed this to be the main time we need preemption - when a job has a long 
> map phase and its reducers are mostly sitting idle and filling up slots).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
