[ https://issues.apache.org/jira/browse/HADOOP-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matei Zaharia updated HADOOP-4665:
----------------------------------

    Attachment: hadoop-4665-v5.patch

Here's a new patch. It incorporates Vinod's comments, except for changing the 
event log format to key-value pairs. The event log in its current state is 
meant to be used only for debugging and has a relatively simple tab-separated 
format, so I didn't want to complicate its API. We can have another JIRA 
for adding a parser class for it and turning it into more than a debug tool if 
there's demand for that. It also adds five unit tests for preemption and a 
default config file for the fair scheduler. The patch also makes the update 
interval configurable (since I did that with the other two periodic check 
intervals).

Included in this patch is a fairly significant evolution of the fair scheduler 
unit testing framework, which adds tracking of tasks in FakeJobInProgress to 
allow for preemption to be tested meaningfully. The FakeJobInProgress and 
FakeTaskInProgress have some commonalities with the ones in the capacity 
scheduler, but unfortunately I wasn't able to use those directly because some 
of the classes used in the fair scheduler, such as Clock, don't have 
equivalents there. It would be nice to create a common testing framework for 
schedulers, but that should be a separate JIRA. I also think that the ultimate 
solution for that is not to make an elaborate FakeJobInProgress and related 
classes, but rather to make MiniMRCluster more user-friendly and switch all 
tests into it. We can also consider making the Clock class be used by all the 
MR code so that tests can run at an accelerated rate on these simulated 
clusters.
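
To illustrate the idea of running tests against an injectable clock, here is a 
minimal sketch in Python rather than the scheduler's Java. FakeClock and 
StarvationTimer are hypothetical names made up for this example and are not 
the classes in the patch; the only assumption is that time-dependent code 
reads the time from an object handed to it instead of from the system clock.

```python
class FakeClock:
    """Manually advanced clock for tests (hypothetical, not Hadoop's Clock)."""

    def __init__(self, start_ms=0):
        self._now_ms = start_ms

    def now_ms(self):
        return self._now_ms

    def advance(self, delta_ms):
        # Time only moves when the test says so, so no real waiting is needed.
        if delta_ms < 0:
            raise ValueError("a clock cannot go backwards")
        self._now_ms += delta_ms


class StarvationTimer:
    """Tracks how long a job has been starved, using an injected clock."""

    def __init__(self, clock):
        self._clock = clock
        self._starved_since_ms = None

    def update(self, is_starved):
        if not is_starved:
            self._starved_since_ms = None
        elif self._starved_since_ms is None:
            # Remember only the moment at which starvation began.
            self._starved_since_ms = self._clock.now_ms()

    def starved_for_ms(self):
        if self._starved_since_ms is None:
            return 0
        return self._clock.now_ms() - self._starved_since_ms


clock = FakeClock()
timer = StarvationTimer(clock)
timer.update(True)
clock.advance(60_000)          # "wait" one minute instantly
print(timer.starved_for_ms())  # 60000
```

Because the timer never consults the wall clock, a test can cover minutes of 
simulated scheduling in a few microseconds.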

One other thing I will need to add in this patch is documentation for the 
preemption params. However, the other changes can be reviewed right now.

> Add preemption to the fair scheduler
> ------------------------------------
>
>                 Key: HADOOP-4665
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4665
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>            Assignee: Matei Zaharia
>             Fix For: 0.21.0
>
>         Attachments: fs-preemption-v0.patch, hadoop-4665-v1.patch, 
> hadoop-4665-v1b.patch, hadoop-4665-v2.patch, hadoop-4665-v3.patch, 
> hadoop-4665-v4.patch, hadoop-4665-v5.patch
>
>
> Task preemption is necessary in a multi-user Hadoop cluster for two reasons: 
> users might submit long-running tasks by mistake (e.g. an infinite loop in a 
> map program), or tasks may be long due to having to process large amounts of 
> data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity 
> for certain queues, as well as a goal of providing good performance for 
> interactive jobs on average through fair sharing. Therefore, it will support 
> preempting under two conditions:
> 1) A job isn't getting its _guaranteed_ share of the cluster for at least T1 
> seconds.
> 2) A job is getting significantly less than its _fair_ share for T2 seconds 
> (e.g. less than half its share).
> T1 will be chosen smaller than T2 (and will be configurable per queue) to 
> meet guarantees quickly. T2 is meant as a last resort in case non-critical 
> jobs in queues with no guaranteed capacity are being starved.
> When deciding which tasks to kill to make room for the job, we will use the 
> following heuristics:
> - Look for tasks to kill only in jobs that have more than their fair share, 
> ordering these by deficit (most overscheduled jobs first).
> - For maps: kill tasks that have run for the least amount of time (limiting 
> wasted time).
> - For reduces: similar to maps, but give extra preference for reduces in the 
> copy phase where there is not much map output per task (at Facebook, we have 
> observed this to be the main time we need preemption - when a job has a long 
> map phase and its reducers are mostly sitting idle and filling up slots).
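
The two preemption conditions and the victim-selection heuristics quoted above 
can be sketched roughly as follows. This is an illustrative Python sketch, not 
the code in the patch: the Job and Task classes, their field names, and the 
rule that a job is never cut below its fair share are assumptions made for the 
example. Only the shortest-running-first heuristic for maps is shown; the 
extra preference for reduces in the copy phase is left out.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Task:
    task_id: str
    start_time: float  # seconds since some epoch


@dataclass
class Job:
    name: str
    running: List[Task]
    fair_share: float
    # Set by the caller when the job first drops below the given threshold,
    # and reset to None once it recovers (hypothetical bookkeeping fields).
    below_min_share_since: Optional[float] = None
    below_half_fair_share_since: Optional[float] = None


def _timed_out(since, now, timeout):
    return since is not None and now - since >= timeout


def should_preempt(job, now, t1, t2):
    """Condition 1 (guaranteed share, T1) or condition 2 (fair share, T2)."""
    return (_timed_out(job.below_min_share_since, now, t1)
            or _timed_out(job.below_half_fair_share_since, now, t2))


def pick_victims(jobs, needed):
    """Choose up to `needed` tasks to kill from jobs above their fair share."""
    over = [j for j in jobs if len(j.running) > j.fair_share]
    # Most overscheduled jobs first.
    over.sort(key=lambda j: len(j.running) - j.fair_share, reverse=True)
    victims = []
    for job in over:
        # Assumption: never take a job below its fair share.
        surplus = math.floor(len(job.running) - job.fair_share)
        # Most recently started tasks first, to limit wasted work.
        youngest_first = sorted(job.running, key=lambda t: t.start_time,
                                reverse=True)
        for task in youngest_first[:surplus]:
            if len(victims) == needed:
                return victims
            victims.append(task)
    return victims
```

A caller would first check should_preempt for each starved job, decide how 
many slots that job is owed, and then pass that count to pick_victims.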

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
