[ https://issues.apache.org/jira/browse/MAPREDUCE-936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12749020#action_12749020 ]

Matei Zaharia commented on MAPREDUCE-936:
-----------------------------------------

Hi Zheng,

For issue 1, the provided patch looks good. It might be nice to add a unit test 
for it though.

For issue 2, I believe the implementation of locality waits in MAPREDUCE-706 
has solved the problem. In that implementation, once a job has launched a 
non-local task, it can keep launching non-local tasks right away, without 
further waits. However, if it later manages to launch a local task again, it 
must wait again before launching non-local tasks. The reasoning is that the 
job may simply have been unlucky earlier and may still have lots of tasks left 
to launch, and we don't want it to stay stuck at the non-local level.

I think the locality wait code you guys are running at Facebook is much older 
than the version in MAPREDUCE-706, so it would be nice if you could move to 
the MAPREDUCE-706 code when you upgrade Hadoop in general. I believe it would 
not be too difficult to port the trunk version of the fair scheduler to 0.20 
and pick up all the architectural changes and improvements from 706 in the 
process.

Matei

> Allow a load difference in fairshare scheduler
> ----------------------------------------------
>
>                 Key: MAPREDUCE-936
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-936
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: contrib/fair-share
>            Reporter: Zheng Shao
>         Attachments: MAPREDUCE-936.1.patch
>
>
> The problem we are facing: It takes a long time for all tasks of a job to get 
> scheduled on the cluster, even if the cluster is almost empty.
> There are two reasons that together lead to this situation:
> 1. The load factor makes sure each TT runs the same number of tasks. (This is 
> the part that this patch tries to change).
> 2. The scheduler tries to schedule map tasks locally (first node-local, then 
> rack-local). There is a wait time per locality level 
> (mapred.fairscheduler.localitywait.node and 
> mapred.fairscheduler.localitywait.rack, both around 10 sec in our conf) and 
> an accumulated wait time (JobInfo.localityWait). The accumulated wait time 
> is reset to 0 whenever a non-local map task is scheduled, so it takes 
> N * wait_time to schedule N non-local map tasks (rough numbers below).
> Because of 1, many TTs will not be able to take more tasks even if they have 
> free slots. As a result, many of the map tasks cannot be scheduled locally.
> Because of 2, it's really hard to schedule a non-local task.
> As a result, sometimes we are seeing that it takes more than 2 minutes to 
> schedule all the mappers of a job.
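
To put rough numbers on the interaction described above (the ~10 sec wait and 
the "more than 2 minutes" figure come from the description; the task count is 
only illustrative): if a job ends up having to place, say, 13 map tasks 
non-locally, the reset of the accumulated wait means roughly 13 * 10 sec = 
130 sec of waiting for those tasks alone, which already exceeds 2 minutes even 
on a nearly empty cluster.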

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
