[ 
https://issues.apache.org/jira/browse/HADOOP-5632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12696289#action_12696289
 ] 

Khaled Elmeleegy commented on HADOOP-5632:
------------------------------------------

Methodology: I ran an experiment studying slot utilization in a Hadoop
cluster under heavy load. For the workload, I used three large web-scan jobs
from gridmix with uncompressed input. On my experimental setup, the map
runtime was about 20 seconds. Each job had 16,000 maps. The block size was
configured to be 128MB, and I used Hadoop 0.20. Each node was configured with
three map slots and two reduce slots. I varied the inter-heartbeat interval
and measured the corresponding slot utilization. I measured slot utilization
by plotting a timeline of the number of slots in use on one of the workers
(tasktrackers), using that tasktracker's log. One tasktracker should be
representative of all the rest, since I am using a single cluster and a
single type of job, and the number of maps in one job exceeds the total
number of slots available.
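
For reference, this is roughly the computation behind the utilization numbers
below, as a minimal sketch in Java. It assumes the tasktracker log has already
been parsed into (timestamp, slots-in-use) samples; the class and field names
are illustrative and are not anything in the Hadoop code base.

    import java.util.List;

    class SlotUtilization {
      // Each sample is {timestampMillis, slotsInUse}, extracted from the
      // tasktracker log and sorted by timestamp. totalSlots is the configured
      // number of map slots per node (three in this experiment).
      static double compute(List<long[]> samples, int totalSlots) {
        long busy = 0, capacity = 0;
        for (int i = 1; i < samples.size(); i++) {
          long dt = samples.get(i)[0] - samples.get(i - 1)[0];
          busy += dt * samples.get(i - 1)[1];  // slot-milliseconds actually used
          capacity += dt * totalSlots;         // slot-milliseconds available
        }
        return capacity == 0 ? 0.0 : (double) busy / capacity;
      }
    }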

Results:

For a 10-second inter-heartbeat interval, slot utilization was 62.4%.
For a 1-second inter-heartbeat interval, slot utilization was 93.7%.
For a 100-millisecond inter-heartbeat interval, slot utilization was 99%.


Comments: For a fixed inter-heartbeat interval, the longer the map runtime,
the higher the utilization.
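
As a rough sanity check (my own back-of-the-envelope model, not a
measurement): if each freed slot waits on the order of one inter-heartbeat
interval I before it is offered a new map of runtime T, utilization comes out
around T / (T + I). With T = 20s that gives roughly 67%, 95%, and 99.5% for
I = 10s, 1s, and 100ms, which is in the same ballpark as the measured numbers
above, and it also shows why longer maps hide the fixed per-task wait better.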

Explanation:

Reducing the inter-heartbeat interval to these minuscule values is
unacceptable, because in a big cluster the jobtracker would be overwhelmed
with heartbeats. What should happen instead is that tasktrackers send a
heartbeat to the jobtracker asking for a new task as soon as a slot opens up,
rather than waiting for the next periodic heartbeat. This doesn't happen
today because, with the current heartbeat handling, it could overload the
jobtracker with heartbeats.
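
To illustrate the on-demand idea on the tasktracker side (a minimal sketch of
the mechanism, not the attached patch; the class and method names are
hypothetical):

    class HeartbeatTrigger {
      private final Object lock = new Object();
      private boolean slotFreed = false;

      // Called from the task-cleanup path whenever a map or reduce slot frees up.
      void onSlotFreed() {
        synchronized (lock) {
          slotFreed = true;
          lock.notify();
        }
      }

      // Heartbeat loop: sleep at most heartbeatIntervalMs, but wake early if a
      // slot frees, so the next heartbeat (and task request) goes out immediately
      // instead of waiting out the rest of the polling interval.
      void waitForNextHeartbeat(long heartbeatIntervalMs) throws InterruptedException {
        synchronized (lock) {
          if (!slotFreed) {
            lock.wait(heartbeatIntervalMs);
          }
          slotFreed = false;
        }
      }
    }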

I found that not only is the heartbeat handler at the jobtracker protected by
a giant lock, which prevents the jobtracker from utilizing more than one CPU
on the machine it's running on, but also that the time spent inside this lock
is not constant; it is proportional to the load on the system. For example,
the heartbeat handler in the jobtracker calls functions like
getTasksToKill(). This function scans the list of tasks on the tasktracker to
check whether any of them should be killed. This list can be arbitrarily
long, since it tracks the maps of running jobs that have run, or are still
running, on this tasktracker. This is not scalable.
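
The pattern, reduced to a toy example (this is an illustration of the locking
and scan cost, not the actual JobTracker code):

    import java.util.*;

    class CoarselyLockedTracker {
      private final Map<String, List<String>> tasksByTracker = new HashMap<>();

      // The whole heartbeat runs inside one coarse lock, so no two heartbeats
      // are processed in parallel, and part of the work done while holding
      // that lock is an O(n) scan whose length grows with the cluster load.
      synchronized void heartbeat(String trackerName) {
        List<String> tasks =
            tasksByTracker.getOrDefault(trackerName, Collections.emptyList());
        for (String taskId : tasks) {
          // decide whether taskId should be killed ...
        }
        // ... then assign new tasks, update bookkeeping, etc.
      }
    }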

In the above experiment, I observed that the runtime of a single heartbeat
handler invocation can go up to 3 ms, and about half of this time is spent in
getTasksToKill().

Solution:

1. Break the heartbeat handler into two kinds of handlers: a lightweight one,
called on demand by the tasktracker whenever a new slot opens up, and a
heavyweight duty-cycle handler, called periodically at a fairly low frequency
to do periodic work such as killing tasks and jobs on the tasktracker. As
soon as a new slot opens up at a tasktracker, it goes directly to the
jobtracker asking for a new task, which invokes the lightweight handler.
Periodically, tasktrackers call the jobtracker for a duty cycle. (See the
sketch after this list.)
2. Use finer-grained locking in the heartbeat handler, so the jobtracker can
utilize more CPUs on its machine to help with the heavy load.
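
A rough sketch of what that split might look like at the jobtracker's RPC
boundary (the interface and method names here are illustrative and are not
the ones used in the attached patch):

    import java.util.List;

    interface SplitHeartbeatProtocol {
      // Lightweight, on-demand handler: invoked as soon as a slot frees. It only
      // reports completed task attempts and asks for new work, so it should stay
      // cheap and hold the jobtracker lock for as short a time as possible.
      List<String> requestTasks(String trackerName, List<String> completedTaskIds);

      // Heavyweight, low-frequency duty-cycle handler: full status report plus the
      // expensive bookkeeping (tasks and jobs to kill, cleanup actions, ...).
      List<String> dutyCycle(String trackerName, String fullStatusReport);
    }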

Attached is a patch that creates these lightweight and heavyweight heartbeat
handlers. Using this patch, I was able to get 99+% slot utilization and
reduce the heartbeat runtime to 1 ms.

To do:
1- The lightweight handler can get lighter. Currently, it does two main
things: processHeartbeat(..) and assignTasks(..). processHeartbeat(..) takes
more than half a millisecond, and I think most of it is not required in the
lightweight handler; the only thing needed there is task-completion
notification, so the rest should be stripped out. This could cut the
lightweight handler's runtime roughly in half.
2- Revisit the locking of the heartbeat handler.

> Jobtracker leaves tasktrackers underutilized
> --------------------------------------------
>
>                 Key: HADOOP-5632
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5632
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.18.0, 0.18.1, 0.18.2, 0.18.3, 0.19.0, 0.19.1, 0.20.0
>         Environment: 2x HT 2.8GHz Intel Xeon, 3GB RAM, 4x 250GB HD linux 
> boxes, 100 node cluster
>            Reporter: Khaled Elmeleegy
>             Fix For: 0.20.0
>
>
> For some workloads, the jobtracker doesn't keep all the slots utilized even 
> under heavy load.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
