[jira] Issue Comment Edited: (HADOOP-1900) the heartbeat and task event queries interval should be set dynamically by the JobTracker

Amareshwari Sri Ramadasu (JIRA) Tue, 13 Nov 2007 20:46:05 -0800

    [ 
https://issues.apache.org/jira/browse/HADOOP-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12542122
 ]


amareshwari edited comment on HADOOP-1900 at 11/13/07 8:43 PM:
----------------------------------------------------------------------------

With the patch attached, I ran sort benchmarks on a 390-node cluster and a 
120-node cluster. The performance is almost the same as with the trunk.

To simulate busyness at the job tracker, I ran the sort benchmarks on the 
120-node cluster with number of handlers = 4 and max queue size per 
handler = 10, but there are drops and lost task trackers both with and 
without the patch.

Thus the cluster size factor of (clusterSize/50+1) should be fine, but the 
busyFactor needs to be tuned better.

I propose the following to tune busy factor:

We have threshouldDropCount = clusterSize/10;

We increment busyFactor by HEARTBEAT_BUSY_FACTOR (say 2 seconds) for every 
10% of the cluster size in drops:
if (dropCount > threshouldDropCount) {
  busyFactor += (dropCount/threshouldDropCount)*HEARTBEAT_BUSY_FACTOR;
}

For example, on a 100 node cluster, if we see 40 drops, busyFactor is 
incremented by 8 seconds (40/10*2).
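
A minimal Java sketch of the increment rule above, only to make the 
arithmetic concrete. The class and method names are hypothetical and not 
taken from the patch; the guard against a zero threshold on clusters smaller 
than 10 nodes is an assumption that the proposal does not discuss.

```java
final class BusyFactorSketch {
    /** Seconds added per 10% of the cluster that dropped ("say 2secs" in the proposal). */
    static final int HEARTBEAT_BUSY_FACTOR = 2;

    /** Returns how many seconds to add to busyFactor for the observed drop count. */
    static int busyFactorIncrement(int dropCount, int clusterSize) {
        // Assumption: clamp the threshold to 1 so small clusters do not divide by zero.
        int threshouldDropCount = Math.max(1, clusterSize / 10);
        if (dropCount > threshouldDropCount) {
            // Integer division: only whole multiples of the threshold count.
            return (dropCount / threshouldDropCount) * HEARTBEAT_BUSY_FACTOR;
        }
        return 0;
    }

    public static void main(String[] args) {
        System.out.println(busyFactorIncrement(40, 100)); // prints 8
        System.out.println(busyFactorIncrement(25, 100)); // prints 4
        System.out.println(busyFactorIncrement(10, 100)); // prints 0 (not above the threshold)
    }
}
```

Because the division is integral, 25 drops on a 100 node cluster count as two 
threshold multiples, not 2.5.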

If the job tracker is not busy for 'observationInterval', then we decrement 
busyFactor by HEARTBEAT_BUSY_FACTOR.

To calculate observationInterval: 
We have 2 RPCs per task tracker to be processed at the job tracker, i.e. 
heartbeat and task completion events. Let the processing time per RPC be 
2 seconds. 
Then observationInterval is calculated as:
observationInterval = (clusterSize/#handlers)*processingTime*2;
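
The formula can be checked with a small Java sketch. The names are 
hypothetical. The handler count of 10 in the first call is an assumption: the 
comment does not state it for the 100 node example, but 10 handlers reproduce 
the 40 second observation window used there.

```java
final class ObservationIntervalSketch {
    /** Heartbeat plus task completion events, as stated in the proposal. */
    static final int RPCS_PER_TRACKER = 2;

    /** observationInterval = (clusterSize / #handlers) * processingTime * 2, in seconds. */
    static int observationIntervalSeconds(int clusterSize, int handlers, int processingTimeSeconds) {
        return (clusterSize / handlers) * processingTimeSeconds * RPCS_PER_TRACKER;
    }

    public static void main(String[] args) {
        // Assumed 10 handlers, 2 seconds per RPC.
        System.out.println(observationIntervalSeconds(100, 10, 2)); // prints 40
        // The 120 node benchmark setup with 4 handlers.
        System.out.println(observationIntervalSeconds(120, 4, 2));  // prints 120
    }
}
```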

If we don't see drops during an observationInterval (at the corresponding 
busyFactor), we decrement the busyFactor by HEARTBEAT_BUSY_FACTOR. This can 
be done in a loop until we see drops. When we see drops, we increment 
busyFactor by the constant HEARTBEAT_BUSY_FACTOR and stabilize there until 
we see drops again.

For example, on a 100 node cluster, we start with a 2 second heartbeat 
interval.
We see 40 drops; then busyFactor = 8; new interval = 2+8 = 10;
We don't see drops for 40 seconds; new interval = 10-2 = 8;
We don't see drops for 40 seconds; new interval = 8-2 = 6;
We don't see drops for 40 seconds; new interval = 6-2 = 4;
We see drops; then new interval = 4+2 = 6;
We don't see drops for a long time, say; we stabilize here.
Say we see 30 drops after some time; busyFactor is incremented by 6; new 
interval = 6+6 = 12;
And the loop repeats.
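
The walk-through above can be replayed with the following Java sketch. It is 
one possible reading of the proposal, not the patch itself: all names are 
hypothetical, the 'probing' flag is an assumed way to model "we stabilize 
here", and the drop count of 5 for the small drop event is made up, since the 
example does not give a number for it.

```java
final class HeartbeatBackoffSketch {
    static final int HEARTBEAT_BUSY_FACTOR = 2; // seconds
    static final int BASE_INTERVAL = 2;         // seconds, starting heartbeat interval

    private final int threshouldDropCount;
    private int busyFactor = 0;
    // Assumption: true while we are still allowed to step the interval down.
    private boolean probing = true;

    HeartbeatBackoffSketch(int clusterSize) {
        this.threshouldDropCount = Math.max(1, clusterSize / 10);
    }

    /** Drops were observed; returns the new heartbeat interval in seconds. */
    int onDrops(int dropCount) {
        if (dropCount > threshouldDropCount) {
            // Many drops: back off proportionally, then resume probing downwards.
            busyFactor += (dropCount / threshouldDropCount) * HEARTBEAT_BUSY_FACTOR;
            probing = true;
        } else {
            // Few drops: take one constant step up and hold there.
            busyFactor += HEARTBEAT_BUSY_FACTOR;
            probing = false;
        }
        return interval();
    }

    /** A full observationInterval passed without drops; returns the new interval. */
    int onQuietInterval() {
        if (probing) {
            busyFactor = Math.max(0, busyFactor - HEARTBEAT_BUSY_FACTOR);
        }
        return interval();
    }

    int interval() {
        return BASE_INTERVAL + busyFactor;
    }

    public static void main(String[] args) {
        HeartbeatBackoffSketch s = new HeartbeatBackoffSketch(100);
        System.out.println(s.onDrops(40));       // prints 10
        System.out.println(s.onQuietInterval()); // prints 8
        System.out.println(s.onQuietInterval()); // prints 6
        System.out.println(s.onQuietInterval()); // prints 4
        System.out.println(s.onDrops(5));        // prints 6, then we hold
        System.out.println(s.onQuietInterval()); // prints 6
        System.out.println(s.onDrops(30));       // prints 12
    }
}
```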

Thoughts?

> the heartbeat and task event queries interval should be set dynamically by 
> the JobTracker
> -----------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1900
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1900
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Amareshwari Sri Ramadasu
>         Attachments: patch-1900.txt, patch-1900.txt
>
>
> The JobTracker should scale the intervals that the TaskTrackers use to 
> contact it dynamically, based on how busy it is and the size of the 
> cluster.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
