[ https://issues.apache.org/jira/browse/HADOOP-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12542122 ]
amareshwari edited comment on HADOOP-1900 at 11/13/07 8:43 PM:
----------------------------------------------------------------------------

With the patch attached, I ran sort benchmarks on a 390-node cluster and a 120-node cluster. The performance is almost the same as with trunk.

To simulate busyness at the JobTracker, I ran the sort benchmarks on the 120-node cluster with number of handlers = 4 and max queue size per handler = 10, but there are drops and lost task trackers both with and without the patch.

Thus the cluster size factor of (clusterSize/50+1) should be fine, but the busyFactor has to be better tuned.

I propose the following to tune busyFactor:

We have threshouldDropCount = clusterSize/10;
We increment busyFactor by HEARTBEAT_BUSY_FACTOR (say 2 seconds) for every 10% of the cluster size in drops.

  if (dropCount > threshouldDropCount) {
    busyFactor += (dropCount/threshouldDropCount)*HEARTBEAT_BUSY_FACTOR;
  }

For example, on a 100-node cluster, if we see 40 drops, busyFactor is incremented by 8 seconds (40/10*2).

If the JobTracker is not busy for 'observationInterval', we decrement busyFactor by HEARTBEAT_BUSY_FACTOR.

To calculate observationInterval: the JobTracker has 2 RPCs to process, i.e. heartbeat and task completion events. Let the processing time per RPC be 2 seconds. Then observationInterval is calculated as:

  observationInterval = (clusterSize/#handlers)*processingTime*2;

Assuming that we don't see drops during an observationInterval (at the corresponding busyFactor), we decrement busyFactor by HEARTBEAT_BUSY_FACTOR. This can be done in a loop until we see drops. When we see drops, we increment it by the constant HEARTBEAT_BUSY_FACTOR and stabilize there until we see drops again.

For example, on a 100-node cluster, we start with a 2-second heartbeat interval.
We see 40 drops; then busyFactor = 8, and new interval = (2+8) = 10;
We don't see drops for 40 seconds (the observationInterval with 10 handlers: 100/10*2*2); new interval = 10-2 = 8;
We don't see drops for 40 seconds; new interval = 8-2 = 6;
We don't see drops for 40 seconds; new interval = 6-2 = 4;
We see drops; then new interval = 4+2 = 6;
We don't see drops for a long time, say; we stabilize here.
Say we see 30 drops after some time; busyFactor is incremented by 6 (30/10*2), and new interval = 6+6 = 12;
And the loop repeats.

Thoughts?

> the heartbeat and task event queries interval should be set dynamically by the JobTracker
> -----------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1900
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1900
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Amareshwari Sri Ramadasu
>         Attachments: patch-1900.txt, patch-1900.txt
>
>
> The JobTracker should scale the intervals that the TaskTrackers use to contact it dynamically, based on how busy it is and the size of the cluster.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
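
The busyFactor tuning proposed in the comment above can be sketched in Java roughly as follows. This is only an illustration of the rules as described, not the code in patch-1900.txt. The class BusyFactorTuner, its method names, the 2-second BASE_INTERVAL (taken from the worked example), the use of integer division, and the guard that keeps the drop threshold at 1 or more on very small clusters are all assumptions made for this sketch. thresholdDropCount corresponds to threshouldDropCount in the comment.

```java
/** Hypothetical sketch of the proposed busyFactor tuning; not the attached patch. */
final class BusyFactorTuner {
    static final int HEARTBEAT_BUSY_FACTOR = 2; // seconds ("say 2secs" in the comment)
    static final int BASE_INTERVAL = 2;         // seconds; starting interval in the example (assumption)
    static final int PROCESSING_TIME = 2;       // seconds per RPC, as assumed in the comment
    static final int RPC_KINDS = 2;             // heartbeat and task completion events

    private final int clusterSize;
    private final int handlers;
    private int busyFactor = 0;                 // seconds added on top of BASE_INTERVAL

    BusyFactorTuner(int clusterSize, int handlers) {
        if (clusterSize <= 0 || handlers <= 0) {
            throw new IllegalArgumentException("clusterSize and handlers must be positive");
        }
        this.clusterSize = clusterSize;
        this.handlers = handlers;
    }

    /** clusterSize/10, kept at 1 or more so small clusters never divide by zero (assumption). */
    int thresholdDropCount() {
        return Math.max(1, clusterSize / 10);
    }

    /** observationInterval = (clusterSize/#handlers)*processingTime*2, using integer division. */
    int observationIntervalSeconds() {
        return (clusterSize / handlers) * PROCESSING_TIME * RPC_KINDS;
    }

    /** Called when dropCount dropped calls have been observed. */
    void onDrops(int dropCount) {
        int threshold = thresholdDropCount();
        if (dropCount > threshold) {
            // One HEARTBEAT_BUSY_FACTOR for every 10% of the cluster size in drops.
            busyFactor += (dropCount / threshold) * HEARTBEAT_BUSY_FACTOR;
        } else if (dropCount > 0) {
            // Fewer drops than the threshold: back off by the constant step only.
            busyFactor += HEARTBEAT_BUSY_FACTOR;
        }
    }

    /** Called when a whole observation interval has passed without drops. */
    void onQuietObservationInterval() {
        busyFactor = Math.max(0, busyFactor - HEARTBEAT_BUSY_FACTOR);
    }

    int heartbeatIntervalSeconds() {
        return BASE_INTERVAL + busyFactor;
    }
}
```

With new BusyFactorTuner(100, 10), calling onDrops(40), then onQuietObservationInterval() three times, then onDrops(5) and onDrops(30) walks through the same sequence of intervals as the example in the comment. The cluster size factor (clusterSize/50+1) is deliberately left out of the sketch, since the comment's example starts from a fixed 2-second interval.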