RPC queue overload of JobTracker
--------------------------------

                 Key: HADOOP-3813
                 URL: https://issues.apache.org/jira/browse/HADOOP-3813
             Project: Hadoop Core
          Issue Type: Bug
          Components: mapred
    Affects Versions: 0.17.1
            Reporter: Christian Kunz


On a cluster with about 1700 nodes, when a job with about 100,000 maps and 
10,000 reduces completed, the JobTracker, even with 80 handlers, could not 
handle the RPC call load during promotion of the job. Heartbeats were 
discarded, and by the end the JobTracker had lost nearly all TaskTrackers 
(about 10 were left). Promotion took more than 40 minutes.
The TaskTrackers reconnected and everything recovered, but this might have 
been just luck. Shouldn't there be adaptive throttling of the rate of 
heartbeats and TaskCompletionEvents?
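
To make the throttling question concrete, here is a minimal sketch of one 
possible policy: the server doubles the heartbeat interval it asks clients to 
use when its call queue is nearly full, and shortens it gradually once the 
queue drains. This is not JobTracker code. The class name, the watermarks, and 
the interval bounds are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatThrottle:
    """Hypothetical server-side controller for the client heartbeat interval.

    All defaults below are illustrative assumptions, not Hadoop settings.
    """
    min_interval_ms: int = 3_000    # assumed floor for the interval
    max_interval_ms: int = 60_000   # assumed ceiling for the interval
    high_watermark: float = 0.75    # queue fill ratio that triggers back-off
    low_watermark: float = 0.25     # queue fill ratio that allows speed-up
    step_down_ms: int = 500         # additive decrease once load is low
    interval_ms: int = 3_000        # current interval handed to clients

    def update(self, queued_calls: int, queue_capacity: int) -> int:
        """Adjust the interval from the current call queue occupancy."""
        if queue_capacity <= 0:
            raise ValueError("queue_capacity must be positive")
        fill = queued_calls / queue_capacity
        if fill >= self.high_watermark:
            # Queue is close to overflowing: back off multiplicatively.
            self.interval_ms = min(self.interval_ms * 2, self.max_interval_ms)
        elif fill <= self.low_watermark:
            # Queue has drained: recover slowly to avoid oscillation.
            self.interval_ms = max(self.interval_ms - self.step_down_ms,
                                   self.min_interval_ms)
        # Between the watermarks the interval is left unchanged.
        return self.interval_ms
```

Under this sketch the server would call `update` periodically and return the 
result to each client with its next response, so that load drops before calls 
have to be discarded. The same idea could apply to the polling rate for 
TaskCompletionEvents.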

Sample messages:
2008-07-22 18:21:55,831 WARN org.apache.hadoop.ipc.Server: Call queue overflow 
discarding oldest call heartbeat([EMAIL PROTECTED], false, true, 18137) from xxx
2008-07-22 18:21:55,834 WARN org.apache.hadoop.ipc.Server: Call queue overflow 
discarding oldest call getTaskCompletionEvents(job_200807190635_0012, 119567, 
50) from yyy
...
2008-07-22 19:02:28,821 WARN org.apache.hadoop.ipc.Server: IPC Server handler 1 
on 9020, call heartbeat([EMAIL PROTECTED], false, true, 18199) from zzz: 
discarded for being too old (40936)
2008-07-22 19:02:28,821 WARN org.apache.hadoop.ipc.Server: IPC Server handler 
34 on 9020, call getTaskCompletionEvents(job_200807190635_0012, 119567, 50) 
from uuu: discarded for being too old (40978)


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
