RPC queue overload of JobTracker
--------------------------------
Key: HADOOP-3813
URL: https://issues.apache.org/jira/browse/HADOOP-3813
Project: Hadoop Core
Issue Type: Bug
Components: mapred
Affects Versions: 0.17.1
Reporter: Christian Kunz
On a cluster with about 1700 nodes, a job with about 100,000 maps and 10,000
reduces completed. Even with 80 handlers, the JobTracker could not handle the
RPC call load during promotion of the job. Because heartbeats were discarded,
the JobTracker lost nearly all TaskTrackers (about 10 TaskTrackers were left).
Promotion took more than 40 minutes.
The TaskTrackers reconnected and everything recovered, but this might have been
just luck.
Shouldn't there be adaptive throttling of the rate of heartbeats and
TaskCompletionEvents calls?
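To make the throttling question concrete, here is a minimal sketch of one
possible policy. It assumes the JobTracker can see how full its RPC call queue
is and can tell TaskTrackers how long to wait before the next heartbeat. The
class name, the watermarks, and the interval bounds are all hypothetical and
are not taken from Hadoop. The interval doubles while the queue is nearly full
and shrinks by a fixed step once the queue has drained.

```java
/** Hypothetical sketch of adaptive heartbeat throttling; not Hadoop API. */
final class AdaptiveHeartbeat {
    // Assumed bounds for the heartbeat interval, in milliseconds.
    static final long MIN_INTERVAL_MS = 3_000;
    static final long MAX_INTERVAL_MS = 60_000;

    // Assumed queue-fill thresholds that trigger backing off or speeding up.
    static final double HIGH_WATERMARK = 0.75;
    static final double LOW_WATERMARK = 0.25;

    // Assumed step by which the interval shrinks when the queue is quiet.
    static final long DECREASE_STEP_MS = 1_000;

    /**
     * Returns the interval a TaskTracker should wait before its next
     * heartbeat, given the current interval and the call queue occupancy.
     */
    static long nextIntervalMs(long currentMs, int queuedCalls, int queueCapacity) {
        if (queueCapacity <= 0) {
            throw new IllegalArgumentException("queueCapacity must be positive");
        }
        double fill = (double) queuedCalls / queueCapacity;
        long next;
        if (fill >= HIGH_WATERMARK) {
            next = currentMs * 2;                // back off quickly under load
        } else if (fill <= LOW_WATERMARK) {
            next = currentMs - DECREASE_STEP_MS; // recover slowly when idle
        } else {
            next = currentMs;                    // hold steady in between
        }
        // Clamp to the assumed bounds.
        return Math.max(MIN_INTERVAL_MS, Math.min(MAX_INTERVAL_MS, next));
    }
}
```

With a policy like this, the load from 1700 TaskTrackers would fall as soon as
the queue approached overflow, instead of calls being discarded after they had
already aged in the queue. The same idea could apply to the polling rate of
getTaskCompletionEvents.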
Sample messages:
2008-07-22 18:21:55,831 WARN org.apache.hadoop.ipc.Server: Call queue overflow
discarding oldest call heartbeat([EMAIL PROTECTED], false, true, 18137) from xxx
2008-07-22 18:21:55,834 WARN org.apache.hadoop.ipc.Server: Call queue overflow
discarding oldest call getTaskCompletionEvents(job_200807190635_0012, 119567,
50) from yyy
...
2008-07-22 19:02:28,821 WARN org.apache.hadoop.ipc.Server: IPC Server handler 1
on 9020, call heartbeat([EMAIL PROTECTED], false, true, 18199) from zzz:
discarded for being too old (40936)
2008-07-22 19:02:28,821 WARN org.apache.hadoop.ipc.Server: IPC Server handler
34 on 9020, call getTaskCompletionEvents(job_200807190635_0012, 119567, 50)
from uuu: discarded for being too old (40978)