[ https://issues.apache.org/jira/browse/HADOOP-3813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12624684#action_12624684 ]
Hudson commented on HADOOP-3813:
--------------------------------

Integrated in Hadoop-trunk #581 (See [http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/581/])

> RPC queue overload of JobTracker
> --------------------------------
>
>                 Key: HADOOP-3813
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3813
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.17.1
>            Reporter: Christian Kunz
>            Assignee: Amareshwari Sriramadasu
>             Fix For: 0.17.2
>
>         Attachments: patch-3813-0.17.txt, patch-3813-1.txt, patch-3813.txt
>
>
> On a cluster with about 1700 nodes, when a job with about 100,000 maps and
> 10,000 reduces completed, the JobTracker, even with 80 handlers, could not
> handle the RPC call load during promotion of the job, such that at the end,
> because of the discarded heartbeats, the JobTracker lost nearly all
> TaskTrackers (about 10 TaskTrackers left). Promotion took more than 40
> minutes.
> They reconnected and everything recovered, but this might have been just luck.
> Shouldn't there be an adaptive throttling of the rate in heartbeats and
> TaskCompletionEvents?
> Sample messages:
> 2008-07-22 18:21:55,831 WARN org.apache.hadoop.ipc.Server: Call queue
> overflow discarding oldest call heartbeat([EMAIL PROTECTED], false, true,
> 18137) from xxx
> 2008-07-22 18:21:55,834 WARN org.apache.hadoop.ipc.Server: Call queue overflow
> discarding oldest call getTaskCompletionEvents(job_200807190635_0012, 119567,
> 50) from yyy
> ...
> 2008-07-22 19:02:28,821 WARN org.apache.hadoop.ipc.Server: IPC Server handler
> 1 on 9020, call heartbeat([EMAIL PROTECTED], false, true, 18199) from zzz:
> discarded for being too old (40936)
> 2008-07-22 19:02:28,821 WARN org.apache.hadoop.ipc.Server: IPC Server handler
> 34 on 9020, call getTaskCompletionEvents(job_200807190635_0012, 119567, 50)
> from uuu: discarded for being too old (40978)

--
This message is automatically generated by JIRA.
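
For illustration only, the "adaptive throttling of the rate in heartbeats" asked about in the issue description could look roughly like the sketch below. It is not taken from any of the attached patches and is not Hadoop's actual behaviour. The class name AdaptiveHeartbeat, the method nextIntervalMs, the 5 s floor, the 60 s ceiling, and the doubling and 10% back-off factors are all assumptions chosen for the example. The idea is that the interval handed back to TaskTrackers doubles whenever calls were discarded from the RPC queue in the last observation window, and shrinks slowly once no calls are being dropped.

```java
/**
 * Hypothetical sketch of adaptive heartbeat throttling.
 * Not Hadoop code: names, bounds, and factors are illustrative assumptions.
 */
final class AdaptiveHeartbeat {
    // Assumed bounds for the heartbeat interval, in milliseconds.
    static final long MIN_INTERVAL_MS = 5_000;
    static final long MAX_INTERVAL_MS = 60_000;

    private AdaptiveHeartbeat() {}

    /**
     * Computes the next heartbeat interval from the current one.
     * If any calls were dropped in the last window, the interval doubles;
     * otherwise it shrinks by 10%. The result is clamped to the bounds above.
     */
    static long nextIntervalMs(long currentMs, int droppedCalls) {
        long next = droppedCalls > 0
                ? currentMs * 2
                : currentMs - currentMs / 10;
        return Math.max(MIN_INTERVAL_MS, Math.min(MAX_INTERVAL_MS, next));
    }

    public static void main(String[] args) {
        long interval = MIN_INTERVAL_MS;
        // Made-up counts of discarded calls per observation window.
        int[] droppedPerWindow = {0, 12, 40, 7, 0, 0};
        for (int dropped : droppedPerWindow) {
            interval = nextIntervalMs(interval, dropped);
            System.out.println("dropped=" + dropped + " -> interval=" + interval + " ms");
        }
    }
}
```

Backing off multiplicatively and recovering slowly is one common way to keep a queue from oscillating between overflow and idle. Whether it suits the JobTracker would depend on how quickly lost-tracker detection has to react, which this sketch does not model.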