[ https://issues.apache.org/jira/browse/HADOOP-3813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12616417#action_12616417 ]
acmurthy edited comment on HADOOP-3813 at 7/24/08 2:43 AM:
----------------------------------------------------------------
+1. This patch looks fine; the question is whether we need to do more to help
ease Christian's pain.
Christian - do you think you can use this patch/build and re-run this? If you
cannot do it right away, I propose we move it to hadoop-0.19. I'm OK with
committing this as-is too. Thoughts?
TestCLI failure is unrelated to this patch - HADOOP-3809.
was (Author: acmurthy):
This patch looks fine, the question is whether we need to do more to help
ease Christian's pain?
Christian - do you think you can use this patch/build and re-run this? If you
cannot do it right away, I propose we move it to hadoop-0.19. Thoughts?
> RPC queue overload of JobTracker
> --------------------------------
>
> Key: HADOOP-3813
> URL: https://issues.apache.org/jira/browse/HADOOP-3813
> Project: Hadoop Core
> Issue Type: Bug
> Components: mapred
> Affects Versions: 0.17.1
> Reporter: Christian Kunz
> Assignee: Amareshwari Sriramadasu
> Fix For: 0.17.2, 0.18.0, 0.19.0
>
> Attachments: patch-3813-0.17.txt, patch-3813.txt
>
>
> On a cluster with about 1700 nodes, when a job with about 100,000 maps and
> 10,000 reduces completed, the JobTracker, even with 80 handlers, could not
> handle the rpc call load during promotion of the job, such that at the end,
> because of the discarded heartbeats, the JobTracker lost nearly all
> TaskTrackers (about 10 TaskTrackers left). Promotion took more than 40
> minutes.
> The TaskTrackers reconnected and everything recovered, but this might have
> been just luck.
> Shouldn't there be adaptive throttling of the rate of heartbeats and
> TaskCompletionEvents?
> Sample messages:
> 2008-07-22 18:21:55,831 WARN org.apache.hadoop.ipc.Server: Call queue
> overflow discarding oldest call heartbeat([EMAIL PROTECTED], false, true,
> 18137) from xxx
> 2008-07-22 18:21:55,834 WARN org.apache.hadoop.ipc.Server: Call queue overflow
> discarding oldest call getTaskCompletionEvents(job_200807190635_0012, 119567,
> 50) from yyy
> ...
> 2008-07-22 19:02:28,821 WARN org.apache.hadoop.ipc.Server: IPC Server handler
> 1 on 9020, call heartbeat([EMAIL PROTECTED], false, true, 18199) from zzz:
> discarded for being too old (40936)
> 2008-07-22 19:02:28,821 WARN org.apache.hadoop.ipc.Server: IPC Server handler
> 34 on 9020, call getTaskCompletionEvents(job_200807190635_0012, 119567, 50)
> from uuu: discarded for being too old (40978)
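
The adaptive throttling asked about in the description could look roughly like
the sketch below. It is only an illustration and not the attached patch: the
class name, the thresholds, and the interval bounds are all hypothetical, and
nothing here is an existing Hadoop API. The idea is that the server reports how
full its call queue is, and the interval it hands back to clients doubles when
the queue is nearly full and shrinks slowly once the queue drains.

```java
// Hypothetical sketch of load-based heartbeat throttling.
// None of these names or constants come from Hadoop; they are assumptions.
final class HeartbeatThrottleSketch {
    static final long MIN_INTERVAL_MS = 3_000L;   // assumed lower bound
    static final long MAX_INTERVAL_MS = 60_000L;  // assumed upper bound
    static final long RECOVERY_STEP_MS = 500L;    // assumed additive decrease

    /**
     * Returns the next heartbeat interval to hand to clients.
     *
     * @param currentMs interval currently in use, in milliseconds
     * @param queued    number of calls waiting in the RPC call queue
     * @param capacity  maximum size of the RPC call queue (must be positive)
     */
    static long nextIntervalMillis(long currentMs, int queued, int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        double load = (double) queued / capacity;
        long next;
        if (load > 0.75) {
            next = currentMs * 2;                 // queue nearly full: back off fast
        } else if (load < 0.25) {
            next = currentMs - RECOVERY_STEP_MS;  // queue mostly empty: recover slowly
        } else {
            next = currentMs;                     // in between: hold steady
        }
        // Clamp so clients neither flood the server nor go silent for too long.
        return Math.max(MIN_INTERVAL_MS, Math.min(MAX_INTERVAL_MS, next));
    }

    public static void main(String[] args) {
        long interval = MIN_INTERVAL_MS;
        int capacity = 100;
        // Simulated queue depths: a burst of load followed by a quiet period.
        int[] queueDepths = {90, 95, 100, 60, 10, 5, 0};
        for (int depth : queueDepths) {
            interval = nextIntervalMillis(interval, depth, capacity);
            System.out.println("queued=" + depth + " -> interval=" + interval + " ms");
        }
    }
}
```

Doubling under load and stepping down by a fixed amount keeps the reaction to
overload quick while avoiding oscillation once the queue empties. Any real
change would also have to keep the longest interval well below the timeout
after which a TaskTracker is declared lost.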
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.