[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16154108#comment-16154108
 ] 

Jason Lowe commented on MAPREDUCE-5124:
---------------------------------------

bq. I think either the server needs to control the heartbeat to minimize the 
delay (indeed too big a change), or the task needs to tweak the heartbeat 
interval based on the previous response time as Peter Bacsko has suggested.

The issue here isn't that tasks see a long delay in heartbeat response time 
and fail to react to it.  The problem is that the AM accepts heartbeats and 
responds to them quickly, at a rate far higher than it can actually process 
them in the background AsyncDispatcher thread.  In other words, by the time a 
task notices a significant delay in heartbeat processing time, the AM has 
probably already started going into GC hell, and it's likely too late to 
course-correct at that point.  The only way to get reliable feedback on how 
long the processing is really taking is to make the heartbeat processing 
synchronous, so the task doesn't get a response until the processing has 
actually completed.  Without async RPC call support, that approach ties up 
the server handler threads, which prevents more important calls from being 
processed in a timely manner.
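
To make the difference concrete, here is a rough sketch of the two 
acknowledgement styles.  This is not the AM's actual code: the class 
HeartbeatFeedbackSketch, its method names, and the latch used to simulate a 
stalled dispatcher are all made up for illustration.  The single-thread 
executor stands in for the AsyncDispatcher thread and, like it, queues work 
without any bound.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for the AM heartbeat path; not Hadoop code. */
class HeartbeatFeedbackSketch {
    // One background thread with an unbounded queue, playing the dispatcher.
    private final ExecutorService dispatcher = Executors.newSingleThreadExecutor();
    // Heartbeats accepted but not yet fully processed.
    private final AtomicInteger pending = new AtomicInteger();
    // While this latch is closed, the simulated dispatcher cannot make progress.
    private final CountDownLatch gate;

    HeartbeatFeedbackSketch(CountDownLatch gate) {
        this.gate = gate;
    }

    private Future<?> enqueue() {
        pending.incrementAndGet();
        return dispatcher.submit(() -> {
            try {
                gate.await(); // "processing" stalls until the gate opens
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                pending.decrementAndGet();
            }
        });
    }

    /** Asynchronous style: reply at once, process in the background later. */
    void heartbeatAsync() {
        enqueue();
    }

    /** Synchronous style: reply only after processing has completed. */
    void heartbeatSync() throws Exception {
        enqueue().get();
    }

    int backlog() {
        return pending.get();
    }

    void shutdown() throws InterruptedException {
        dispatcher.shutdown();
        dispatcher.awaitTermination(10, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws Exception {
        CountDownLatch gate = new CountDownLatch(1);
        HeartbeatFeedbackSketch am = new HeartbeatFeedbackSketch(gate);

        // Every call returns immediately even though nothing is being
        // processed, so the callers see no delay while the backlog grows.
        for (int i = 0; i < 1000; i++) {
            am.heartbeatAsync();
        }
        System.out.println("backlog after async acks: " + am.backlog());

        gate.countDown(); // let the dispatcher catch up

        // This call returns only after everything queued ahead of it, and
        // the heartbeat itself, has been processed.
        am.heartbeatSync();
        System.out.println("backlog after sync ack: " + am.backlog());

        am.shutdown();
    }
}
```

In the asynchronous case the caller's response time says nothing about the 
backlog, which is why a task-side interval tweak would react too late.  In 
the synchronous case the caller is blocked for the full processing time, 
which is the handler-thread cost mentioned above.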

> AM lacks flow control for task events
> -------------------------------------
>
>                 Key: MAPREDUCE-5124
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5124
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am
>    Affects Versions: 2.0.3-alpha, 0.23.5
>            Reporter: Jason Lowe
>            Assignee: Haibo Chen
>         Attachments: MAPREDUCE-5124-proto.2.txt, MAPREDUCE-5124-prototype.txt
>
>
> The AM does not have any flow control to limit the incoming rate of events 
> from tasks.  If the AM is unable to keep pace with the rate of incoming 
> events for a sufficient period of time then it will eventually exhaust the 
> heap and crash.  MAPREDUCE-5043 addressed a major bottleneck for event 
> processing, but the AM could still get behind if it's starved for CPU and/or 
> handling a very large job with tens of thousands of active tasks.


