[ https://issues.apache.org/jira/browse/MAPREDUCE-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16212598#comment-16212598 ]

Jason Lowe commented on MAPREDUCE-5124:
---------------------------------------

bq. Could you give me the location of the code that handles this situation in 
the Resource Manager?

It's between RMNodeImpl and the scheduler.  RMNodeImpl's 
StatusUpdateWhenHealthyTransition used to blindly send a scheduler event for 
every node heartbeat, but now it queues up the container updates and only sends 
an event if it knows there isn't one already pending.  The scheduler then pulls 
all container updates from the queue when it receives a node status update 
event.  See the handling of nextHeartBeat and nodeUpdateQueue in RMNodeImpl.
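
To make that concrete, here's a rough sketch of the coalescing pattern.  This 
is not the actual RMNodeImpl code; apart from the nextHeartBeat and 
nodeUpdateQueue names, the class, types, and methods are placeholders.

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

/** Simplified sketch of the RMNodeImpl/scheduler coalescing pattern. */
class NodeStatusCoalescer {

  /** Placeholder for the per-heartbeat container status payload. */
  static class ContainerUpdate { }

  // Payloads from every heartbeat are always queued (RMNodeImpl's nodeUpdateQueue)...
  private final Queue<ContainerUpdate> nodeUpdateQueue =
      new ConcurrentLinkedQueue<ContainerUpdate>();
  // ...but a new scheduler event is sent only when none is pending (RMNodeImpl's nextHeartBeat).
  private volatile boolean nextHeartBeat = true;

  /** Called for every node heartbeat; returns true if a NODE_UPDATE event should be dispatched. */
  boolean recordHeartbeat(ContainerUpdate update) {
    nodeUpdateQueue.add(update);
    if (nextHeartBeat) {
      nextHeartBeat = false;
      return true;   // one scheduler event covers any number of queued heartbeats
    }
    return false;    // an event is already pending; this payload rides along with it
  }

  /** Called by the scheduler when it handles the event: drain everything queued so far. */
  List<ContainerUpdate> pullContainerUpdates() {
    List<ContainerUpdate> updates = new ArrayList<ContainerUpdate>();
    ContainerUpdate u;
    while ((u = nodeUpdateQueue.poll()) != null) {
      updates.add(u);
    }
    nextHeartBeat = true;   // the next heartbeat may trigger a new event
    return updates;
  }
}
{code}

The key point is that payloads always land in the queue, but only the first 
heartbeat after the scheduler has drained it produces a new event.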

bq. Are you suggesting to update TaskAttemptImpl.reportedStatus directly?

Maybe, or just have a separate field representing the pending status that the 
TaskAttempt references when it receives a status update event.  I haven't 
looked at it closely enough to see whether it makes more sense to update 
reportedStatus directly rather than keep the pending status separate until the 
status update event is processed.  There is a bit of processing that occurs 
when the status is updated, so TaskAttemptImpl would need some way to know 
whether that processing has already happened by the time a status update event 
arrives.  Keeping the pending status separate makes the two cases easy to 
distinguish.
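
For illustration, a minimal sketch of the separate-field approach, with 
hypothetical names (the nested payload class stands in for the real 
TaskAttemptStatus):

{code:java}
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical sketch of holding the pending status outside the event payload. */
class PendingStatusHolder {

  /** Stand-in for the real TaskAttemptStatus payload. */
  static class TaskAttemptStatus {
    float progress;
    String stateString;
  }

  // Latest unprocessed status report from the task; null means nothing is pending.
  private final AtomicReference<TaskAttemptStatus> pendingStatus =
      new AtomicReference<TaskAttemptStatus>();

  /** Called on the umbilical (RPC) side: a newer report simply replaces an older unprocessed one. */
  void reportStatus(TaskAttemptStatus status) {
    pendingStatus.set(status);
  }

  /** Called when TaskAttemptImpl processes the status update event; null if already processed. */
  TaskAttemptStatus takePendingStatus() {
    return pendingStatus.getAndSet(null);
  }
}
{code}

With one slot per attempt, the heap holds at most one unprocessed payload per 
task no matter how far behind the dispatcher falls, and a null result tells 
TaskAttemptImpl that the processing has already happened.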

bq. We'd still put the same number of TaskAttemptStatusUpdateEvent to the 
dispatcher's queue, so with a lot of tasks, it could still put a considerable 
stress on the AM.

Again, the vast majority of the stress comes from the _payload_ of all those 
events.  We'd be eliminating that overhead, and processing a status event when 
there's no pending status payload to apply would be very fast.  So the end 
result would be a much smaller heap (i.e.: much less of the GC pressure that 
feeds the vicious cycle) and faster processing of the extra status update 
events (i.e.: the AM can catch up much faster).

Although, now that I think about it, we can also eliminate the extra status 
events by doing something similar to the RMNodeImpl-YarnScheduler interaction.  
If we can tell that we are coalescing a task status update, then we know a 
previous status update has not been processed yet, so we can avoid sending a 
redundant status update event.  That would eliminate both the payload overhead 
and the event overhead.
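
A sketch of that combination, again with hypothetical names: the pending-status 
slot doubles as the "is an event already queued" signal, playing the same role 
that nextHeartBeat plays for RMNodeImpl.

{code:java}
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical sketch: coalesce payloads and suppress redundant events in one step. */
class CoalescingStatusReporter {

  /** Stand-in for the real TaskAttemptStatus payload. */
  static class TaskAttemptStatus {
    float progress;
  }

  private final AtomicReference<TaskAttemptStatus> pendingStatus =
      new AtomicReference<TaskAttemptStatus>();

  /**
   * Called on the RPC side.  Returns true only when no report was already pending,
   * i.e. only the first report since the attempt last caught up needs a new
   * status update event on the dispatcher queue.
   */
  boolean reportStatus(TaskAttemptStatus status) {
    return pendingStatus.getAndSet(status) == null;
  }

  /** Called when the status update event is processed: take whatever is pending. */
  TaskAttemptStatus takePendingStatus() {
    return pendingStatus.getAndSet(null);
  }
}
{code}

If reportStatus returns false, a previous report is still waiting and the event 
already queued for it will pick up the newer payload, so neither the payload 
nor the event is duplicated.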


> AM lacks flow control for task events
> -------------------------------------
>
>                 Key: MAPREDUCE-5124
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5124
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am
>    Affects Versions: 2.0.3-alpha, 0.23.5
>            Reporter: Jason Lowe
>            Assignee: Haibo Chen
>         Attachments: MAPREDUCE-5124-proto.2.txt, MAPREDUCE-5124-prototype.txt
>
>
> The AM does not have any flow control to limit the incoming rate of events 
> from tasks.  If the AM is unable to keep pace with the rate of incoming 
> events for a sufficient period of time then it will eventually exhaust the 
> heap and crash.  MAPREDUCE-5043 addressed a major bottleneck for event 
> processing, but the AM could still get behind if it's starved for CPU and/or 
> handling a very large job with tens of thousands of active tasks.


