[ https://issues.apache.org/jira/browse/MAPREDUCE-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16212598#comment-16212598 ]
Jason Lowe commented on MAPREDUCE-5124:
---------------------------------------

bq. Could you give me the location of the code that handles this situation in the Resource Manager?

It's between RMNodeImpl and the scheduler. RMNodeImpl's StatusUpdateWhenHealthyTransition used to blindly send a scheduler event for every node heartbeat, but now it queues up the container updates and only sends an event if it knows there isn't one already pending. The scheduler then pulls all container updates from the queue when it receives a node status update event. See the handling of nextHeartBeat and nodeUpdateQueue in RMNodeImpl.

bq. Are you suggesting to update TaskAttemptImpl.reportedStatus directly?

Maybe, or just have a separate field representing the pending status that the TaskAttempt will reference when receiving a status update event. I haven't looked at it closely enough to see whether it makes more sense to update reportedStatus directly rather than keep the pending status separate until the status update event is processed. Some processing occurs when the status is updated, so TaskAttemptImpl would need some way to know whether that has already happened when a status update event arrives. Keeping the pending status in a separate field makes that straightforward to distinguish.

bq. We'd still put the same number of TaskAttemptStatusUpdateEvent to the dispatcher's queue, so with a lot of tasks, it could still put a considerable stress on the AM.

Again, the vast majority of the stress comes from the _payload_ of all those events. We'd be eliminating that overhead, and processing a status event when there's no pending status payload to update would be very fast. So the end result would be a much smaller heap (i.e.: much less GC pressure, which is what leads to the vicious cycle) and faster processing of the extra status update events (i.e.: the AM can catch up much faster).
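The coalescing described above can be sketched roughly as follows. This is a minimal, hypothetical illustration of the pattern, not the actual Hadoop code: the class name, the use of a plain String payload, and the eventsSent counter (standing in for a dispatcher.handle() call) are all invented for the example.

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch of the coalescing pattern: newer status reports overwrite the
// pending one, and a payload-free notification event is sent only when
// no notification is already outstanding on the dispatcher queue.
class CoalescingAttemptStatus {
    // Latest unprocessed status; null means nothing is pending.
    private final AtomicReference<String> pendingStatus = new AtomicReference<>();
    private int eventsSent = 0;

    // Called for every incoming status report (e.g. from the umbilical).
    void reportStatus(String status) {
        // If there was no pending status, a notification event is needed;
        // otherwise the already-queued event will pick up this newer value.
        if (pendingStatus.getAndSet(status) == null) {
            eventsSent++; // stands in for sending a status-update event
        }
    }

    // Dispatcher-side handling: consume and clear the pending status.
    String onStatusUpdateEvent() {
        return pendingStatus.getAndSet(null);
    }

    int eventsSent() {
        return eventsSent;
    }
}
```

With this shape, ten rapid reports from one attempt produce one queued event carrying no payload, and the handler reads only the latest status when it finally runs.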
Although now that I think about it, we can also eliminate the extra status events by doing something similar to the RMNodeImpl-YarnScheduler interaction. If we know we are coalescing task status updates, then we know a previous status update has not been processed yet, so we can avoid sending a redundant status update event. That would eliminate both the payload overhead and the event overhead.

> AM lacks flow control for task events
> -------------------------------------
>
>                 Key: MAPREDUCE-5124
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5124
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am
>    Affects Versions: 2.0.3-alpha, 0.23.5
>            Reporter: Jason Lowe
>            Assignee: Haibo Chen
>         Attachments: MAPREDUCE-5124-proto.2.txt, MAPREDUCE-5124-prototype.txt
>
>
> The AM does not have any flow control to limit the incoming rate of events
> from tasks. If the AM is unable to keep pace with the rate of incoming
> events for a sufficient period of time, it will eventually exhaust the
> heap and crash. MAPREDUCE-5043 addressed a major bottleneck for event
> processing, but the AM could still get behind if it's starved for CPU and/or
> handling a very large job with tens of thousands of active tasks.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)