[jira] [Commented] (MAPREDUCE-5124) AM lacks flow control for task events

Jason Lowe (JIRA) Fri, 10 Nov 2017 07:00:37 -0800

    [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16247608#comment-16247608
 ]


Jason Lowe commented on MAPREDUCE-5124:
---------------------------------------

Thanks for the patch!

I'm not a fan of TaskAttemptListenerImpl knowing about TaskAttemptImpl.  It 
circumvents the TaskAttempt interface, and I don't think it's necessary.  The 
only reason it wants to know about TaskAttemptImpl is so it can stash the 
status update there, but we could stash it elsewhere, which may be cleaner.

For example, TaskAttemptListenerImpl could keep a collection of status updates, 
and the status update event could point to where the listener is holding the 
status.  The object in the async event would need to be an AtomicReference or 
something fancier so the status can be updated while the async event is in 
flight.  The first thing the TaskAttemptImpl would do when receiving the 
status event is atomically swap the status reference with null.  The listener 
can then tell whether the attempt has already received the previous status by 
checking whether the old reference was null when it swaps a new status in.
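
A minimal sketch of that handoff, to make the swap semantics concrete.  The 
names StatusSketch and StatusSlot are hypothetical and are not in the patch or 
in Hadoop; the only real API used is java.util.concurrent.atomic.AtomicReference.  
This version simply replaces a pending status and does not try to merge it.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for the real status object; only the shape matters.
final class StatusSketch {
    final float progress;

    StatusSketch(float progress) {
        this.progress = progress;
    }
}

// Hypothetical holder kept by the listener, one per attempt.  The async
// status event would carry a reference to this slot instead of the status.
final class StatusSlot {
    private final AtomicReference<StatusSketch> ref = new AtomicReference<>();

    // Listener side: swap the new status in.  A null previous value means
    // nothing was pending (the attempt already took the last status, or none
    // was ever published), so the caller should send a new async event.
    boolean publish(StatusSketch status) {
        return ref.getAndSet(status) == null;
    }

    // Attempt side: the first thing done on receiving the event is to swap
    // the reference with null and process whatever was there.
    StatusSketch take() {
        return ref.getAndSet(null);
    }
}
```

With this shape the listener only enqueues an event when publish() returns 
true, so at most one status event per attempt is in flight at any time.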

I don't think it's OK to simply clobber a previous status with a subsequent 
status.  For example, if the previous status has counters and the later status 
does not, we should preserve the counters from the previous status.  Similarly, 
if there are fetch failures reported in the previous status, those need to be 
copied into the subsequent status.  This will make atomic updates of the status 
trickier, and we may need some locking so that the attempt cannot consume a 
status while the listener is still in the process of coalescing it.
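
As a rough illustration of the coalescing concern, here is a sketch that 
merges under a lock instead of clobbering.  PendingStatus and CoalescingSlot 
are hypothetical names with simplified fields (the real status carries much 
more), and a plain synchronized block is just one assumed way to do the 
locking.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified status used only for this sketch.
final class PendingStatus {
    float progress;
    Map<String, Long> counters;  // null when the update carries no counters
    final List<String> fetchFailedMaps = new ArrayList<>();
}

// Hypothetical holder kept by the listener, one per attempt.
final class CoalescingSlot {
    private final Object lock = new Object();
    private PendingStatus pending;

    // Listener side.  Returns true when nothing was pending, meaning the
    // caller needs to send a new async event to the attempt.
    boolean offer(PendingStatus update) {
        synchronized (lock) {
            if (pending == null) {
                pending = update;
                return true;
            }
            // Preserve counters from the earlier status if the later one
            // has none.
            if (update.counters == null) {
                update.counters = pending.counters;
            }
            // Carry earlier fetch failures forward, ahead of the newer ones.
            update.fetchFailedMaps.addAll(0, pending.fetchFailedMaps);
            pending = update;
            return false;
        }
    }

    // Attempt side.  Taking under the same lock means the attempt can never
    // observe a half-merged status.
    PendingStatus take() {
        synchronized (lock) {
            PendingStatus taken = pending;
            pending = null;
            return taken;
        }
    }
}
```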

> AM lacks flow control for task events
> -------------------------------------
>
>                 Key: MAPREDUCE-5124
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5124
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am
>    Affects Versions: 2.0.3-alpha, 0.23.5
>            Reporter: Jason Lowe
>            Assignee: Peter Bacsko
>         Attachments: MAPREDUCE-5124-CoalescingPOC-1.patch, 
> MAPREDUCE-5124-CoalescingPOC2.patch, MAPREDUCE-5124-proto.2.txt, 
> MAPREDUCE-5124-prototype.txt
>
>
> The AM does not have any flow control to limit the incoming rate of events 
> from tasks.  If the AM is unable to keep pace with the rate of incoming 
> events for a sufficient period of time then it will eventually exhaust the 
> heap and crash.  MAPREDUCE-5043 addressed a major bottleneck for event 
> processing, but the AM could still get behind if it's starved for CPU and/or 
> handling a very large job with tens of thousands of active tasks.



