[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13657091#comment-13657091
 ] 

Robert Joseph Evans commented on MAPREDUCE-5124:
------------------------------------------------

Also, I am very nervous about the possibility of deadlocks if the 
AsyncDispatcher can block.  The simple case would be two threads in the async 
dispatcher.  The queue is almost full and some RPC calls come in that fill it 
completely.  At the same time, each of the two running threads also tries to 
insert something into the dispatcher and blocks.  Now we cannot process anything 
anymore because there are no free threads left to drain the queue.
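
The scenario above can be sketched in a few lines of Java.  This is only an 
illustration and not the actual AsyncDispatcher code: the class and method 
names are made up, and a timed offer() stands in for a blocking put() so that 
the sketch terminates instead of hanging.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch; none of these names come from Hadoop.
class BlockingDispatcherSketch {

    /** Returns how many handler threads could not enqueue a follow-up event. */
    static int stalledHandlers(int workers, int capacity, long timeoutMs) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(capacity);

        // Incoming RPC calls fill the bounded queue completely.
        for (int i = 0; i < capacity; i++) {
            queue.add("rpc-event-" + i);
        }

        ExecutorService pool = Executors.newFixedThreadPool(workers);
        AtomicInteger stalled = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(workers);

        for (int w = 0; w < workers; w++) {
            final int id = w;
            pool.execute(() -> {
                try {
                    // Each dispatcher thread is inside a handler and emits a
                    // follow-up event. With put() it would wait forever,
                    // because every thread that could drain the queue is
                    // stuck at this same point.
                    boolean ok = queue.offer("follow-up-" + id,
                                             timeoutMs, TimeUnit.MILLISECONDS);
                    if (!ok) {
                        stalled.incrementAndGet();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    done.countDown();
                }
            });
        }

        try {
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
        return stalled.get();
    }
}
```

With two workers and a full queue, both handlers time out, which is the point 
where a blocking insert would never return.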

I would really prefer a model where the RPC changes so that it can respond with 
a "try again later".  That way, if the AM/RM is falling behind, the RPC layer 
can detect that and throttle new events coming into the system.  We would also 
need to change some of the logic in the system so that a node/task is not 
declared dead just because the AM/RM was so far behind in events that it kept 
telling the heartbeats to try again later until the node/task timed out.  But 
that would probably just be a minimal event or, even better, a simple 
concurrent data structure.
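
A rough sketch of what I have in mind, assuming a bounded event queue and a 
concurrent map for liveness.  All of the names here (ThrottlingEndpointSketch, 
Response, heartbeat, isExpired) are hypothetical and do not exist in the 
Hadoop RPC layer; this only shows the shape of the idea.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch; not the Hadoop RPC or AM/RM API.
class ThrottlingEndpointSketch {

    enum Response { ACCEPTED, TRY_AGAIN_LATER }

    private final BlockingQueue<String> events;
    // Liveness is tracked outside the event queue, so a rejected heartbeat
    // still counts as a sign of life.
    private final ConcurrentMap<String, Long> lastHeard = new ConcurrentHashMap<>();

    ThrottlingEndpointSketch(int capacity) {
        this.events = new ArrayBlockingQueue<>(capacity);
    }

    /** Called by the RPC layer; never blocks the calling thread. */
    Response heartbeat(String taskId, String event, long nowMs) {
        lastHeard.put(taskId, nowMs);
        // offer() returns false immediately when the queue is full.
        return events.offer(event) ? Response.ACCEPTED : Response.TRY_AGAIN_LATER;
    }

    /** A task is expired only if nothing was heard from it within timeoutMs. */
    boolean isExpired(String taskId, long nowMs, long timeoutMs) {
        Long last = lastHeard.get(taskId);
        return last == null || nowMs - last > timeoutMs;
    }

    /** Called by the dispatcher thread to drain one event, or null if empty. */
    String poll() {
        return events.poll();
    }
}
```

The caller that gets TRY_AGAIN_LATER is expected to back off and resend; since 
the timestamp was already updated, the retry does not push the task toward 
its liveness timeout.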
                
> AM lacks flow control for task events
> -------------------------------------
>
>                 Key: MAPREDUCE-5124
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5124
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am
>    Affects Versions: 2.0.3-alpha, 0.23.5
>            Reporter: Jason Lowe
>         Attachments: MAPREDUCE-5124-prototype.txt
>
>
> The AM does not have any flow control to limit the incoming rate of events 
> from tasks.  If the AM is unable to keep pace with the rate of incoming 
> events for a sufficient period of time then it will eventually exhaust the 
> heap and crash.  MAPREDUCE-5043 addressed a major bottleneck for event 
> processing, but the AM could still get behind if it's starved for CPU and/or 
> handling a very large job with tens of thousands of active tasks.

