[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13478496#comment-13478496
 ] 

Jason Lowe commented on MAPREDUCE-4730:
---------------------------------------

Here's what I have gathered so far from a heap dump of an AM attempt that was 
just about to run out of memory.  Most of the memory was consumed by byte 
buffers, and most of those looked like RPC response buffers.

I think there might be a flow control issue in the IPC layer that led to this.  
More than half of the mappers finished before the first reducer started, and 
thousands of reducers all launched within a few seconds of each other.  They 
all asked the AM for map task completion events, and the AM currently caps 
each response at 10000 events per query.  Since more than 10000 maps completed 
before the first reducers started, each reducer received a full event list, 
which took around 900K per response buffer.  There were many IPC Handler threads 
to service the calls, but only one Responder thread to send out the rather 
large response buffers.  I see there's a blocking queue to prevent too many 
calls from coming in at once, but I didn't see any flow control between the 
Handlers and the Responder thread.  It appears that as long as the Handler 
threads can keep the call queue relatively short, they can continue to queue 
up call response data faster than the Responder thread can send it out.  
Eventually this will exhaust available memory, leading to an OOM.
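
To make the missing piece concrete, here is a minimal sketch of what flow 
control between the Handlers and the Responder could look like.  This is not 
the Hadoop IPC code: the class name, method names, and queue capacity are all 
made up for illustration, and I'm assuming 900K means 900 * 1024 bytes.  
Handlers hand responses to a bounded queue and block when it is full, so the 
single responder sets the pace.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch, not Hadoop's ipc.Server.
public class ResponderBackpressureSketch {
    // ~900K per response, as seen in the heap dump (assuming K = 1024 bytes).
    static final long RESPONSE_BYTES = 900L * 1024;

    /** Bytes held if `pending` full-size responses are queued at once. */
    static long pendingBytes(int pending) {
        return pending * RESPONSE_BYTES;
    }

    /**
     * Handler threads put() responses into a bounded queue; put() blocks
     * while the queue is full.  The calling thread plays the lone responder.
     * Returns {responses sent, peak queue depth observed}.
     */
    static int[] run(int handlers, int callsPerHandler, int capacity)
            throws InterruptedException {
        BlockingQueue<Integer> pending = new ArrayBlockingQueue<>(capacity);
        AtomicInteger peak = new AtomicInteger();
        int total = handlers * callsPerHandler;

        Thread[] producers = new Thread[handlers];
        for (int h = 0; h < handlers; h++) {
            producers[h] = new Thread(() -> {
                try {
                    for (int c = 0; c < callsPerHandler; c++) {
                        pending.put(1); // stand-in for one response buffer
                        peak.accumulateAndGet(pending.size(), Math::max);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            producers[h].start();
        }

        int sent = 0;
        while (sent < total) {
            pending.take(); // the responder drains one response at a time
            sent++;
        }
        for (Thread t : producers) {
            t.join();
        }
        return new int[] {sent, peak.get()};
    }

    public static void main(String[] args) throws InterruptedException {
        int[] r = run(8, 100, 4);
        System.out.println("sent=" + r[0] + " peakQueued=" + r[1]);
        System.out.println("3000 unsent responses hold "
                + pendingBytes(3000) + " bytes");
    }
}
```

With the bound in place the queued response data never exceeds capacity times 
the response size, at the cost of stalling Handlers when the Responder falls 
behind.  Without it, the worst case for this job is roughly one 900K response 
outstanding per reducer, which pendingBytes(3000) puts at about 2.7 GB.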
                
> AM crashes due to OOM while serving up map task completion events
> -----------------------------------------------------------------
>
>                 Key: MAPREDUCE-4730
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4730
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.3
>            Reporter: Jason Lowe
>            Priority: Blocker
>
> We're seeing a repeatable OOM crash in the AM for a task with around 30000 
> maps and 3000 reducers.  Details to follow.
