[ https://issues.apache.org/jira/browse/MAPREDUCE-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jason Lowe updated MAPREDUCE-3738:
----------------------------------

    Attachment: MAPREDUCE-3738.patch

Patch to ensure we always set the finished boolean in the log aggregation thread.

On a side note we haven't seen a reoccurrence of the OOM condition on the nodemanager, so we haven't been able to track down what caused it.

> NM can hang during shutdown if AppLogAggregatorImpl thread dies unexpectedly
> ----------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3738
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3738
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2, nodemanager
>    Affects Versions: 0.23.1, 0.24.0
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Critical
>         Attachments: MAPREDUCE-3738.patch, livehistdump.txt
>
>
> If an AppLogAggregator thread dies unexpectedly (e.g.: uncaught exception
> like OutOfMemoryError in the case I saw) then this will lead to a hang during
> nodemanager shutdown. The NM calls AppLogAggregatorImpl.join() during
> shutdown to make sure log aggregation has completed, and that method
> internally waits for an atomic boolean to be set by the log aggregation
> thread to indicate it has finished. Since the thread was killed off earlier
> due to an uncaught exception, the boolean will never be set and the NM hangs
> during shutdown repeating something like this every second in the log file:
> 2012-01-25 22:20:56,366 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl:
> Waiting for aggregation to complete for application_1326848182580_2806
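
For readers who want to see the failure mode and the general shape of the fix, here is a minimal standalone sketch. It is not the attached patch and does not use the real AppLogAggregatorImpl code; the class AggregatorSketch, its nested Aggregator, and the runAndJoin helper are hypothetical names invented for illustration. It assumes the usual try/finally idiom is what "always set the finished boolean" amounts to, and it adds a timeout to join() purely so the demo cannot hang (the join() described in the issue waits indefinitely).

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class AggregatorSketch {

    /** Hypothetical stand-in for a log aggregation worker (not the Hadoop class). */
    static class Aggregator implements Runnable {
        private final AtomicBoolean finished = new AtomicBoolean(false);
        private final Runnable work;

        Aggregator(Runnable work) {
            this.work = work;
        }

        @Override
        public void run() {
            try {
                work.run();
            } finally {
                // Executes on normal return and when any Throwable escapes,
                // so a waiter polling the flag is always released.
                finished.set(true);
            }
        }

        /**
         * Polls the flag until it is set or the timeout expires.
         * The timeout is an assumption added for this demo only.
         */
        boolean join(long timeoutMillis) throws InterruptedException {
            long deadline = System.currentTimeMillis() + timeoutMillis;
            while (!finished.get()) {
                if (System.currentTimeMillis() >= deadline) {
                    return false;
                }
                Thread.sleep(10);
            }
            return true;
        }
    }

    /** Runs the work on its own thread and reports whether join() was released. */
    static boolean runAndJoin(Runnable work, long timeoutMillis) {
        Aggregator aggregator = new Aggregator(work);
        Thread thread = new Thread(aggregator, "aggregator-sketch");
        // Suppress the default stack trace for the simulated failure.
        thread.setUncaughtExceptionHandler((t, err) -> { });
        thread.start();
        try {
            return aggregator.join(timeoutMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        boolean released = runAndJoin(() -> {
            throw new IllegalStateException("simulated aggregation failure");
        }, 2000);
        System.out.println("join released after worker failure: " + released);
    }
}
```

If the finished.set(true) call were placed after work.run() without the finally block, a worker that dies from an uncaught exception would leave the flag unset and the waiter would keep polling, which matches the hang described in the issue.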