[jira] [Commented] (MAPREDUCE-3738) NM can hang during shutdown if AppLogAggregatorImpl thread dies unexpectedly
[ https://issues.apache.org/jira/browse/MAPREDUCE-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13215612#comment-13215612 ]

Hudson commented on MAPREDUCE-3738:
---

Integrated in Hadoop-Mapreduce-0.23-Build #206 (See [https://builds.apache.org/job/Hadoop-Mapreduce-0.23-Build/206/])
merge MAPREDUCE-3738 from trunk (Revision 1293061)

Result = FAILURE
sseth : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1293061
Files :
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregatorImpl.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/TestLogAggregationService.java

> NM can hang during shutdown if AppLogAggregatorImpl thread dies unexpectedly
>
> Key: MAPREDUCE-3738
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-3738
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: mrv2, nodemanager
> Affects Versions: 0.23.1, 0.24.0
> Reporter: Jason Lowe
> Assignee: Jason Lowe
> Priority: Critical
> Fix For: 0.23.2
>
> Attachments: MAPREDUCE-3738.patch, livehistdump.txt
>
> If an AppLogAggregator thread dies unexpectedly (e.g.: uncaught exception
> like OutOfMemoryError in the case I saw) then this will lead to a hang during
> nodemanager shutdown. The NM calls AppLogAggregatorImpl.join() during
> shutdown to make sure log aggregation has completed, and that method
> internally waits for an atomic boolean to be set by the log aggregation
> thread to indicate it has finished. Since the thread was killed off earlier
> due to an uncaught exception, the boolean will never be set and the NM hangs
> during shutdown repeating something like this every second in the log file:
> 2012-01-25 22:20:56,366 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl:
> Waiting for aggregation to complete for application_1326848182580_2806

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
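The hang described in the issue can be reproduced in miniature. The following is only a sketch under stated assumptions: `Aggregator`, `doAggregation()`, `runAndJoin()` and the timeout on `join()` are hypothetical names invented for illustration, not code from `AppLogAggregatorImpl`, and the `finally`-based variant is one possible mitigation rather than a description of what the attached patch does. The timeout exists only so the demonstration terminates; the wait described in the report has no such bound.

```java
import java.util.concurrent.atomic.AtomicBoolean;

final class AggregatorJoinSketch {

    /** Hypothetical stand-in for a log aggregator; not the real AppLogAggregatorImpl. */
    static final class Aggregator implements Runnable {
        private final AtomicBoolean finished = new AtomicBoolean(false);
        private final boolean setFlagInFinally;

        Aggregator(boolean setFlagInFinally) {
            this.setFlagInFinally = setFlagInFinally;
        }

        /** Simulates the aggregation work dying with an Error. */
        private void doAggregation() {
            throw new OutOfMemoryError("simulated for this sketch");
        }

        @Override
        public void run() {
            if (setFlagInFinally) {
                try {
                    doAggregation();
                } finally {
                    finished.set(true); // runs even when an Error propagates
                }
            } else {
                doAggregation();
                finished.set(true); // never reached once doAggregation() throws
            }
        }

        /** Polls the completion flag, but gives up after timeoutMillis (assumption for the demo). */
        boolean join(long timeoutMillis) throws InterruptedException {
            long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
            while (!finished.get()) {
                if (System.nanoTime() - deadline >= 0) {
                    return false; // an unbounded loop would keep waiting here
                }
                Thread.sleep(10);
            }
            return true;
        }
    }

    /** Runs one aggregator until its thread dies, then reports whether join() could complete. */
    static boolean runAndJoin(boolean setFlagInFinally) {
        Aggregator aggregator = new Aggregator(setFlagInFinally);
        Thread worker = new Thread(aggregator, "aggregator-sketch");
        // Silence the default stack trace for the simulated error.
        worker.setUncaughtExceptionHandler((thread, error) -> { });
        worker.start();
        try {
            worker.join(); // the worker thread is dead after this returns
            return aggregator.join(200);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("flag set after work: joined=" + runAndJoin(false));
        System.out.println("flag set in finally: joined=" + runAndJoin(true));
    }
}
```

With the flag set only on the normal path, the waiter never observes completion because the thread that should set it is already gone. Setting it in a `finally` block lets the waiter proceed regardless of how the worker exits.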
[jira] [Commented] (MAPREDUCE-3738) NM can hang during shutdown if AppLogAggregatorImpl thread dies unexpectedly
[ https://issues.apache.org/jira/browse/MAPREDUCE-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13215581#comment-13215581 ]

Hudson commented on MAPREDUCE-3738:
---

Integrated in Hadoop-Hdfs-0.23-Build #178 (See [https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/178/])
merge MAPREDUCE-3738 from trunk (Revision 1293061)

Result = SUCCESS
sseth : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1293061
Files :
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregatorImpl.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/TestLogAggregationService.java
[jira] [Commented] (MAPREDUCE-3738) NM can hang during shutdown if AppLogAggregatorImpl thread dies unexpectedly
[ https://issues.apache.org/jira/browse/MAPREDUCE-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13215335#comment-13215335 ]

Hudson commented on MAPREDUCE-3738:
---

Integrated in Hadoop-Mapreduce-0.23-Commit #589 (See [https://builds.apache.org/job/Hadoop-Mapreduce-0.23-Commit/589/])
merge MAPREDUCE-3738 from trunk (Revision 1293061)

Result = ABORTED
sseth : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1293061
Files :
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregatorImpl.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/TestLogAggregationService.java
[jira] [Commented] (MAPREDUCE-3738) NM can hang during shutdown if AppLogAggregatorImpl thread dies unexpectedly
[ https://issues.apache.org/jira/browse/MAPREDUCE-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13215325#comment-13215325 ]

Hudson commented on MAPREDUCE-3738:
---

Integrated in Hadoop-Hdfs-0.23-Commit #574 (See [https://builds.apache.org/job/Hadoop-Hdfs-0.23-Commit/574/])
merge MAPREDUCE-3738 from trunk (Revision 1293061)

Result = SUCCESS
sseth : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1293061
Files :
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregatorImpl.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/TestLogAggregationService.java
[jira] [Commented] (MAPREDUCE-3738) NM can hang during shutdown if AppLogAggregatorImpl thread dies unexpectedly
[ https://issues.apache.org/jira/browse/MAPREDUCE-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13215320#comment-13215320 ]

Hudson commented on MAPREDUCE-3738:
---

Integrated in Hadoop-Common-0.23-Commit #587 (See [https://builds.apache.org/job/Hadoop-Common-0.23-Commit/587/])
merge MAPREDUCE-3738 from trunk (Revision 1293061)

Result = SUCCESS
sseth : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1293061
Files :
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregatorImpl.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/TestLogAggregationService.java
[jira] [Commented] (MAPREDUCE-3738) NM can hang during shutdown if AppLogAggregatorImpl thread dies unexpectedly
[ https://issues.apache.org/jira/browse/MAPREDUCE-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13215310#comment-13215310 ]

Siddharth Seth commented on MAPREDUCE-3738:
---

+1. Looks good.
[jira] [Commented] (MAPREDUCE-3738) NM can hang during shutdown if AppLogAggregatorImpl thread dies unexpectedly
[ https://issues.apache.org/jira/browse/MAPREDUCE-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13215077#comment-13215077 ]

Hadoop QA commented on MAPREDUCE-3738:
--

+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12515804/MAPREDUCE-3738.patch
against trunk revision .

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 3 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 eclipse:eclipse. The patch built with eclipse:eclipse.
+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed unit tests in .
+1 contrib tests. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1917//testReport/
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1917//console

This message is automatically generated.
[jira] [Commented] (MAPREDUCE-3738) NM can hang during shutdown if AppLogAggregatorImpl thread dies unexpectedly
[ https://issues.apache.org/jira/browse/MAPREDUCE-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13194207#comment-13194207 ]

Vinod Kumar Vavilapalli commented on MAPREDUCE-3738:
---

Comment race;) Even the stack trace during OOM will help.
[jira] [Commented] (MAPREDUCE-3738) NM can hang during shutdown if AppLogAggregatorImpl thread dies unexpectedly
[ https://issues.apache.org/jira/browse/MAPREDUCE-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13194204#comment-13194204 ]

Jason Lowe commented on MAPREDUCE-3738:
---

No bug for the OOM yet; unfortunately the cluster was re-deployed before I could grab a full heap dump. I do have the jmap -histo:live output from one of the nodemanagers but haven't had a chance to go through it yet to see if it helps pinpoint where the leak would be.
[jira] [Commented] (MAPREDUCE-3738) NM can hang during shutdown if AppLogAggregatorImpl thread dies unexpectedly
[ https://issues.apache.org/jira/browse/MAPREDUCE-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13194205#comment-13194205 ]

Vinod Kumar Vavilapalli commented on MAPREDUCE-3738:
---

Originally when I wrote this, I had the same suspicion about the join. But later, I made sure all exceptions were caught and that the boolean gets set in all possible cases. OOM/errors are one thing that didn't occur to me.

Can you debug why you ran into the OOM? We definitely need to fix that, irrespective of how we want to handle other errors.
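The gap Vinod describes, where every exception is caught but an error still slips through, follows from the Java type hierarchy: `OutOfMemoryError` is an `Error`, and `catch (Exception e)` does not match it. A minimal sketch of that behaviour follows; the class and method names (`ErrorVsExceptionSketch`, `classify`) are hypothetical and chosen only for this illustration, and the error is thrown artificially rather than by exhausting the heap.

```java
final class ErrorVsExceptionSketch {

    /**
     * Runs the given work inside a catch (Exception) handler and reports
     * whether the handler saw the failure or the Throwable escaped it.
     */
    static String classify(Runnable work) {
        try {
            try {
                work.run();
                return "completed";
            } catch (Exception e) {
                // Matches IllegalStateException, IOException, and so on.
                return "caught as Exception";
            }
        } catch (Throwable t) {
            // Reached only for Throwables that are not Exceptions, such as Errors.
            return "escaped catch (Exception)";
        }
    }

    public static void main(String[] args) {
        System.out.println(classify(() -> {
            throw new IllegalStateException("ordinary failure");
        }));
        System.out.println(classify(() -> {
            throw new OutOfMemoryError("simulated for this sketch");
        }));
    }
}
```

Cleanup that must run in every case, such as setting a completion flag, is therefore safer in a `finally` block than at the end of a `catch (Exception e)` path.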
[jira] [Commented] (MAPREDUCE-3738) NM can hang during shutdown if AppLogAggregatorImpl thread dies unexpectedly
[ https://issues.apache.org/jira/browse/MAPREDUCE-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13194196#comment-13194196 ]

Mahadev konar commented on MAPREDUCE-3738:
--

Jason, Is there a bug for OOM? What was the reason for that?