[ https://issues.apache.org/jira/browse/MAPREDUCE-4448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13415567#comment-13415567 ]
Jason Lowe commented on MAPREDUCE-4448:
---------------------------------------

Log from one of the crashes shown below. Note the error during log aggregation init on app startup that later leads to a fatal error when the app finishes.

{noformat}
[main]2012-07-13 20:35:21,019 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Start request for container_1342210962593_0007_01_000001 by user x
[IPC Server handler 0 on 8041]2012-07-13 20:35:21,043 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Creating a new application reference for app application_1342210962593_0007
[IPC Server handler 0 on 8041]2012-07-13 20:35:21,050 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Application application_1342210962593_0007 transitioned from NEW to INITING
[AsyncDispatcher event handler]2012-07-13 20:35:21,051 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Adding container_1342210962593_0007_01_000001 to application application_1342210962593_0007
[AsyncDispatcher event handler]2012-07-13 20:35:21,062 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:x (auth:SIMPLE) cause:javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
[AsyncDispatcher event handler]2012-07-13 20:35:21,063 WARN org.apache.hadoop.ipc.Client: Exception encountered while connecting to the server : javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
[AsyncDispatcher event handler]2012-07-13 20:35:21,063 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:x (auth:SIMPLE) cause:java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
[AsyncDispatcher event handler]2012-07-13 20:35:21,063 ERROR org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService: Failed to create user dir [hdfs://xx:8020/mapred/logs/x] while processing app application_1342210962593_0007
[AsyncDispatcher event handler]2012-07-13 20:35:21,064 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:x (auth:SIMPLE) cause:java.io.IOException: Failed on local exception: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]; Host Details : local host is: "xx/xx.xx.xx.xx"; destination host is: ""x":8020;
[AsyncDispatcher event handler]2012-07-13 20:35:21,065 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Application application_1342210962593_0007 transitioned from INITING to FINISHING_CONTAINERS_WAIT
[AsyncDispatcher event handler]2012-07-13 20:35:21,067 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1342210962593_0007_01_000001 transitioned from NEW to DONE
[AsyncDispatcher event handler]2012-07-13 20:35:21,067 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Removing container_1342210962593_0007_01_000001 from application application_1342210962593_0007
[AsyncDispatcher event handler]2012-07-13 20:35:21,069 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Application application_1342210962593_0007 transitioned from FINISHING_CONTAINERS_WAIT to APPLICATION_RESOURCES_CLEANINGUP
[AsyncDispatcher event handler]2012-07-13 20:35:21,070 FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread
[AsyncDispatcher event handler]org.apache.hadoop.yarn.YarnException: Application is not initialized yet for container_1342210962593_0007_01_000001
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.stopContainer(LogAggregationService.java:347)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:381)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:65)
        at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:126)
        at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
        at java.lang.Thread.run(Thread.java:619)
2012-07-13 20:35:21,071 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Exiting, bbye..
[AsyncDispatcher event handler]2012-07-13 20:35:21,072 WARN org.apache.hadoop.yarn.event.AsyncDispatcher: AsyncDispatcher thread interrupted
[AsyncDispatcher event handler]java.lang.InterruptedException
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:1961)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1996)
        at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)
        at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:69)
        at java.lang.Thread.run(Thread.java:619)
2012-07-13 20:35:21,072 INFO org.apache.hadoop.yarn.service.AbstractService: Service:Dispatcher is stopped.
[Thread-1]2012-07-13 20:35:21,073 INFO org.mortbay.log: Stopped SelectChannelConnector@0.0.0.0:8042
[Thread-1]2012-07-13 20:35:21,075 INFO org.apache.hadoop.yarn.service.AbstractService: Service:org.apache.hadoop.yarn.server.nodemanager.webapp.WebServer is stopped.
[Thread-1]2012-07-13 20:35:21,075 INFO org.apache.hadoop.ipc.Server: Stopping server on 8041
[Thread-1]2012-07-13 20:35:21,076 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server listener on 8041
[IPC Server listener on 8041]2012-07-13 20:35:21,077 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService: org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService waiting for pending aggregation during exit
[Thread-1]2012-07-13 20:35:21,077 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server Responder
[IPC Server Responder]2012-07-13 20:35:21,077 INFO org.apache.hadoop.yarn.service.AbstractService: Service:org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService is stopped.
{noformat}

The problem is that one application with a bad token can bring down every nodemanager that ran a container for it. MAPREDUCE-4302 fixed a similar crash when log aggregation failed to start, but it missed this crash in the cleanup case.

> Nodemanager crashes upon application cleanup if aggregation failed to start
> ---------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-4448
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4448
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2, nodemanager
>    Affects Versions: 0.23.3, 2.0.1-alpha
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Critical
>
> When log aggregation is enabled, the nodemanager can crash if log aggregation
> for an application failed to start.
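
The failure mode described in the comment (aggregation init fails for an app, then a later stop event for the same app throws on the dispatcher thread) can be sketched in isolation. The Java below is only a minimal illustration under stated assumptions, not the Hadoop code and not the patch for this issue: `LogAggregationSketch`, `initApp`, `stopContainerStrict`, and `stopContainerTolerant` are hypothetical names, and `IllegalStateException` stands in for `YarnException`.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical stand-in for a log aggregation event handler; not Hadoop code. */
class LogAggregationSketch {
    // Apps whose aggregation setup succeeded. An app whose init failed is absent.
    private final Set<String> initializedApps = new HashSet<>();

    // Number of stop events ignored because the app was never initialized.
    int skippedStops = 0;

    /** Simulates app init; returns false (and registers nothing) on bad credentials. */
    boolean initApp(String appId, boolean credentialsValid) {
        if (!credentialsValid) {
            return false; // corresponds to the "Failed to create user dir" case in the log
        }
        initializedApps.add(appId);
        return true;
    }

    /** Strict variant: an unknown app is treated as fatal, as in the stack trace. */
    void stopContainerStrict(String appId, String containerId) {
        if (!initializedApps.contains(appId)) {
            throw new IllegalStateException(
                "Application is not initialized yet for " + containerId);
        }
    }

    /** Tolerant variant: an unknown app is recorded and skipped instead of thrown. */
    boolean stopContainerTolerant(String appId, String containerId) {
        if (!initializedApps.contains(appId)) {
            skippedStops++;
            return false; // nothing to aggregate for this container
        }
        return true;
    }

    public static void main(String[] args) {
        LogAggregationSketch svc = new LogAggregationSketch();
        svc.initApp("app_bad_token", false);
        boolean handled = svc.stopContainerTolerant("app_bad_token", "container_1");
        System.out.println("stop handled: " + handled + ", skipped: " + svc.skippedStops);
    }
}
```

In the strict variant, an exception escaping the handler on a shared dispatcher thread affects every application on that node, which matches the impact described above. The tolerant variant confines the failure to the one application whose init failed.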