[ https://issues.apache.org/jira/browse/YARN-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13765048#comment-13765048 ]
Xuan Gong commented on YARN-1149:
---------------------------------

The new patch addresses several other issues:

1. It adds more transitions to FINISHING_CONTAINERS_WAIT, APPLICATION_RESOURCES_CLEANINGUP and FINISHED.
2. While the applications are shutting down, it is quite possible that other applications are added to the context.

{code}
setBlockNewContainerRequests(true);
{code}

Set this at the beginning of ContainerManager::serviceStop() to block any new container requests.

{code}
try {
  // Give applications a moment before sending the shutdown event.
  Thread.sleep(1000);
  this.handle(
      new CMgrCompletedAppsEvent(
          new ArrayList<ApplicationId>(applications.keySet()),
          CMgrCompletedAppsEvent.Reason.ON_SHUTDOWN));
} catch (InterruptedException ex) {
  LOG.warn("Interrupted while sleeping on applications finish on shutdown", ex);
}
{code}

The same is done in the ShutDown block and the Resync block.

Old applications (those that are already in the context and have already received the FINISH_APPLICATION event) will ignore these events, since they are already in FINISHING_CONTAINERS_WAIT or APPLICATION_RESOURCES_CLEANINGUP. Newly added applications will start to shut down when they receive the FINISH_APPLICATION event.
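To make the intended behaviour concrete, here is a minimal, self-contained sketch of the idea described above. It is not the patch itself: the class names (ShutdownSketch, App, Manager), the startApp/stop methods, and the choice to treat FINISHED as an "ignore" state are all assumptions made for illustration. Only the state names and the FINISH_APPLICATION event come from the comment.

```java
import java.util.EnumSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the NM application state handling; not Hadoop code.
public class ShutdownSketch {
    enum AppState { RUNNING, FINISHING_CONTAINERS_WAIT, APPLICATION_RESOURCES_CLEANINGUP, FINISHED }

    // States in which a repeated FINISH_APPLICATION is ignored.
    // Including FINISHED here is an assumption of this sketch.
    static final Set<AppState> ALREADY_FINISHING = EnumSet.of(
        AppState.FINISHING_CONTAINERS_WAIT,
        AppState.APPLICATION_RESOURCES_CLEANINGUP,
        AppState.FINISHED);

    static class App {
        AppState state = AppState.RUNNING;

        // Models FINISH_APPLICATION: returns true if it caused a transition,
        // false if the application was already shutting down and ignored it.
        boolean finishApplication() {
            if (ALREADY_FINISHING.contains(state)) {
                return false;
            }
            state = AppState.FINISHING_CONTAINERS_WAIT;
            return true;
        }
    }

    static class Manager {
        final Map<String, App> applications = new LinkedHashMap<>();
        private boolean blockNewContainerRequests = false;

        // New applications are rejected once shutdown has begun.
        boolean startApp(String id) {
            if (blockNewContainerRequests) {
                return false;
            }
            applications.putIfAbsent(id, new App());
            return true;
        }

        // Block new requests first, then send the finish event to every
        // application currently in the context. Returns how many transitioned.
        int stop() {
            blockNewContainerRequests = true;
            int transitioned = 0;
            for (App app : applications.values()) {
                if (app.finishApplication()) {
                    transitioned++;
                }
            }
            return transitioned;
        }
    }

    public static void main(String[] args) {
        Manager m = new Manager();
        m.startApp("app_1");
        m.applications.get("app_1").finishApplication(); // "old" app, already finishing
        m.startApp("app_2");                             // added late, still RUNNING
        System.out.println(m.stop());                    // only app_2 transitions
        System.out.println(m.startApp("app_3"));         // rejected after stop()
    }
}
```

In this sketch, sending the finish event to every application is safe precisely because applications that are already shutting down treat it as a no-op, which is the property the added transitions are meant to provide.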
> NM throws InvalidStateTransitonException: Invalid event: APPLICATION_LOG_HANDLING_FINISHED at RUNNING
> -----------------------------------------------------------------------------------------------------
>
>                 Key: YARN-1149
>                 URL: https://issues.apache.org/jira/browse/YARN-1149
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Ramya Sunil
>            Assignee: Xuan Gong
>             Fix For: 2.1.1-beta
>
>         Attachments: YARN-1149.1.patch, YARN-1149.2.patch, YARN-1149.3.patch, YARN-1149.4.patch
>
>
> When nodemanager receives a kill signal when an application has finished execution but log aggregation has not kicked in, InvalidStateTransitonException: Invalid event: APPLICATION_LOG_HANDLING_FINISHED at RUNNING is thrown
> {noformat}
> 2013-08-25 20:45:00,875 INFO logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:finishLogAggregation(254)) - Application just finished : application_1377459190746_0118
> 2013-08-25 20:45:00,876 INFO logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:uploadLogsForContainer(105)) - Starting aggregate log-file for app application_1377459190746_0118 at /app-logs/foo/logs/application_1377459190746_0118/<host>_45454.tmp
> 2013-08-25 20:45:00,876 INFO logaggregation.LogAggregationService (LogAggregationService.java:stopAggregators(151)) - Waiting for aggregation to complete for application_1377459190746_0118
> 2013-08-25 20:45:00,891 INFO logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:uploadLogsForContainer(122)) - Uploading logs for container container_1377459190746_0118_01_000004. Current good log dirs are /tmp/yarn/local
> 2013-08-25 20:45:00,915 INFO logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:doAppLogAggregation(182)) - Finished aggregate log-file for app application_1377459190746_0118
> 2013-08-25 20:45:00,925 WARN application.Application (ApplicationImpl.java:handle(427)) - Can't handle this event at current state
> org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: APPLICATION_LOG_HANDLING_FINISHED at RUNNING
>     at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
>     at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>     at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl.handle(ApplicationImpl.java:425)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl.handle(ApplicationImpl.java:59)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ApplicationEventDispatcher.handle(ContainerManagerImpl.java:697)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ApplicationEventDispatcher.handle(ContainerManagerImpl.java:689)
>     at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:134)
>     at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:81)
>     at java.lang.Thread.run(Thread.java:662)
> 2013-08-25 20:45:00,926 INFO application.Application (ApplicationImpl.java:handle(430)) - Application application_1377459190746_0118 transitioned from RUNNING to null
> 2013-08-25 20:45:00,927 WARN monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(463)) - org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl is interrupted. Exiting.
> 2013-08-25 20:45:00,938 INFO ipc.Server (Server.java:stop(2437)) - Stopping server on 8040
> {noformat}