[ 
https://issues.apache.org/jira/browse/YARN-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13765708#comment-13765708
 ] 

Zhijie Shen commented on YARN-1149:
-----------------------------------

bq. adding more transition to 
FINISHING_CONTAINERS_WAIT,APPLICATION_RESOURCES_CLEANINGUP and FINISHED.

Make sense. It seems that without these transitions, even the current code is 
at the risk of InvalidStateTransitonException. For example, if Application 
transits from INITING to FINISHING_CONTAINERS_WAIT on FINISH_APPLICATION, and 
then receives APPLICATION_INITED. 

bq.  When the applications start to shut down. It is very possible that there 
are another applications added into the context. 
setBlockNewContainerRequests(true);

+1

bq. Also do this at the ShutDown block and Resync block. For all old 
applications (which have already in context and have already received the 
FINISH_APPLICATION event), they will ignore the events since they are already 
in FINISHING_CONTAINERS_WAIT or APPLICATION_RESOURCES_CLEANINGUP. For all the 
newly added applications, when they receives the FINISH_APPLICATION event, they 
will start to shut down.

I'm not sure it's a good idea. It creates additional overhead, as the events 
will be sent every 1s. As the new requests will be blocked, what we need to 
prevent is that while stopService() is running, startContainers() shouldn't be 
on the half way, where the application is created but not added into the 
context. Then, if unfortunately CMgrCompletedAppsEvent is sent at this time, 
the newly created application will not be notified. Maybe we can use 
synchronization to prevent stopService() and startContainers() from being 
interpolated

                
> NM throws InvalidStateTransitonException: Invalid event: 
> APPLICATION_LOG_HANDLING_FINISHED at RUNNING
> -----------------------------------------------------------------------------------------------------
>
>                 Key: YARN-1149
>                 URL: https://issues.apache.org/jira/browse/YARN-1149
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Ramya Sunil
>            Assignee: Xuan Gong
>             Fix For: 2.1.1-beta
>
>         Attachments: YARN-1149.1.patch, YARN-1149.2.patch, YARN-1149.3.patch, 
> YARN-1149.4.patch
>
>
> When nodemanager receives a kill signal when an application has finished 
> execution but log aggregation has not kicked in, 
> InvalidStateTransitonException: Invalid event: 
> APPLICATION_LOG_HANDLING_FINISHED at RUNNING is thrown
> {noformat}
> 2013-08-25 20:45:00,875 INFO  logaggregation.AppLogAggregatorImpl 
> (AppLogAggregatorImpl.java:finishLogAggregation(254)) - Application just 
> finished : application_1377459190746_0118
> 2013-08-25 20:45:00,876 INFO  logaggregation.AppLogAggregatorImpl 
> (AppLogAggregatorImpl.java:uploadLogsForContainer(105)) - Starting aggregate 
> log-file for app application_1377459190746_0118 at 
> /app-logs/foo/logs/application_1377459190746_0118/<host>_45454.tmp
> 2013-08-25 20:45:00,876 INFO  logaggregation.LogAggregationService 
> (LogAggregationService.java:stopAggregators(151)) - Waiting for aggregation 
> to complete for application_1377459190746_0118
> 2013-08-25 20:45:00,891 INFO  logaggregation.AppLogAggregatorImpl 
> (AppLogAggregatorImpl.java:uploadLogsForContainer(122)) - Uploading logs for 
> container container_1377459190746_0118_01_000004. Current good log dirs are 
> /tmp/yarn/local
> 2013-08-25 20:45:00,915 INFO  logaggregation.AppLogAggregatorImpl 
> (AppLogAggregatorImpl.java:doAppLogAggregation(182)) - Finished aggregate 
> log-file for app application_1377459190746_0118
> 2013-08-25 20:45:00,925 WARN  application.Application 
> (ApplicationImpl.java:handle(427)) - Can't handle this event at current state
> org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: 
> APPLICATION_LOG_HANDLING_FINISHED at RUNNING
>         at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
>  
>         at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>         at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl.handle(ApplicationImpl.java:425)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl.handle(ApplicationImpl.java:59)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ApplicationEventDispatcher.handle(ContainerManagerImpl.java:697)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ApplicationEventDispatcher.handle(ContainerManagerImpl.java:689)
>         at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:134)
>         at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:81)   
>         at java.lang.Thread.run(Thread.java:662)
> 2013-08-25 20:45:00,926 INFO  application.Application 
> (ApplicationImpl.java:handle(430)) - Application 
> application_1377459190746_0118 transitioned from RUNNING to null
> 2013-08-25 20:45:00,927 WARN  monitor.ContainersMonitorImpl 
> (ContainersMonitorImpl.java:run(463)) - 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
>  is interrupted. Exiting.
> 2013-08-25 20:45:00,938 INFO  ipc.Server (Server.java:stop(2437)) - Stopping 
> server on 8040
> {noformat}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to