[ https://issues.apache.org/jira/browse/YARN-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13977446#comment-13977446 ]
Brian Murphy commented on YARN-1842:
------------------------------------

Hey there,

We are seeing this bug occur while shutting down Samza containers as well. We are running Hadoop 2.3.0 on Ubuntu 12.10. The container hangs indefinitely in the KILLING state. Here is the stack trace:

{code}
2014-04-22 20:25:08 SamzaAppMaster$ [ERROR] Error occured in amClient's callback
org.apache.samza.SamzaException: Received a reboot signal from the RM, so throwing an exception to reboot the AM.
    at org.apache.samza.job.yarn.SamzaAppMasterLifecycle.onReboot(SamzaAppMasterLifecycle.scala:59)
    at org.apache.samza.job.yarn.SamzaAppMaster$$anonfun$onShutdownRequest$1.apply(SamzaAppMaster.scala:136)
    at org.apache.samza.job.yarn.SamzaAppMaster$$anonfun$onShutdownRequest$1.apply(SamzaAppMaster.scala:136)
    at scala.collection.immutable.List.foreach(List.scala:318)
    at org.apache.samza.job.yarn.SamzaAppMaster$.onShutdownRequest(SamzaAppMaster.scala:136)
    at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$CallbackHandlerThread.run(AMRMClientAsyncImpl.java:285)
2014-04-22 20:25:09 ELContextCleaner [INFO] javax.el.BeanELResolver purged
2014-04-22 20:25:09 ContextHandler [INFO] stopped o.e.j.w.WebAppContext{/,jar:file:/mnt/data/hadoop/yarn/usercache/brian/appcache/application_1397507485520_0040/filecache/10/samza-job-package-0.7.0-dist.tar.gz/lib/samza-yarn_2.10-0.7.0.jar!/scalate}
2014-04-22 20:25:10 ELContextCleaner [INFO] javax.el.BeanELResolver purged
2014-04-22 20:25:10 ContextHandler [INFO] stopped o.e.j.w.WebAppContext{/,jar:file:/mnt/data/hadoop/yarn/usercache/brian/appcache/application_1397507485520_0040/filecache/10/samza-job-package-0.7.0-dist.tar.gz/lib/samza-yarn_2.10-0.7.0.jar!/scalate}
2014-04-22 20:25:10 SamzaAppMasterLifecycle [INFO] Shutting down.
2014-04-22 20:25:10 SamzaAppMaster$ [WARN] Listener org.apache.samza.job.yarn.SamzaAppMasterLifecycle@3c9ead34 failed to shutdown.
org.apache.hadoop.yarn.exceptions.InvalidApplicationMasterRequestException: Application doesn't exist in cache appattempt_1397507485520_0040_000001
    at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.throwApplicationDoesNotExistInCacheException(ApplicationMasterService.java:329)
    at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.finishApplicationMaster(ApplicationMasterService.java:288)
    at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.finishApplicationMaster(ApplicationMasterProtocolPBServiceImpl.java:75)
    at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:97)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1962)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1958)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1956)
{code}

> InvalidApplicationMasterRequestException raised during AM-requested shutdown
> ----------------------------------------------------------------------------
>
>                 Key: YARN-1842
>                 URL: https://issues.apache.org/jira/browse/YARN-1842
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.3.0
>            Reporter: Steve Loughran
>            Priority: Minor
>         Attachments: hoyalogs.tar.gz
>
>
> Report of the RM raising a stack trace
> [https://gist.github.com/matyix/9596735] during AM-initiated shutdown. The AM
> could just swallow this and exit, but it could be a sign of a race condition
> YARN-side, or maybe just in the RM client code/AM dual signalling the
> shutdown.
> I haven't replicated this myself; maybe the stack will help track down the
> problem. Otherwise: what is the policy YARN apps should adopt for AMs
> handling errors on shutdown? Go straight to an exit(-1)?

--
This message was sent by Atlassian JIRA
(v6.2#6252)