[ https://issues.apache.org/jira/browse/YARN-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13977886#comment-13977886 ]
Wangda Tan commented on YARN-1842:
----------------------------------

Took a look at this. I'm wondering if it's caused by this case: 1) the client asked to kill the application; 2) after the RM transitioned the application's state to KILLED, but before the AM container was actually killed by the NM, the AM asked to finish the application. Since the RMAppAttempt has already called AMS.unregisterAttempt, the attempt is cleaned from the cache, and thus the InvalidApplicationMasterRequestException is raised. I came to this guess after reading the log uploaded by [~keyki].

Everything still looks good in the following log:

{code}
2014-03-18 19:36:50,802 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1395167286771_0002 State change from ACCEPTED to RUNNING
2014-03-18 19:36:52,534 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1395167286771_0002_01_000002 Container Transitioned from NEW to ALLOCATED
2014-03-18 19:36:52,534 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=keyki OPERATION=AM Allocated Container TARGET=SchedulerApp RESULT=SUCCESS APPID=application_1395167286771_0002 CONTAINERID=container_1395167286771_0002_01_000002
2014-03-18 19:36:52,534 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode: Assigned container container_1395167286771_0002_01_000002 of capacity <memory:1024, vCores:1> on host localhost:56214, which currently has 2 containers, <memory:2048, vCores:2> used and <memory:6144, vCores:6> available
2014-03-18 19:36:52,534 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: assignedContainer application=application_1395167286771_0002 container=Container: [ContainerId: container_1395167286771_0002_01_000002, NodeId: localhost:56214, NodeHttpAddress: localhost:8042, Resource: <memory:1024, vCores:1>, Priority: 1, Token: Token { kind: ContainerToken, service: 127.0.0.1:56214 }, ] containerId=container_1395167286771_0002_01_000002 queue=default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:1024, vCores:1>usedCapacity=0.125, absoluteUsedCapacity=0.125, numApps=1, numContainers=1 usedCapacity=0.125 absoluteUsedCapacity=0.125 used=<memory:1024, vCores:1> cluster=<memory:8192, vCores:8>
2014-03-18 19:36:52,534 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: Re-sorting assigned queue: root.default stats: default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:2048, vCores:2>usedCapacity=0.25, absoluteUsedCapacity=0.25, numApps=1, numContainers=2
2014-03-18 19:36:52,535 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: assignedContainer queue=root usedCapacity=0.25 absoluteUsedCapacity=0.25 used=<memory:2048, vCores:2> cluster=<memory:8192, vCores:8>
2014-03-18 19:36:52,961 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1395167286771_0002_01_000002 Container Transitioned from ALLOCATED to ACQUIRED
2014-03-18 19:36:53,536 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1395167286771_0002_01_000002 Container Transitioned from ACQUIRED to RUNNING
{code}

Then the client asked to kill the application, AMS.unregisterAttempt was called, and the attempt was removed from the AMS cache:

{code}
2014-03-18 19:38:50,427 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=keyki IP=37.139.29.192 OPERATION=Kill Application Request TARGET=ClientRMService RESULT=SUCCESS APPID=application_1395167286771_0002
2014-03-18 19:38:50,427 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Removing info for app: application_1395167286771_0002
2014-03-18 19:38:50,427 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1395167286771_0002 State change from RUNNING to KILLED
2014-03-18 19:38:50,428 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Unregistering app attempt : appattempt_1395167286771_0002_000001
{code}

After that, the AM asked to finish the application, but unfortunately the attempt had already been removed from the cache:

{code}
2014-03-18 19:38:51,397 ERROR org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: AppAttemptId doesnt exist in cache appattempt_1395167286771_0002_000001
2014-03-18 19:38:52,415 ERROR org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Application doesn't exist in cache appattempt_1395167286771_0002_000001
{code}

I'm not sure whether this is possible in the current Hoya design; please correct me if I'm wrong.

> InvalidApplicationMasterRequestException raised during AM-requested shutdown
> ----------------------------------------------------------------------------
>
>                 Key: YARN-1842
>                 URL: https://issues.apache.org/jira/browse/YARN-1842
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.3.0
>            Reporter: Steve Loughran
>            Priority: Minor
>         Attachments: hoyalogs.tar.gz
>
> Report of the RM raising a stack trace [https://gist.github.com/matyix/9596735] during AM-initiated shutdown. The AM could just swallow this and exit, but it could be a sign of a race condition YARN-side, or maybe just in the RM client code/AM dual-signalling the shutdown.
> I haven't replicated this myself; maybe the stack will help track down the problem. Otherwise: what is the policy YARN apps should adopt for AMs handling errors on shutdown? Go straight to an exit(-1)?

--
This message was sent by Atlassian JIRA
(v6.2#6252)
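The race sketched in the comment can be modeled with a small, self-contained toy program (plain Java; the class and method names below are hypothetical stand-ins, not the real ApplicationMasterService code): the client-initiated kill path unregisters the attempt from a response cache first, so a later finish call from the still-running AM finds nothing and is rejected, mirroring the InvalidApplicationMasterRequestException in the log.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy model of the kill-vs-finish race (hypothetical names, not YARN source).
public class AmsCacheRace {

    // Stand-in for the AMS per-attempt response cache.
    static final Map<String, Object> responseMap = new ConcurrentHashMap<>();

    // Kill path: the RM unregisters the attempt, removing it from the cache.
    static void unregisterAttempt(String attemptId) {
        responseMap.remove(attemptId);
    }

    // AM-initiated finish: fails once the attempt is no longer in the cache,
    // analogous to InvalidApplicationMasterRequestException.
    static void finishApplicationMaster(String attemptId) {
        if (!responseMap.containsKey(attemptId)) {
            throw new IllegalStateException(
                "Application doesn't exist in cache " + attemptId);
        }
        responseMap.remove(attemptId);
    }

    public static void main(String[] args) {
        String attemptId = "appattempt_1395167286771_0002_000001";
        responseMap.put(attemptId, new Object());

        // 1) Client kill arrives: RM unregisters the attempt immediately.
        unregisterAttempt(attemptId);

        // 2) The AM container, not yet killed by the NM, still tries to finish.
        try {
            finishApplicationMaster(attemptId);
            System.out.println("AM finish succeeded");
        } catch (IllegalStateException e) {
            System.out.println("AM finish rejected: " + e.getMessage());
        }
    }
}
```

In the real system the two steps run in different processes with real network latency between them, which is what opens the window; the sketch only compresses that ordering into one thread to show why the AM's finish request can no longer be matched to an attempt.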