[ https://issues.apache.org/jira/browse/YARN-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13977886#comment-13977886 ]
Wangda Tan commented on YARN-1842:
----------------------------------

Took a look at this. I'm wondering if it's caused by this case: 1) the client asked to kill the application; 2) after the RM transitioned the application's state to KILLED, but before the AM container was actually killed by the NM, the AM asked to finish the application. Since the RMAppAttempt has already called AMS.unregisterAttempt, the attempt is cleaned from the cache, and thus the InvalidApplicationMasterRequestException is raised. I came to this guess after reading the log uploaded by [~keyki].

Everything still looks good in the following log:

{code}
2014-03-18 19:36:50,802 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1395167286771_0002 State change from ACCEPTED to RUNNING
2014-03-18 19:36:52,534 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1395167286771_0002_01_000002 Container Transitioned from NEW to ALLOCATED
2014-03-18 19:36:52,534 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=keyki OPERATION=AM Allocated Container TARGET=SchedulerApp RESULT=SUCCESS APPID=application_1395167286771_0002 CONTAINERID=container_1395167286771_0002_01_000002
2014-03-18 19:36:52,534 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode: Assigned container container_1395167286771_0002_01_000002 of capacity <memory:1024, vCores:1> on host localhost:56214, which currently has 2 containers, <memory:2048, vCores:2> used and <memory:6144, vCores:6> available
2014-03-18 19:36:52,534 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: assignedContainer application=application_1395167286771_0002 container=Container: [ContainerId: container_1395167286771_0002_01_000002, NodeId: localhost:56214, NodeHttpAddress: localhost:8042, Resource: <memory:1024, vCores:1>, Priority: 1, Token: Token { kind: ContainerToken, service: 127.0.0.1:56214 }, ] containerId=container_1395167286771_0002_01_000002 queue=default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:1024, vCores:1>usedCapacity=0.125, absoluteUsedCapacity=0.125, numApps=1, numContainers=1 usedCapacity=0.125 absoluteUsedCapacity=0.125 used=<memory:1024, vCores:1> cluster=<memory:8192, vCores:8>
2014-03-18 19:36:52,534 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: Re-sorting assigned queue: root.default stats: default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:2048, vCores:2>usedCapacity=0.25, absoluteUsedCapacity=0.25, numApps=1, numContainers=2
2014-03-18 19:36:52,535 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: assignedContainer queue=root usedCapacity=0.25 absoluteUsedCapacity=0.25 used=<memory:2048, vCores:2> cluster=<memory:8192, vCores:8>
2014-03-18 19:36:52,961 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1395167286771_0002_01_000002 Container Transitioned from ALLOCATED to ACQUIRED
2014-03-18 19:36:53,536 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1395167286771_0002_01_000002 Container Transitioned from ACQUIRED to RUNNING
{code}

Then the client asked to kill the application, AMS.unregisterAttempt was called, and the attempt was removed from the AMS cache:

{code}
2014-03-18 19:38:50,427 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=keyki IP=37.139.29.192 OPERATION=Kill Application Request TARGET=ClientRMService RESULT=SUCCESS APPID=application_1395167286771_0002
2014-03-18 19:38:50,427 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Removing info for app: application_1395167286771_0002
2014-03-18 19:38:50,427 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1395167286771_0002 State change from RUNNING to KILLED
2014-03-18 19:38:50,428 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Unregistering app attempt : appattempt_1395167286771_0002_000001
{code}

After that, the AM asked to finish the application, but unfortunately the attempt had already been removed from the cache:

{code}
2014-03-18 19:38:51,397 ERROR org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: AppAttemptId doesnt exist in cache appattempt_1395167286771_0002_000001
2014-03-18 19:38:52,415 ERROR org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Application doesn't exist in cache appattempt_1395167286771_0002_000001
{code}

I'm not sure whether this is possible in the current Hoya design; please correct me if I'm wrong.

> InvalidApplicationMasterRequestException raised during AM-requested shutdown
> ----------------------------------------------------------------------------
>
>                 Key: YARN-1842
>                 URL: https://issues.apache.org/jira/browse/YARN-1842
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.3.0
>            Reporter: Steve Loughran
>            Priority: Minor
>         Attachments: hoyalogs.tar.gz
>
> Report of the RM raising a stack trace [https://gist.github.com/matyix/9596735] during AM-initiated shutdown. The AM could just swallow this and exit, but it could be a sign of a race condition YARN-side, or maybe just in the RM client code/AM dual-signalling the shutdown.
> I haven't replicated this myself; maybe the stack will help track down the problem. Otherwise: what is the policy YARN apps should adopt for AMs handling errors on shutdown? Go straight to an exit(-1)?

--
This message was sent by Atlassian JIRA
(v6.2#6252)
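The race sketched in the comment can be modeled with a small, self-contained toy program (plain Java; the class and method names below are hypothetical stand-ins, not the real ApplicationMasterService code): the client-initiated kill path unregisters the attempt from a response cache first, so a later finish call from the still-running AM finds nothing and is rejected, mirroring the InvalidApplicationMasterRequestException in the log.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy model of the kill-vs-finish race (hypothetical names, not YARN source).
public class AmsCacheRace {

    // Stand-in for the AMS per-attempt response cache.
    static final Map<String, Object> responseMap = new ConcurrentHashMap<>();

    // Kill path: the RM unregisters the attempt, removing it from the cache.
    static void unregisterAttempt(String attemptId) {
        responseMap.remove(attemptId);
    }

    // AM-initiated finish: fails once the attempt is no longer in the cache,
    // analogous to InvalidApplicationMasterRequestException.
    static void finishApplicationMaster(String attemptId) {
        if (!responseMap.containsKey(attemptId)) {
            throw new IllegalStateException(
                "Application doesn't exist in cache " + attemptId);
        }
        responseMap.remove(attemptId);
    }

    public static void main(String[] args) {
        String attemptId = "appattempt_1395167286771_0002_000001";
        responseMap.put(attemptId, new Object());

        // 1) Client kill arrives: RM unregisters the attempt immediately.
        unregisterAttempt(attemptId);

        // 2) The AM container, not yet killed by the NM, still tries to finish.
        try {
            finishApplicationMaster(attemptId);
            System.out.println("AM finish succeeded");
        } catch (IllegalStateException e) {
            System.out.println("AM finish rejected: " + e.getMessage());
        }
    }
}
```

In the real system the two steps run in different processes with real network latency between them, which is what opens the window; the sketch only compresses that ordering into one thread to show why the AM's finish request can no longer be matched to an attempt.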