[ 
https://issues.apache.org/jira/browse/YARN-9194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16742061#comment-16742061
 ] 

Wilfred Spiegelenburg commented on YARN-9194:
---------------------------------------------

Thank you for logging this jira [~xiaoheipangzi].

When I look at the logs I get the impression that the issue is in the way we 
lock and/or track the node:
{code}
2019-01-13 08:52:11,249 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: hadoop15:43450 Node Transitioned from RUNNING to SHUTDOWN
2019-01-13 08:52:15,221 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1547340702286_0001_01_000001 Container Transitioned from NEW to ALLOCATED
2019-01-13 08:52:15,224 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode: Assigned container container_1547340702286_0001_01_000001 of capacity <memory:2048, vCores:1> on host hadoop15:43450, which has 1 containers, <memory:2048, vCores:1> used and <memory:6144, vCores:7> available after allocation
...
2019-01-13 08:52:15,227 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: assignedContainer queue=root usedCapacity=0.125 absoluteUsedCapacity=0.125 used=<memory:2048, vCores:1> cluster=<memory:16384, vCores:16>
2019-01-13 08:52:15,234 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Allocation proposal accepted
{code}
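
As a quick sanity check on the timing, the gap between the node transition and 
the accepted proposal can be computed from the two timestamps in the log above. 
This is only a small standalone sketch; the class and method names 
({{LogGap}}, {{gapMillis}}) are made up for illustration and are not part of 
YARN.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Illustrative helper, not YARN code: measures the time between two RM log lines.
final class LogGap {
    // Timestamp layout used by the log lines quoted above.
    private static final DateTimeFormatter FMT =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    // Returns the number of milliseconds from 'earlier' to 'later'.
    static long gapMillis(String earlier, String later) {
        LocalDateTime a = LocalDateTime.parse(earlier, FMT);
        LocalDateTime b = LocalDateTime.parse(later, FMT);
        return Duration.between(a, b).toMillis();
    }

    public static void main(String[] args) {
        // Node SHUTDOWN transition vs. "Allocation proposal accepted".
        System.out.println(
            gapMillis("2019-01-13 08:52:11,249", "2019-01-13 08:52:15,234"));
    }
}
```

For the two timestamps above this comes out at 3985 ms, so just under 4 seconds.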

Based on this log, the node was removed *4 seconds* before the allocation on 
the {{FiCaSchedulerNode}} finished and the scheduler accepted the proposal. 
That looks strange: we are assigning a container on a node that has already 
been removed. Based on this we should probably check the proposal and make 
sure that it is declined when the node has been removed.
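
To make the idea concrete, here is a rough sketch of the kind of check I have 
in mind. It does not use the real scheduler classes: {{NodeTracker}}, 
{{AllocationProposal}} and {{tryCommit}} are hypothetical stand-ins, and the 
actual change would have to fit into the existing proposal handling and its 
locking.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch, not the CapacityScheduler implementation.
final class ProposalCheckSketch {

    // Stand-in for the scheduler's view of which nodes are still usable.
    static final class NodeTracker {
        private final Set<String> activeNodes = ConcurrentHashMap.newKeySet();

        void addNode(String nodeId) { activeNodes.add(nodeId); }
        void removeNode(String nodeId) { activeNodes.remove(nodeId); }
        boolean isActive(String nodeId) { return activeNodes.contains(nodeId); }
    }

    // Stand-in for a proposal: one container targeted at one node.
    static final class AllocationProposal {
        final String containerId;
        final String nodeId;

        AllocationProposal(String containerId, String nodeId) {
            this.containerId = containerId;
            this.nodeId = nodeId;
        }
    }

    // Accept the proposal only if the target node is still tracked.
    // Holding the tracker's monitor here is an assumption of this sketch: the
    // real node removal would need to take the same lock for the check to be
    // race free.
    static boolean tryCommit(NodeTracker tracker, AllocationProposal proposal) {
        synchronized (tracker) {
            return tracker.isActive(proposal.nodeId);
        }
    }

    public static void main(String[] args) {
        NodeTracker tracker = new NodeTracker();
        tracker.addNode("hadoop15:43450");
        AllocationProposal proposal = new AllocationProposal(
            "container_1547340702286_0001_01_000001", "hadoop15:43450");

        tracker.removeNode("hadoop15:43450");          // node goes away first
        System.out.println(tryCommit(tracker, proposal)); // proposal is declined
    }
}
```

The point is only that the node lookup and the accept/decline decision happen 
together, so a proposal for a node that is already gone never gets accepted.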

I also don't think it is a good idea to fail the application in this case. The 
container is never started and the failure is inside the scheduler. I don't 
think failing the application is the correct action when that happens.

> Invalid event: REGISTERED at FAILED
> -----------------------------------
>
>                 Key: YARN-9194
>                 URL: https://issues.apache.org/jira/browse/YARN-9194
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: lujie
>            Assignee: lujie
>            Priority: Major
>         Attachments: YARN-9194_1.patch, YARN-9194_2.patch, YARN-9194_3.patch, 
> hadoop-hires-resourcemanager-hadoop11.log
>
>
> While the attempt is failing, the REGISTERED event arrives, hence the 
> InvalidStateTransitionException is thrown.
>  
> {code:java}
> 2019-01-13 00:41:57,127 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: App attempt: appattempt_1547311267249_0001_000002 can't handle this event at current state
> org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: REGISTERED at FAILED
> at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
> at org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
> at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
> at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:913)
> at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:121)
> at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:1073)
> at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:1054)
> at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
> at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
> at java.lang.Thread.run(Thread.java:745)
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
