[ https://issues.apache.org/jira/browse/YARN-9194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16742061#comment-16742061 ]
Wilfred Spiegelenburg commented on YARN-9194:
---------------------------------------------

Thank you for logging this jira [~xiaoheipangzi]. When I look at the logs I get the impression that the issue is in the way we lock and/or track the node:

{code}
2019-01-13 08:52:11,249 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: hadoop15:43450 Node Transitioned from RUNNING to SHUTDOWN
2019-01-13 08:52:15,221 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1547340702286_0001_01_000001 Container Transitioned from NEW to ALLOCATED
2019-01-13 08:52:15,224 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode: Assigned container container_1547340702286_0001_01_000001 of capacity <memory:2048, vCores:1> on host hadoop15:43450, which has 1 containers, <memory:2048, vCores:1> used and <memory:6144, vCores:7> available after allocation
...
2019-01-13 08:52:15,227 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: assignedContainer queue=root usedCapacity=0.125 absoluteUsedCapacity=0.125 used=<memory:2048, vCores:1> cluster=<memory:16384, vCores:16>
2019-01-13 08:52:15,234 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Allocation proposal accepted
{code}

Based on this log there is a 4 second gap between the node removal and the allocation acceptance. The node was removed *4 seconds* before the {{FiCaSchedulerNode}} finished the allocation and the scheduler confirmed it. That looks strange: we are assigning a container on a node that has already been removed. Based on this we should probably check the proposal and make sure that it is declined when the node has been removed.

I also don't think it is a good idea to fail the application in this case. The container is never started and the failure is inside the scheduler. I don't think failing the application is the correct action when that happens.
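To make the suggested check concrete, here is a minimal sketch of declining a proposal whose target node is no longer tracked. It is not the YARN code: the class {{ProposalCheckSketch}}, the {{Proposal}} type and the methods {{addNode}}, {{removeNode}} and {{tryCommit}} are all hypothetical names made up for this illustration, and the real check would have to live wherever the CapacityScheduler accepts proposals. The sketch assumes that node removal and proposal commit are serialised on the same lock, which is the property the log above suggests is missing.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a scheduler's node tracking; not actual YARN code. */
class ProposalCheckSketch {

    /** Simplified node states, mirroring the RUNNING -> SHUTDOWN transition in the log. */
    enum NodeState { RUNNING, SHUTDOWN }

    /** Hypothetical allocation proposal: one container destined for one node. */
    static final class Proposal {
        final String containerId;
        final String nodeId;

        Proposal(String containerId, String nodeId) {
            this.containerId = containerId;
            this.nodeId = nodeId;
        }
    }

    // Nodes currently known to the scheduler; guarded by the instance lock.
    private final Map<String, NodeState> nodes = new HashMap<>();

    synchronized void addNode(String nodeId) {
        nodes.put(nodeId, NodeState.RUNNING);
    }

    synchronized void removeNode(String nodeId) {
        nodes.remove(nodeId);
    }

    /**
     * Accepts the proposal only if the target node is still tracked and RUNNING
     * at commit time. Because removeNode and tryCommit share one lock, a node
     * cannot disappear between the check and the decision.
     */
    synchronized boolean tryCommit(Proposal proposal) {
        return nodes.get(proposal.nodeId) == NodeState.RUNNING;
    }

    public static void main(String[] args) {
        ProposalCheckSketch scheduler = new ProposalCheckSketch();
        scheduler.addNode("hadoop15:43450");
        Proposal proposal = new Proposal(
            "container_1547340702286_0001_01_000001", "hadoop15:43450");

        System.out.println("before removal, accepted: " + scheduler.tryCommit(proposal));
        scheduler.removeNode("hadoop15:43450");
        System.out.println("after removal, accepted: " + scheduler.tryCommit(proposal));
    }
}
```

With a check like this, a declined proposal can simply be rescheduled on another node, so the application attempt does not have to fail.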
> Invalid event: REGISTERED at FAILED
> -----------------------------------
>
> Key: YARN-9194
> URL: https://issues.apache.org/jira/browse/YARN-9194
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: lujie
> Assignee: lujie
> Priority: Major
> Attachments: YARN-9194_1.patch, YARN-9194_2.patch, YARN-9194_3.patch, hadoop-hires-resourcemanager-hadoop11.log
>
> While the attempt fails, the REGISTERED comes, hence the InvalidStateTransitionException happens.
>
> {code:java}
> 2019-01-13 00:41:57,127 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: App attempt: appattempt_1547311267249_0001_000002 can't handle this event at current state
> org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: REGISTERED at FAILED
> at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
> at org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
> at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
> at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:913)
> at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:121)
> at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:1073)
> at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:1054)
> at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
> at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
> at java.lang.Thread.run(Thread.java:745)
> {code}