[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629348#comment-14629348 ]
Sunil G commented on YARN-3535:
-------------------------------

Hi [~rohithsharma] and [~peng.zhang]

After seeing this patch, I feel there may be a synchronization problem. Please correct me if I am wrong.

In the ContainerRescheduledTransition code, it is used like this:
{code}
+ container.eventHandler.handle(new ContainerRescheduledEvent(container));
+ new FinishedTransition().transition(container, event);
{code}

Hence ContainerRescheduledEvent is fired to the Scheduler dispatcher, which will process {{recoverResourceRequestForContainer}} in a separate thread. Meanwhile, in RMAppImpl, {{FinishedTransition().transition}} will be invoked and closure of this container will be processed. If the Scheduler dispatcher is slow because of a long pending event queue, there is a chance that recoverResourceRequest may not be correct.

I feel we can introduce a new event in {{RMContainerImpl}} that moves the container from ALLOCATED to WAIT_FOR_REQUEST_RECOVERY, and the scheduler can fire an event back to {{RMContainerImpl}} to indicate that recovery of the resource request is completed. This can then move the state forward to KILLED in {{RMContainerImpl}}.

Please share your thoughts.

> ResourceRequest should be restored back to scheduler when RMContainer is
> killed at ALLOCATED
> ---------------------------------------------------------------------------------------------
>
>                 Key: YARN-3535
>                 URL: https://issues.apache.org/jira/browse/YARN-3535
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.6.0
>            Reporter: Peng Zhang
>            Assignee: Peng Zhang
>            Priority: Critical
>         Attachments: 0003-YARN-3535.patch, 0004-YARN-3535.patch,
> 0005-YARN-3535.patch, YARN-3535-001.patch, YARN-3535-002.patch, syslog.tgz,
> yarn-app.log
>
>
> During a rolling update of the NM, the AM failed to start a container on the NM.
> The job then hung there.
> AM logs are attached.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
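
For illustration only, the handshake suggested in the comment above could look roughly like the following. This is a minimal sketch, not YARN code: the state names ALLOCATED, WAIT_FOR_REQUEST_RECOVERY and KILLED come from the comment, while {{RecoveryHandshakeSketch}}, {{SketchContainer}}, {{onKillWhileAllocated}}, {{onRequestRecovered}} and {{runHandshake}} are hypothetical names, and a single-threaded executor is assumed as a stand-in for the Scheduler dispatcher.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class RecoveryHandshakeSketch {

    // State names taken from the comment; the real RMContainerImpl has more states.
    enum State { ALLOCATED, WAIT_FOR_REQUEST_RECOVERY, KILLED }

    // Hypothetical stand-in for RMContainerImpl; not the real class.
    static class SketchContainer {
        private State state = State.ALLOCATED;
        private boolean requestRecovered = false;

        synchronized State getState() { return state; }
        synchronized boolean isRequestRecovered() { return requestRecovered; }

        // Kill requested while ALLOCATED: park the container instead of finishing it.
        synchronized void onKillWhileAllocated() {
            if (state == State.ALLOCATED) {
                state = State.WAIT_FOR_REQUEST_RECOVERY;
            }
        }

        // Fired back by the scheduler once the resource request has been restored.
        // Only now does the container move forward to KILLED.
        synchronized void onRequestRecovered() {
            if (state == State.WAIT_FOR_REQUEST_RECOVERY) {
                requestRecovered = true;
                state = State.KILLED;
            }
        }
    }

    // Runs the handshake once and returns the container for inspection.
    static SketchContainer runHandshake() throws InterruptedException {
        SketchContainer container = new SketchContainer();
        // Single thread assumed as a stand-in for the Scheduler dispatcher.
        ExecutorService schedulerDispatcher = Executors.newSingleThreadExecutor();

        container.onKillWhileAllocated();
        schedulerDispatcher.execute(() -> {
            // Placeholder for the real recovery work
            // (recoverResourceRequestForContainer in the patch).
            container.onRequestRecovered();
        });

        schedulerDispatcher.shutdown();
        schedulerDispatcher.awaitTermination(5, TimeUnit.SECONDS);
        return container;
    }

    public static void main(String[] args) throws InterruptedException {
        SketchContainer c = runHandshake();
        System.out.println(c.getState() + " recovered=" + c.isRequestRecovered());
    }
}
```

The point of the sketch is the ordering: the container cannot reach KILLED until the dispatcher thread has reported back, however long its queue is, so the finish path never runs ahead of the request recovery.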