[ https://issues.apache.org/jira/browse/YARN-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14484920#comment-14484920 ]
zhihai xu commented on YARN-3464:
---------------------------------

This issue only happens for PRIVATE/APPLICATION resource localization.
We saw it when PRIVATE LocalizerResourceRequestEvents were interleaved with
PUBLIC LocalizerResourceRequestEvents in the following order:

PRIVATE1
PRIVATE2
..........
PRIVATEm
PUBLIC1
PUBLIC2
.....
PUBLICn
PRIVATEm+1
PRIVATEm+2

The last two PRIVATE LocalizerResourceRequestEvents are added after all of the
previous m PRIVATE LocalizerResourceRequestEvents have been LOCALIZED, because
of the delay in processing the n PUBLIC LocalizerResourceRequestEvents.
The container then stays in the LOCALIZING state until it is killed by the AM.

> Race condition in LocalizerRunner causes container localization timeout.
> ------------------------------------------------------------------------
>
>                 Key: YARN-3464
>                 URL: https://issues.apache.org/jira/browse/YARN-3464
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: zhihai xu
>            Assignee: zhihai xu
>            Priority: Critical
>
> Race condition in LocalizerRunner causes container localization timeout.
> Currently LocalizerRunner will kill the ContainerLocalizer when the pending list
> for LocalizerResourceRequestEvent is empty.
> {code}
>   } else if (pending.isEmpty()) {
>     action = LocalizerAction.DIE;
>   }
> {code}
> If a LocalizerResourceRequestEvent is added after LocalizerRunner kills the
> ContainerLocalizer due to an empty pending list, this
> LocalizerResourceRequestEvent will never be handled.
> The container will stay in the LOCALIZING state until it is killed by the
> AM due to TASK_TIMEOUT.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
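
The ordering described above can be replayed with a small deterministic simulation. This is only a sketch under stated assumptions, not the NodeManager code: `LocalizerRaceSketch`, `Runner`, `addResource`, `heartbeat`, and `simulate` are hypothetical names, and the model assumes each heartbeat hands out at most one pending request. Only the `pending.isEmpty()` / `LocalizerAction.DIE` branch is taken from the issue description.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical, simplified model of the race; not the real NodeManager classes.
final class LocalizerRaceSketch {

    enum LocalizerAction { LIVE, DIE }

    /** Stand-in for LocalizerRunner: holds the pending PRIVATE requests. */
    static final class Runner {
        private final Queue<String> pending = new ArrayDeque<>();
        private final List<String> localized = new ArrayList<>();
        private boolean localizerExited = false;

        /** Models a LocalizerResourceRequestEvent reaching the runner. */
        void addResource(String request) {
            pending.add(request);
        }

        /** Models one heartbeat from the ContainerLocalizer (assumed: one resource per heartbeat). */
        LocalizerAction heartbeat() {
            LocalizerAction action = LocalizerAction.LIVE;
            if (localizerExited) {
                // The localizer process is gone; nothing will be fetched any more.
                action = LocalizerAction.DIE;
            } else if (pending.isEmpty()) {
                // Branch quoted in the issue: an empty pending list kills the localizer.
                action = LocalizerAction.DIE;
                localizerExited = true;
            } else {
                localized.add(pending.poll());
            }
            return action;
        }

        int localizedCount() { return localized.size(); }

        /** Requests that arrived after DIE and will never be handled. */
        int strandedCount() { return pending.size(); }
    }

    /** Replays the ordering: m PRIVATE requests, a delay, then `late` more PRIVATE requests. */
    static Runner simulate(int m, int late) {
        Runner runner = new Runner();
        for (int i = 1; i <= m; i++) {
            runner.addResource("PRIVATE" + i);
        }
        // While the PUBLIC events are still being processed elsewhere, the localizer
        // drains its pending list and is then told to DIE.
        while (runner.heartbeat() == LocalizerAction.LIVE) {
            // keep heartbeating until the runner answers DIE
        }
        // The delayed PRIVATE requests arrive only now.
        for (int i = 1; i <= late; i++) {
            runner.addResource("PRIVATE" + (m + i));
        }
        return runner;
    }

    public static void main(String[] args) {
        Runner r = simulate(3, 2);
        System.out.println("localized=" + r.localizedCount()
            + " stranded=" + r.strandedCount());
    }
}
```

In this model the late requests stay in the pending list indefinitely, which corresponds to the container remaining in the LOCALIZING state until the AM kills it.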