[ https://issues.apache.org/jira/browse/YARN-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14513153#comment-14513153 ]
zhihai xu commented on YARN-3464:
---------------------------------

Thanks [~kasha] for reviewing and committing the patch, and thanks [~jlowe] for the valuable feedback.

> Race condition in LocalizerRunner kills localizer before localizing all resources
> ---------------------------------------------------------------------------------
>
>                 Key: YARN-3464
>                 URL: https://issues.apache.org/jira/browse/YARN-3464
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: zhihai xu
>            Assignee: zhihai xu
>            Priority: Critical
>             Fix For: 2.8.0
>
>         Attachments: YARN-3464.000.patch, YARN-3464.001.patch
>
>
> A race condition in LocalizerRunner causes container localization to time out.
> Currently LocalizerRunner kills the ContainerLocalizer when the pending list
> of LocalizerResourceRequestEvents is empty.
> {code}
>       } else if (pending.isEmpty()) {
>         action = LocalizerAction.DIE;
>       }
> {code}
> If a LocalizerResourceRequestEvent is added after LocalizerRunner kills the
> ContainerLocalizer because the pending list was empty, that
> LocalizerResourceRequestEvent will never be handled.
> Without a ContainerLocalizer, LocalizerRunner#update will never be called.
> The container will stay in the LOCALIZING state until it is killed by the
> AM due to TASK_TIMEOUT.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
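
For readers following the issue description, the ordering that strands a request can be reproduced with a small single-threaded model. This is only a sketch under stated assumptions: `LocalizerRaceSketch`, `addRequest`, `heartbeat`, and `hasStrandedRequest` are hypothetical names invented for illustration, not the NodeManager classes, and the real `LocalizerRunner#update` checks more conditions than the one shown here.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical, simplified model of the race described in YARN-3464.
// It is NOT the real LocalizerRunner; it only mirrors the "empty pending list => DIE" rule.
class LocalizerRaceSketch {
    enum Action { LIVE, DIE }

    private final Queue<String> pending = new ArrayDeque<>();
    private boolean localizerAlive = true;

    // Stands in for a LocalizerResourceRequestEvent being queued.
    void addRequest(String resource) {
        pending.add(resource);
    }

    // Stands in for LocalizerRunner#update, which only runs when a live localizer heartbeats.
    Action heartbeat() {
        if (!localizerAlive) {
            throw new IllegalStateException("no localizer left to send a heartbeat");
        }
        if (pending.isEmpty()) {
            localizerAlive = false; // the localizer is told to exit
            return Action.DIE;
        }
        pending.remove();           // hand one resource to the localizer
        return Action.LIVE;
    }

    // True when a request is queued but nothing will ever pick it up.
    boolean hasStrandedRequest() {
        return !localizerAlive && !pending.isEmpty();
    }

    // Replays the problematic ordering: the request arrives just after DIE was issued.
    static boolean demonstrateRace() {
        LocalizerRaceSketch runner = new LocalizerRaceSketch();
        runner.addRequest("resource-1");
        runner.heartbeat();              // LIVE: resource-1 is handed out
        runner.heartbeat();              // DIE: pending list is empty
        runner.addRequest("resource-2"); // arrives too late
        return runner.hasStrandedRequest();
    }

    public static void main(String[] args) {
        System.out.println("stranded request: " + demonstrateRace());
    }
}
```

In this model the second request stays queued forever because no further heartbeat can arrive, which corresponds to the container remaining in LOCALIZING until the AM times it out.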