[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14976928#comment-14976928 ]
Varun Saxena commented on YARN-2902:
------------------------------------

Thanks a lot [~jlowe] for the review.

I was under the incorrect impression that a downloading resource would not be picked up again by other containers. You are correct that we should not FAIL the resource here. It will be picked up by an outstanding container when the next heartbeat (HB) arrives.

If we do not call handleDownloadingRsrcsOnCleanup, we won't need to synchronize the scheduled map either. Also, event.getResource().getLocalPath() can be used here too. This would remove the need for the ScheduledResource class and hence the refactoring associated with it.

However, as the resource would not be explicitly FAILED in this case, we should probably do some cleanup when the reference count of a downloading resource becomes 0. Otherwise the entry associated with the downloading resource will remain in LocalResourcesTrackerImpl#localResourceMap, and this may show up when cache cleanup is done. We may then end up with the same log {{LOG.error("Attempt to remove resource: " + rsrc + " with non-zero refcount");}} even though the resource has been deleted on disk.

I think in LocalResourcesTrackerImpl#handle, after handling the RELEASE event, we should check whether the reference count is 0 and whether the state of the resource is DOWNLOADING. If so, call LocalResourcesTrackerImpl#removeResource.

Thoughts?
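To make the proposed check concrete, here is a minimal, self-contained sketch of the idea. It is a toy model, not the NodeManager code: the classes `ReleaseCleanupSketch`, `Resource` and `Tracker`, and the methods `request`, `markLocalized` and `release`, are hypothetical stand-ins for LocalResourcesTrackerImpl and its localResourceMap, and the real event handling and removeResource logic are assumed to be more involved.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of the proposal; these are NOT the actual Hadoop classes. */
public class ReleaseCleanupSketch {

    enum State { DOWNLOADING, LOCALIZED }

    /** Hypothetical stand-in for one tracked resource entry. */
    static final class Resource {
        State state = State.DOWNLOADING;
        int refCount;
    }

    /** Hypothetical stand-in for the tracker and its localResourceMap. */
    static final class Tracker {
        final Map<String, Resource> localResourceMap = new HashMap<>();

        /** A container requests the resource: create the entry if needed and add a reference. */
        void request(String key) {
            localResourceMap.computeIfAbsent(key, k -> new Resource()).refCount++;
        }

        /** The download finished successfully. */
        void markLocalized(String key) {
            Resource rsrc = localResourceMap.get(key);
            if (rsrc != null) {
                rsrc.state = State.LOCALIZED;
            }
        }

        /** A container releases the resource (for example, because it was killed). */
        void release(String key) {
            Resource rsrc = localResourceMap.get(key);
            if (rsrc == null) {
                return;
            }
            if (rsrc.refCount > 0) {
                rsrc.refCount--;
            }
            // Proposed extra step: an unreferenced entry that is still DOWNLOADING
            // is dropped immediately, so it cannot linger until cache cleanup.
            if (rsrc.refCount == 0 && rsrc.state == State.DOWNLOADING) {
                localResourceMap.remove(key);
            }
        }
    }

    public static void main(String[] args) {
        Tracker tracker = new Tracker();
        tracker.request("job.jar");
        tracker.request("job.jar");
        tracker.release("job.jar"); // one reference left, entry stays
        tracker.release("job.jar"); // no references, still DOWNLOADING, entry removed
        System.out.println(tracker.localResourceMap.containsKey("job.jar"));
    }
}
```

In this sketch a LOCALIZED entry whose reference count drops to 0 is deliberately left in the map, on the assumption that the normal cache cleanup scan is what retires such entries.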
> Killing a container that is localizing can orphan resources in the DOWNLOADING state
> ------------------------------------------------------------------------------------
>
>                 Key: YARN-2902
>                 URL: https://issues.apache.org/jira/browse/YARN-2902
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>    Affects Versions: 2.5.0
>            Reporter: Jason Lowe
>            Assignee: Varun Saxena
>         Attachments: YARN-2902.002.patch, YARN-2902.03.patch,
>                      YARN-2902.04.patch, YARN-2902.05.patch, YARN-2902.06.patch,
>                      YARN-2902.07.patch, YARN-2902.08.patch, YARN-2902.patch
>
>
> If a container is in the process of localizing when it is stopped/killed then
> resources are left in the DOWNLOADING state. If no other container comes
> along and requests these resources they linger around with no reference
> counts but aren't cleaned up during normal cache cleanup scans since it will
> never delete resources in the DOWNLOADING state even if their reference count
> is zero.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)