[ 
https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14976928#comment-14976928
 ] 

Varun Saxena commented on YARN-2902:
------------------------------------

Thanks a lot [~jlowe] for the review.
I was under the incorrect impression that a resource being downloaded would not
be picked up again by other containers. You are correct that we should not FAIL
the resource here; an outstanding container will pick it up when the next
heartbeat (HB) arrives. If we do not call handleDownloadingRsrcsOnCleanup, we
will not need to synchronize the scheduled map either.

Also, event.getResource().getLocalPath() can be used here. This would remove
the need for the ScheduledResource class and hence the refactoring associated
with it.

However, as the resource would not be explicitly FAILED in this case, we should
probably do some cleanup when the reference count of a downloading resource
drops to 0. Otherwise the entry associated with the downloading resource will
remain in LocalResourcesTrackerImpl#localResourceMap, and it may show up when
cache cleanup is done.
We may then end up with the same log {{LOG.error("Attempt to remove resource: "
+ rsrc + " with non-zero refcount");}} even though the resource is deleted on
disk.
I think that in LocalResourcesTrackerImpl#handle, after handling the RELEASE
event, we should check whether the reference count is 0 and the state of the
resource is DOWNLOADING. If so, we should call
LocalResourcesTrackerImpl#removeResource.
Thoughts?
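
To make the proposal concrete, here is a minimal sketch of the check after
RELEASE. It does not use the real NodeManager classes: ReleaseCleanupSketch,
Resource, State, request, localized, release and isTracked are all hypothetical
stand-ins, and the plain HashMap only plays the role of
LocalResourcesTrackerImpl#localResourceMap. The actual patch may well look
different.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified stand-in for the tracker; NOT the real Hadoop code.
class ReleaseCleanupSketch {
    enum State { DOWNLOADING, LOCALIZED }

    // Hypothetical stand-in for a tracked resource.
    static final class Resource {
        State state = State.DOWNLOADING;
        int refCount = 0;
    }

    // Plays the role of LocalResourcesTrackerImpl#localResourceMap.
    private final Map<String, Resource> localResourceMap = new HashMap<>();

    // A container requests the resource: track it and take a reference.
    void request(String key) {
        localResourceMap.computeIfAbsent(key, k -> new Resource()).refCount++;
    }

    // The download finished: the resource leaves the DOWNLOADING state.
    void localized(String key) {
        Resource r = localResourceMap.get(key);
        if (r != null) {
            r.state = State.LOCALIZED;
        }
    }

    // RELEASE handling with the proposed extra check.
    void release(String key) {
        Resource r = localResourceMap.get(key);
        if (r == null) {
            return;
        }
        if (r.refCount > 0) {
            r.refCount--;
        }
        // Proposed check: nobody references the resource any more and it is
        // still DOWNLOADING, so drop the entry instead of leaving it orphaned.
        if (r.refCount == 0 && r.state == State.DOWNLOADING) {
            localResourceMap.remove(key);
        }
    }

    boolean isTracked(String key) {
        return localResourceMap.containsKey(key);
    }
}
```

In this sketch a resource that is still referenced by another container, or
that has already been localized, stays in the map and is left to the normal
cache cleanup.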



> Killing a container that is localizing can orphan resources in the 
> DOWNLOADING state
> ------------------------------------------------------------------------------------
>
>                 Key: YARN-2902
>                 URL: https://issues.apache.org/jira/browse/YARN-2902
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>    Affects Versions: 2.5.0
>            Reporter: Jason Lowe
>            Assignee: Varun Saxena
>         Attachments: YARN-2902.002.patch, YARN-2902.03.patch, 
> YARN-2902.04.patch, YARN-2902.05.patch, YARN-2902.06.patch, 
> YARN-2902.07.patch, YARN-2902.08.patch, YARN-2902.patch
>
>
> If a container is in the process of localizing when it is stopped/killed then
> resources are left in the DOWNLOADING state.  If no other container comes
> along and requests these resources, they linger around with no reference
> counts but aren't cleaned up during normal cache cleanup scans, since the scan
> will never delete resources in the DOWNLOADING state even if their reference
> count is zero.



