[ 
https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598384#comment-14598384
 ] 

Jason Lowe commented on YARN-2902:
----------------------------------

Thanks for updating the patch, Varun!

Is one second enough time for the localizer to tear down if the system is 
heavily loaded, disks are slow, etc.?  I think it would be better for the 
executor to let us know when a localizer has completed rather than assuming 1 
second will be enough time (or too much time).  We can tackle this in a 
followup JIRA since it's a more significant change, as I'm not sure executors 
are tracking localizers today.
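
To make the suggestion concrete, here is a minimal sketch of waiting on a 
completion signal instead of a fixed one-second sleep.  LocalizerHandle, 
markFinished and awaitFinished are hypothetical names for illustration only, 
not existing executor or NM APIs, and the timeout is just a safety net rather 
than a recommended value.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public final class LocalizerCompletionExample {

    /** Hypothetical per-localizer handle; not an existing YARN class. */
    static final class LocalizerHandle {
        private final CountDownLatch finished = new CountDownLatch(1);

        /** Called by whoever observes the localizer process exiting. */
        void markFinished() {
            finished.countDown();
        }

        /**
         * Blocks until the localizer has finished or the timeout elapses.
         * Returns true only if completion was actually signalled.
         */
        boolean awaitFinished(long timeout, TimeUnit unit) {
            try {
                return finished.await(timeout, unit);
            } catch (InterruptedException e) {
                // Preserve the interrupt for the caller and report "not finished".
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        LocalizerHandle handle = new LocalizerHandle();

        // Stand-in for a localizer that takes some time to tear down.
        Thread localizer = new Thread(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            handle.markFinished();
        });
        localizer.start();

        // Cleanup proceeds as soon as teardown is signalled, not after a guess.
        boolean done = handle.awaitFinished(5, TimeUnit.SECONDS);
        System.out.println(done ? "localizer finished, safe to clean up"
                                : "timed out waiting for localizer");
        localizer.join();
    }
}
```

With a signal like this, cleanup waits no longer than the localizer actually 
needs, and a slow teardown is detected rather than silently raced.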

There are a number of sleeps in the unit test which we should try to avoid if 
possible.  Is there a reason dispatcher.await() isn't sufficient to avoid the 
races?  At a minimum there should be a comment for each one explaining what 
we're trying to avoid by sleeping.

Nit: I've always interpreted the debug delay as an extra delay, for debugging 
purposes, inserted just before the NM deletes a file.  To be consistent, it 
seems we should add the debug delay to any requested delay.  That way the NM 
will always preserve a file for debugDelay seconds _beyond_ what an NM with 
debugDelay=0 would.
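
A minimal sketch of the arithmetic I have in mind.  The class and method 
names are hypothetical, and clamping negative inputs to zero is a 
simplification of this sketch; it does not model any special meaning the NM 
may give to particular debug delay values.

```java
public final class DebugDelayExample {

    private DebugDelayExample() {
    }

    /**
     * Hypothetical helper: total seconds to wait before deleting a file.
     * The debug delay is added on top of the requested delay rather than
     * replacing it.
     */
    public static long effectiveDelaySeconds(long requestedDelaySeconds,
                                             long debugDelaySeconds) {
        // Simplification for this sketch: treat negative inputs as zero.
        long requested = Math.max(0L, requestedDelaySeconds);
        long debug = Math.max(0L, debugDelaySeconds);
        return requested + debug;
    }

    public static void main(String[] args) {
        // With debugDelay=0 the file is kept only for the requested delay.
        System.out.println(effectiveDelaySeconds(1, 0));   // prints 1
        // With debugDelay=600 the same file is kept 600 seconds longer.
        System.out.println(effectiveDelaySeconds(1, 600)); // prints 601
    }
}
```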

Nit: The TODO in DeletionService about parent being owned by NM, etc. probably 
only needs to be in the delete method that actually does the work rather than 
duplicated in veneer methods.

Nit: Should "Container killed while downloading" be "Container killed while 
localizing"?  We use localizing elsewhere (e.g.: NM log UI when trying to get 
logs of a container that is still localizing).

Nit: "Inorrect path for PRIVATE localization." should be "Incorrect path for 
PRIVATE localization: " to fix the typo and add a trailing space before the 
subsequent filename.  The next log message is missing a trailing space as 
well.  I realize this is a pre-existing bug, but it would be nice to fix it as 
part of moving the code.



> Killing a container that is localizing can orphan resources in the 
> DOWNLOADING state
> ------------------------------------------------------------------------------------
>
>                 Key: YARN-2902
>                 URL: https://issues.apache.org/jira/browse/YARN-2902
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>    Affects Versions: 2.5.0
>            Reporter: Jason Lowe
>            Assignee: Varun Saxena
>         Attachments: YARN-2902.002.patch, YARN-2902.03.patch, 
> YARN-2902.04.patch, YARN-2902.patch
>
>
> If a container is in the process of localizing when it is stopped/killed then 
> resources are left in the DOWNLOADING state.  If no other container comes 
> along and requests these resources they linger around with no reference 
> counts but aren't cleaned up during normal cache cleanup scans since it will 
> never delete resources in the DOWNLOADING state even if their reference count 
> is zero.


