[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598384#comment-14598384 ]
Jason Lowe commented on YARN-2902:
----------------------------------

Thanks for updating the patch, Varun!

Is one second enough time for the localizer to tear down if the system is heavily loaded, disks are slow, etc.? I think it would be better for the executor to let us know when a localizer has completed rather than assuming 1 second will be enough time (or too much time). We can tackle this in a followup JIRA since it's a more significant change, as I'm not sure executors are tracking localizers today.

There are a number of sleeps in the unit test which we should try to avoid if possible. Is there a reason dispatcher.await() isn't sufficient to avoid the races? At a minimum there should be a comment for each one explaining what we're trying to avoid by sleeping.

Nit: I've always interpreted the debug delay to be a delay to execute in debugging just before the NM deletes a file. To be consistent it seems that we should be adding the debug delay to any requested delay. That way the NM will always preserve a file for debugDelay seconds _beyond_ what an NM with debugDelay=0 seconds would do.

Nit: The TODO in DeletionService about parent being owned by NM, etc. probably only needs to be in the delete method that actually does the work rather than duplicated in veneer methods.

Nit: Should "Container killed while downloading" be "Container killed while localizing"? We use localizing elsewhere (e.g.: NM log UI when trying to get logs of a container that is still localizing).

Nit: "Inorrect path for PRIVATE localization." should be "Incorrect path for PRIVATE localization: " to fix typo and add trailing space for subsequent filename. Missing a trailing space on the next log message as well. Realize this was just a pre-existing bug, but it would be nice to fix as part of moving the code.
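
To illustrate the debug delay nit above, here is a minimal sketch of the additive behavior being suggested. The class and method names are hypothetical and are not taken from DeletionService or the patch; the only assumption carried over from the comment is that the debug delay is added to whatever delay the caller requested.

```java
public class DebugDelaySketch {

    /**
     * Hypothetical helper (not the actual NM code): returns how long a
     * deletion should wait when the debug delay is added on top of the
     * delay the caller requested.
     */
    static long effectiveDelaySeconds(long requestedDelaySeconds,
                                      long debugDelaySeconds) {
        if (requestedDelaySeconds < 0 || debugDelaySeconds < 0) {
            throw new IllegalArgumentException("delays must be non-negative");
        }
        // Additive, so a file always survives debugDelaySeconds longer
        // than it would with a debug delay of zero.
        return Math.addExact(requestedDelaySeconds, debugDelaySeconds);
    }

    public static void main(String[] args) {
        // Requested 1s teardown wait, no debug delay configured.
        System.out.println(effectiveDelaySeconds(1, 0));
        // Same request with a 600s debug delay configured.
        System.out.println(effectiveDelaySeconds(1, 600));
    }
}
```

With this shape, the difference between the two results is always exactly the debug delay, regardless of the requested delay, which is the consistency property the nit asks for.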
> Killing a container that is localizing can orphan resources in the DOWNLOADING state
> ------------------------------------------------------------------------------------
>
>                 Key: YARN-2902
>                 URL: https://issues.apache.org/jira/browse/YARN-2902
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>    Affects Versions: 2.5.0
>            Reporter: Jason Lowe
>            Assignee: Varun Saxena
>         Attachments: YARN-2902.002.patch, YARN-2902.03.patch, YARN-2902.04.patch, YARN-2902.patch
>
>
> If a container is in the process of localizing when it is stopped/killed then
> resources are left in the DOWNLOADING state. If no other container comes
> along and requests these resources they linger around with no reference
> counts but aren't cleaned up during normal cache cleanup scans since it will
> never delete resources in the DOWNLOADING state even if their reference count
> is zero.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)