[ 
https://issues.apache.org/jira/browse/YARN-7644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16642576#comment-16642576
 ] 

Jason Lowe commented on YARN-7644:
----------------------------------

Thanks for the patch!  I'm a little concerned about the container executor lock 
that was added in YARN-8160 and used as part of this patch.  IIUC the 
launchContainer method for the executor is a synchronous, blocking call that 
won't return until the container completes.  For example, see 
DefaultContainerExecutor#launchContainer where it invokes 
Shell.CommandExecutor#execute.  That means the executor lock would be held 
continuously while the container is running.  Therefore I'm not sure how the 
thread running ContainerLaunch#reapContainer is going to obtain the executor 
lock in order to kill the container.  Seems like it would just hang, but maybe 
I'm missing something (see the sketch below).  This may be more of an issue 
with YARN-8160 than with this patch, since this patch mostly just refactors 
existing code into a ContainerCleanup class.  To be honest I'm not quite sure 
what the purpose of the lock is, since there are many places we invoke the 
executor without holding it, like deactivating and signalling.  The use of the 
lock seems inconsistent if it's supposed to guard invocations of the 
executor.
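
To illustrate the hang I'm worried about, here is a minimal sketch.  The names 
(executorLock, runContainerUntilExit, signalContainer) are hypothetical and do 
not mirror the actual patch code; it only shows the general pattern of holding 
a lock across a blocking call while a second thread needs the same lock to 
interrupt it:

{code:java}
import java.util.concurrent.locks.ReentrantLock;

// Simplified illustration only; names are hypothetical and do not mirror the
// actual ContainerLaunch/ContainerCleanup code.
public class ExecutorLockHang {
  private final ReentrantLock executorLock = new ReentrantLock();

  // Launch thread: holds the lock for the lifetime of the container, because
  // the executor call blocks until the container exits.
  void launchContainer() throws InterruptedException {
    executorLock.lock();
    try {
      runContainerUntilExit();   // blocking, possibly for a very long time
    } finally {
      executorLock.unlock();
    }
  }

  // Cleanup thread: needs the same lock to signal/kill the container, so it
  // blocks here until launchContainer() returns -- which it won't, because
  // the container is still running.  The kill never happens.
  void reapContainer() {
    executorLock.lock();
    try {
      signalContainer();
    } finally {
      executorLock.unlock();
    }
  }

  private void runContainerUntilExit() throws InterruptedException {
    Thread.sleep(Long.MAX_VALUE);  // stand-in for waiting on container exit
  }

  private void signalContainer() {
    // stand-in for the executor's signal/kill path
  }
}
{code}

If the real lock behaves like this, the cleanup path can only run after the 
container exits on its own, which defeats the purpose of the kill.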

Nit: It feels a little off for ContainerCleanup to reach into fields directly, 
e.g.: launch.pidFilePath, launch.containerAlreadyLaunched, launch.completed, 
etc.  It made more sense when this code was part of ContainerLaunch, because 
the fields were private and nothing outside the implementation details of 
ContainerLaunch needed to know about them.  Now there's another, separate class that is
reaching in to grab all these fields.  Seems like either ContainerCleanup 
should be a private static class of ContainerLaunch or there should be 
accessors so we can keep these fields private.
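
For the accessor route, something like the following sketch would keep the 
fields private.  The field names come from the patch, the types are assumed, 
and the package-private getters are hypothetical:

{code:java}
import java.util.concurrent.atomic.AtomicBoolean;

import org.apache.hadoop.fs.Path;

// Sketch only: field names come from the patch, types are assumed, and the
// package-private getters are hypothetical.
public class ContainerLaunch {
  private Path pidFilePath;
  private final AtomicBoolean containerAlreadyLaunched = new AtomicBoolean(false);
  private final AtomicBoolean completed = new AtomicBoolean(false);

  // ContainerCleanup would call these instead of reading the fields directly.
  Path getPidFilePath() {
    return pidFilePath;
  }

  boolean isLaunched() {
    return containerAlreadyLaunched.get();
  }

  boolean isCompleted() {
    return completed.get();
  }
}
{code}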

ContainerLaunch.sleepDelayBeforeSigKill should be final, like the other 
properties.  The field initializer of 250 will always be clobbered by the 
constructor anyway.
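
Roughly what I had in mind (a sketch only; the constructor signature is 
abbreviated, and the config constants are what I believe YarnConfiguration 
provides for this setting):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Sketch only; the constructor signature is abbreviated.
public class ContainerLaunch {
  // final, like the other configuration-derived fields; no initializer needed
  private final long sleepDelayBeforeSigKill;

  public ContainerLaunch(Configuration conf /* , ... */) {
    // The constructor always assigns this from configuration, so a field
    // initializer of 250 would be dead code anyway.
    this.sleepDelayBeforeSigKill = conf.getLong(
        YarnConfiguration.NM_SLEEP_DELAY_BEFORE_SIGKILL_MS,
        YarnConfiguration.DEFAULT_NM_SLEEP_DELAY_BEFORE_SIGKILL_MS);
  }
}
{code}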


> NM gets backed up deleting docker containers
> --------------------------------------------
>
>                 Key: YARN-7644
>                 URL: https://issues.apache.org/jira/browse/YARN-7644
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>            Reporter: Eric Badger
>            Assignee: Chandni Singh
>            Priority: Major
>              Labels: Docker
>         Attachments: YARN-7644.001.patch, YARN-7644.002.patch
>
>
> We are sending a {{docker stop}} to the docker container with a timeout of 10 
> seconds when we shut down a container. If the container does not stop after 
> 10 seconds then we force kill it. However, the {{docker stop}} command is a 
> blocking call. So in cases where lots of containers don't go down with the 
> initial SIGTERM, we have to wait 10+ seconds for the {{docker stop}} to 
> return. This ties up the ContainerLaunch handler and so these kill events 
> back up. It also appears to be backing up new container launches as well. 


