[ https://issues.apache.org/jira/browse/MESOS-2367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14371846#comment-14371846 ]
Ian Downes commented on MESOS-2367:
-----------------------------------

This is similar to what I'm proposing, but it skirts the real issue: how do we handle orphans that cannot be destroyed? That is, what does the containerizer do with the orphans? (3) says it destroys them, but that ultimately calls the same code that is failing to destroy a container now.

> Improve slave resiliency in the face of orphan containers
> ----------------------------------------------------------
>
> Key: MESOS-2367
> URL: https://issues.apache.org/jira/browse/MESOS-2367
> Project: Mesos
> Issue Type: Bug
> Components: slave
> Reporter: Joe Smith
> Priority: Critical
>
> Right now there's a case where a misbehaving executor can cause a slave
> process to flap:
> {panel:title=Quote From [~jieyu]}
> {quote}
> 1) User tries to kill an instance
> 2) Slave sends {{KillTaskMessage}} to executor
> 3) Executor sends kill signals to task processes
> 4) Executor sends {{TASK_KILLED}} to slave
> 5) Slave updates container cpu limit to be 0.01 cpus
> 6) A user-process is still processing the kill signal
> 7) the task process cannot exit since it has too little cpu share and is
> throttled
> 8) Executor itself terminates
> 9) Slave tries to destroy the container, but cannot because the user-process
> is stuck in the exit path.
> 10) Slave restarts, and is constantly flapping because it cannot kill orphan
> containers
> {quote}
> {panel}
> The slave's orphan container handling should be improved to deal with this
> case despite ill-behaved users (framework writers).
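
To make the flapping in steps 9 and 10 concrete, here is a toy simulation of it. It is a sketch only, not Mesos code: `Container`, `destroy`, `recover_strict`, and `recover_tolerant` are hypothetical names invented for illustration, and the "tolerant" variant is just one assumed alternative. It still leaves open what to do with the containers it could not destroy, which is the question raised in the comment.

```python
from dataclasses import dataclass


@dataclass
class Container:
    """Hypothetical stand-in for an orphan container; not a Mesos type."""
    container_id: str
    stuck_in_exit_path: bool = False  # models the throttled process of steps 7-9


class DestroyFailed(Exception):
    """Raised when a container cannot be destroyed."""


def destroy(container):
    """Pretend to destroy a container; always fails for a stuck one."""
    if container.stuck_in_exit_path:
        raise DestroyFailed(container.container_id)


def recover_strict(orphans):
    """Models the reported behaviour: one failed destroy aborts recovery."""
    for container in orphans:
        destroy(container)  # DestroyFailed propagates, so the slave would exit


def recover_tolerant(orphans):
    """Assumed alternative: try every orphan, return the IDs left behind."""
    undestroyable = []
    for container in orphans:
        try:
            destroy(container)
        except DestroyFailed:
            undestroyable.append(container.container_id)
    return undestroyable


def count_failed_restarts(orphans, attempts):
    """Simulate step 10: every restart re-runs recovery and fails again."""
    failures = 0
    for _ in range(attempts):
        try:
            recover_strict(orphans)
        except DestroyFailed:
            failures += 1
    return failures


orphans = [Container("a"), Container("b", stuck_in_exit_path=True)]
failed_restarts = count_failed_restarts(orphans, attempts=3)
left_behind = recover_tolerant(orphans)
```

In this model strict recovery fails on every restart, because nothing about the stuck container changes between attempts. The tolerant variant lets recovery finish, but it only reports the leftover containers; it does not resolve them.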