[ 
https://issues.apache.org/jira/browse/MESOS-2367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14371846#comment-14371846
 ] 

Ian Downes commented on MESOS-2367:
-----------------------------------

This is similar to what I'm proposing, but it skirts the real issue: how do we 
handle orphans that cannot be destroyed? That is, what does the containerizer do 
with the orphans? (3) says it destroys them, but that ultimately calls the same 
code that is failing to destroy a container now.
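
To make the concern concrete, here is a minimal sketch in Python. It is not 
Mesos code: the names Container, destroy and recover_orphans are hypothetical 
and only model the shape of the problem. If orphan cleanup reuses the same 
destroy path, a container that could not be destroyed before the restart still 
cannot be destroyed afterwards.

```python
class Container:
    """Hypothetical stand-in for a container tracked by the slave."""

    def __init__(self, container_id, stuck=False):
        self.container_id = container_id
        # stuck=True models a process throttled in its exit path.
        self.stuck = stuck


def destroy(container):
    """Hypothetical destroy path shared by normal teardown and orphan cleanup."""
    if container.stuck:
        raise RuntimeError("cannot destroy " + container.container_id)
    return True


def recover_orphans(orphans):
    """Orphan cleanup on recovery: it calls destroy(), so it fails the same way."""
    failed = []
    for container in orphans:
        try:
            destroy(container)
        except RuntimeError:
            # The open question: what should happen to these containers?
            failed.append(container.container_id)
    return failed
```

Whatever is done with the ids returned by recover_orphans is exactly the part 
that still needs an answer.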

> Improve slave resiliency in the face of orphan containers 
> ----------------------------------------------------------
>
>                 Key: MESOS-2367
>                 URL: https://issues.apache.org/jira/browse/MESOS-2367
>             Project: Mesos
>          Issue Type: Bug
>          Components: slave
>            Reporter: Joe Smith
>            Priority: Critical
>
> Right now there's a case where a misbehaving executor can cause a slave 
> process to flap:
> {panel:title=Quote From [~jieyu]}
> {quote}
> 1) User tries to kill an instance
> 2) Slave sends {{KillTaskMessage}} to executor
> 3) Executor sends kill signals to task processes
> 4) Executor sends {{TASK_KILLED}} to slave
> 5) Slave updates container cpu limit to be 0.01 cpus
> 6) A user process is still processing the kill signal
> 7) The task process cannot exit since it has too little CPU share and is 
> throttled
> 8) Executor itself terminates
> 9) Slave tries to destroy the container, but cannot because the user-process 
> is stuck in the exit path.
> 10) Slave restarts, and is constantly flapping because it cannot kill orphan 
> containers
> {quote}
> {panel}
> The slave's orphan container handling should be improved to deal with this 
> case despite ill-behaved users (framework writers).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
