[ https://issues.apache.org/jira/browse/MESOS-2367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14370808#comment-14370808 ]
Ian Downes commented on MESOS-2367:
-----------------------------------

My opinion is that we should not change the contract between the launcher and the isolators: Isolator::cleanup will only be called when all processes in the container have been terminated. Why?

1. The single launcher is responsible for container process lifetime; the multiple isolators are responsible for isolating those processes.
2. Many isolators cannot complete cleanup until all processes are destroyed, in which case they would all be attempting the same thing the launcher is already doing; what else could the isolators do differently?
3. Isolators are ordered arbitrarily and called concurrently, so there is no way to ensure that, for example, the cpu isolator is called first.

My suggestion is that we do the following:

1. Make orphan cleanup failures non-fatal so the slave will start and we regain control over running tasks.
2. Add a counter for the number of containers that failed to be destroyed, separately counting those that fail on a normal destroy and those orphans that fail to be destroyed. Operators can monitor these counters and act appropriately.
3. Extend the launcher destroy code to handle the case described here, where processes are terminating (unmapping pages) but not making progress because of the very low cpu quota (the minimum is 0.01 cpus). If cgroup::destroy() timed out, the launcher would examine the process's cgroup (/proc/[pid]/cgroup), increase the cpu quota to something like 0.5 or 1.0 cpus, and try again. This is a workaround and it goes around the cpu isolator, but I don't see a cleaner way to do it. The case that I triaged had a JVM process with 16 GB of anonymous pages to unmap, and it took around 16 seconds once the cpu quota was increased. I expect one or two additional attempts at terminating the processes and calling cgroup::destroy() to be successful in all but the most extreme cases of this scenario.
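The quota-bump retry in suggestion 3 could be sketched as below. This is a hypothetical, standalone illustration, not Mesos code: `parseCpuCgroup` and `quotaForCpus` are invented helper names, and the 100000 us default cfs period is the kernel's standard cpu.cfs_period_us, an assumption here.

```cpp
// Hypothetical sketch of the quota-bump retry from suggestion 3; helper
// names and the retry policy are assumptions, not actual Mesos code.
#include <sstream>
#include <string>

// Parse /proc/[pid]/cgroup contents (cgroup v1 lines of the form
// "hierarchy-id:subsystems:path") and return the path of the cgroup
// attached to the "cpu" subsystem, or "" if none is found.
std::string parseCpuCgroup(const std::string& contents) {
  std::istringstream in(contents);
  std::string line;
  while (std::getline(in, line)) {
    size_t first = line.find(':');
    size_t second = line.find(':', first + 1);
    if (first == std::string::npos || second == std::string::npos) {
      continue;
    }
    // Subsystems are comma-separated, e.g. "cpu,cpuacct".
    std::istringstream subsystems(line.substr(first + 1, second - first - 1));
    std::string subsystem;
    while (std::getline(subsystems, subsystem, ',')) {
      if (subsystem == "cpu") {
        return line.substr(second + 1);
      }
    }
  }
  return "";
}

// CFS quota in microseconds per period that grants `cpus` cpus,
// assuming the default 100000 us cpu.cfs_period_us.
long quotaForCpus(double cpus, long periodUs = 100000) {
  return static_cast<long>(cpus * periodUs);
}
```

On a cgroup::destroy() timeout, a launcher could then locate the stuck process's cpu cgroup via `parseCpuCgroup` and write `quotaForCpus(1.0)` into that cgroup's cpu.cfs_quota_us before retrying the destroy, raising the throttled 0.01 cpus to a full cpu so the exit path can make progress.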
Regardless of success (there are other potential failure modes), (1) and (2) would enable the slave to come back up and to alert operators.

> Improve slave resiliency in the face of orphan containers
> ----------------------------------------------------------
>
>                 Key: MESOS-2367
>                 URL: https://issues.apache.org/jira/browse/MESOS-2367
>             Project: Mesos
>          Issue Type: Bug
>          Components: slave
>            Reporter: Joe Smith
>            Priority: Critical
>
> Right now there's a case where a misbehaving executor can cause a slave
> process to flap:
> {panel:title=Quote From [~jieyu]}
> {quote}
> 1) User tries to kill an instance
> 2) Slave sends {{KillTaskMessage}} to executor
> 3) Executor sends kill signals to task processes
> 4) Executor sends {{TASK_KILLED}} to slave
> 5) Slave updates container cpu limit to be 0.01 cpus
> 6) A user process is still processing the kill signal
> 7) The task process cannot exit since it has too little cpu share and is throttled
> 8) Executor itself terminates
> 9) Slave tries to destroy the container, but cannot because the user process is stuck in the exit path
> 10) Slave restarts, and is constantly flapping because it cannot kill orphan containers
> {quote}
> {panel}
> The slave's orphan container handling should be improved to deal with this
> case despite ill-behaved users (framework writers).

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)