[ https://issues.apache.org/jira/browse/MESOS-2367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14370808#comment-14370808 ]
Ian Downes commented on MESOS-2367:
-----------------------------------

My opinion is that we should not change the contract between the launcher and the isolators: Isolator::cleanup will only be called when all processes in the container have been terminated. Why?

1. The single launcher is responsible for container process lifetime; the multiple isolators are responsible for isolating those processes.
2. Many isolators cannot complete cleanup until all processes are destroyed, in which case they would all be attempting the same thing the launcher is already doing; what else could the isolators do differently?
3. Isolators are ordered arbitrarily and called concurrently, so there is no way to ensure that, for example, the cpu isolator is called first.

My suggestion is that we do the following:

1. Make orphan cleanup failures non-fatal so the slave will start and we regain control over running tasks.
2. Add a counter for the number of containers that failed to be destroyed, separately counting those that fail on a normal destroy and those orphans that fail to be destroyed. Operators can monitor these counters and act appropriately.
3. Extend the launcher destroy code to handle the case described here, where processes are terminating (unmapping pages) but not making progress because of the very low cpu quota (the minimum is 0.01 cpus). If cgroup::destroy() timed out, the launcher would examine the process's cgroup (/proc/[pid]/cgroup), increase the cpu quota to something like 0.5 or 1.0 cpus, and try again. This is a workaround and it goes around the cpu isolator, but I don't see a cleaner way to do it. The case that I triaged had a JVM process with 16 GB of anonymous pages to unmap, and it took around 16 seconds once the cpu quota was increased. I expect one or two additional attempts at terminating the processes and calling cgroup::destroy() to be successful in all but the most extreme cases of this scenario.
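The quota-bump retry in suggestion 3 could be sketched as below. This is a hypothetical, standalone illustration, not Mesos code: `parseCpuCgroup` and `quotaForCpus` are invented helper names, and the 100000 us default cfs period is the kernel's standard cpu.cfs_period_us, an assumption here.

```cpp
// Hypothetical sketch of the quota-bump retry from suggestion 3; helper
// names and the retry policy are assumptions, not actual Mesos code.
#include <sstream>
#include <string>

// Parse /proc/[pid]/cgroup contents (cgroup v1 lines of the form
// "hierarchy-id:subsystems:path") and return the path of the cgroup
// attached to the "cpu" subsystem, or "" if none is found.
std::string parseCpuCgroup(const std::string& contents) {
  std::istringstream in(contents);
  std::string line;
  while (std::getline(in, line)) {
    size_t first = line.find(':');
    size_t second = line.find(':', first + 1);
    if (first == std::string::npos || second == std::string::npos) {
      continue;
    }
    // Subsystems are comma-separated, e.g. "cpu,cpuacct".
    std::istringstream subsystems(line.substr(first + 1, second - first - 1));
    std::string subsystem;
    while (std::getline(subsystems, subsystem, ',')) {
      if (subsystem == "cpu") {
        return line.substr(second + 1);
      }
    }
  }
  return "";
}

// CFS quota in microseconds per period that grants `cpus` cpus,
// assuming the default 100000 us cpu.cfs_period_us.
long quotaForCpus(double cpus, long periodUs = 100000) {
  return static_cast<long>(cpus * periodUs);
}
```

On a cgroup::destroy() timeout, a launcher could then locate the stuck process's cpu cgroup via `parseCpuCgroup` and write `quotaForCpus(1.0)` into that cgroup's cpu.cfs_quota_us before retrying the destroy, raising the throttled 0.01 cpus to a full cpu so the exit path can make progress.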
Regardless of success (there are other potential failure modes), (1) and (2) would enable the slave to come back up and to alert operators.

> Improve slave resiliency in the face of orphan containers
> ----------------------------------------------------------
>
>                 Key: MESOS-2367
>                 URL: https://issues.apache.org/jira/browse/MESOS-2367
>             Project: Mesos
>          Issue Type: Bug
>          Components: slave
>            Reporter: Joe Smith
>            Priority: Critical
>
> Right now there's a case where a misbehaving executor can cause a slave
> process to flap:
> {panel:title=Quote From [~jieyu]}
> {quote}
> 1) User tries to kill an instance
> 2) Slave sends {{KillTaskMessage}} to executor
> 3) Executor sends kill signals to task processes
> 4) Executor sends {{TASK_KILLED}} to slave
> 5) Slave updates container cpu limit to be 0.01 cpus
> 6) A user process is still processing the kill signal
> 7) The task process cannot exit since it has too little cpu share and is throttled
> 8) Executor itself terminates
> 9) Slave tries to destroy the container, but cannot because the user process is stuck in the exit path
> 10) Slave restarts, and is constantly flapping because it cannot kill orphan containers
> {quote}
> {panel}
> The slave's orphan container handling should be improved to deal with this
> case despite ill-behaved users (framework writers).

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)