[ https://issues.apache.org/jira/browse/MESOS-9940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16962162#comment-16962162 ]
Benjamin Bannier commented on MESOS-9940:
-----------------------------------------

[~greggomann], let's put this back into the backlog for now and reestimate it.

> Framework removal may lead to inconsistent task states between master and
> agent.
> --------------------------------------------------------------------------------
>
>                 Key: MESOS-9940
>                 URL: https://issues.apache.org/jira/browse/MESOS-9940
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>            Reporter: Meng Zhu
>            Assignee: Benjamin Bannier
>            Priority: Major
>              Labels: foundations
>
> When a framework is removed from the master (say, due to disconnection), the
> master sends a `ShutdownFrameworkMessage` to the agent. At the same time, the
> master transitions the task status to a terminal state, e.g. KILLED.
> (https://github.com/apache/mesos/blob/master/src/master/master.cpp#L11247-L11291)
> When the agent receives the shutdown message, it tries to shut down all the
> executors and destroy all the containers. The tasks' status is updated only
> after all of this is done.
> (https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L7914-L7922)
> However, if the executor shutdown gets stuck (e.g. due to a hanging docker
> daemon), the task status transition never happens, and the master and agent
> end up with diverging views of these tasks.
> One consequence is that the master may try to schedule more workloads onto the
> problematic agent, because it thinks those task resources have been freed.
> Since there is no overcommit check on the agent, the agent will comply and
> launch those tasks. This leads to over-allocation.
> One possible solution is to hold off on the master's status update until the
> agent is done with the framework shutdown.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
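
The divergence described in the issue can be illustrated with a toy model. This is only a sketch and not Mesos code: the `Master`, `Agent`, `Task`, and `simulate` names, the single-CPU resource accounting, and the 4-CPU agent are all assumptions made up for illustration. Only the task state names and the sequence of events (master marks tasks terminal immediately, agent updates only after shutdown completes, agent performs no overcommit check) follow the description above.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    cpus: float
    state: str = "TASK_RUNNING"


def used_cpus(tasks):
    """CPUs held by tasks that this view still considers running."""
    return sum(t.cpus for t in tasks.values() if t.state == "TASK_RUNNING")


@dataclass
class Master:
    """Hypothetical master-side view of a single agent."""
    agent_cpus: float
    tasks: dict = field(default_factory=dict)

    def remove_framework(self):
        # The master marks tasks terminal right away, without waiting
        # for the agent to finish shutting the framework down.
        for task in self.tasks.values():
            task.state = "TASK_KILLED"

    def free_cpus(self):
        return self.agent_cpus - used_cpus(self.tasks)


@dataclass
class Agent:
    """Hypothetical agent; `shutdown_stuck` models a hanging executor shutdown."""
    cpus: float
    shutdown_stuck: bool
    tasks: dict = field(default_factory=dict)

    def shutdown_framework(self):
        if self.shutdown_stuck:
            return  # shutdown never completes, so task states never change
        for task in self.tasks.values():
            task.state = "TASK_KILLED"

    def launch(self, task_id, cpus):
        # No overcommit check: the agent launches whatever it is told to.
        self.tasks[task_id] = Task(cpus)


def simulate(shutdown_stuck):
    master = Master(agent_cpus=4.0, tasks={"t1": Task(4.0)})
    agent = Agent(cpus=4.0, shutdown_stuck=shutdown_stuck, tasks={"t1": Task(4.0)})

    master.remove_framework()
    agent.shutdown_framework()

    # The master schedules new work based on its own view of free resources.
    offered = master.free_cpus()
    master.tasks["t2"] = Task(offered)
    agent.launch("t2", offered)

    return {
        "master_t1": master.tasks["t1"].state,
        "agent_t1": agent.tasks["t1"].state,
        "agent_used": used_cpus(agent.tasks),
        "agent_cpus": agent.cpus,
    }
```

Calling `simulate(shutdown_stuck=True)` leaves the two views of `t1` disagreeing and the agent running more CPUs than it has, while `simulate(shutdown_stuck=False)` keeps both views consistent. Under the proposed fix, `remove_framework` would not change task state until the agent reports that the shutdown has finished.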