Meng Zhu created MESOS-9940:
-------------------------------
Summary: Framework removal may lead to inconsistent task states
between master and agent.
Key: MESOS-9940
URL: https://issues.apache.org/jira/browse/MESOS-9940
Project: Mesos
Issue Type: Bug
Components: master
Reporter: Meng Zhu
When a framework is removed from the master (say, due to disconnection), the
master sends a `ShutdownFrameworkMessage` to the agent. At the same time, the
master transitions the framework's tasks to a terminal state, e.g. KILLED.
(https://github.com/apache/mesos/blob/master/src/master/master.cpp#L11247-L11291)
When the agent receives the shutdown message, it tries to shut down all of the
framework's executors and destroy all of their containers. The tasks' statuses
are updated only after all of this is done.
(https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L7914-L7922)
However, if the executor shutdown gets stuck (e.g. due to a hanging Docker
daemon), the task status transition never happens on the agent, and the master
and agent end up with diverged views of these tasks.
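
To make the divergence concrete, here is a minimal toy model of the two code
paths. It is only a sketch: `MasterView`, `AgentView`, `diverged` and the
two-value `State` enum are hypothetical names invented for illustration and are
not taken from the Mesos sources.

```cpp
#include <map>
#include <string>

// Hypothetical, simplified task states; real Mesos uses the TaskState enum.
enum class State { RUNNING, KILLED };

// Toy stand-in for the master's view: tasks are marked KILLED as soon as
// the framework is removed, without waiting for the agent.
struct MasterView {
  std::map<std::string, State> tasks;

  void removeFramework() {
    for (auto& entry : tasks) {
      entry.second = State::KILLED;
    }
  }
};

// Toy stand-in for the agent's view: tasks are only updated once executor
// shutdown has completed.
struct AgentView {
  std::map<std::string, State> tasks;
  bool executorShutdownHangs = false;  // e.g. a hanging Docker daemon

  void shutdownFramework() {
    if (executorShutdownHangs) {
      return;  // Shutdown never finishes, so no task is ever updated.
    }
    for (auto& entry : tasks) {
      entry.second = State::KILLED;
    }
  }
};

// Returns true if the two views disagree on any task the master knows about.
bool diverged(const MasterView& master, const AgentView& agent) {
  for (const auto& entry : master.tasks) {
    auto it = agent.tasks.find(entry.first);
    if (it == agent.tasks.end() || it->second != entry.second) {
      return true;
    }
  }
  return false;
}
```

With `executorShutdownHangs` set, the master ends up at KILLED while the agent
stays at RUNNING, so `diverged` returns true.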
One consequence is that the master may try to schedule more workloads onto the
problematic agent (because it thinks those tasks' resources have been freed).
Since we do not have an overcommit check on the agent, the agent will comply
and launch those tasks. This leads to over-allocation.
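
The over-allocation can be illustrated with a small sketch of what an
agent-side overcommit check might look like. Everything here is an assumption
made for illustration: `Res`, `Agent`, `fits`, `launchChecked` and
`launchUnchecked` are hypothetical names, and resources are reduced to two
scalars rather than the real Mesos `Resources` type.

```cpp
#include <vector>

// Hypothetical scalar resource bundle; real Mesos uses the Resources class.
struct Res {
  double cpus;
  double memMb;
};

// Toy agent that tracks which tasks still hold resources locally.
struct Agent {
  Res total;
  std::vector<Res> held;  // Resources of tasks the agent has not yet released.

  // Sum of everything the agent itself still considers in use.
  Res used() const {
    Res sum{0.0, 0.0};
    for (const Res& r : held) {
      sum.cpus += r.cpus;
      sum.memMb += r.memMb;
    }
    return sum;
  }

  // Hypothetical overcommit check: a launch fits only if it can run next to
  // what the agent still holds, regardless of what the master believes.
  bool fits(const Res& request) const {
    const Res u = used();
    return u.cpus + request.cpus <= total.cpus &&
           u.memMb + request.memMb <= total.memMb;
  }

  // Launch with the check enabled; returns false if the task is rejected.
  bool launchChecked(const Res& request) {
    if (!fits(request)) {
      return false;
    }
    held.push_back(request);
    return true;
  }

  // Launch without any check, mirroring the behaviour described above.
  void launchUnchecked(const Res& request) {
    held.push_back(request);
  }
};
```

If a stuck task still holds all 4 CPUs of a 4-CPU agent, `launchChecked`
rejects a new 2-CPU task, while `launchUnchecked` accepts it and leaves the
agent with 6 CPUs in use.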
One possible solution is to hold off on the master's status update until the
agent is done with the framework shutdown.
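
A rough sketch of that idea, under stated assumptions: the master keeps the
tasks in an intermediate state and only frees their resources once the agent
reports that the framework shutdown finished. The `KILL_PENDING` state, the
`frameworkShutdownCompleted` callback and the `Master` and `Task` structs are
all hypothetical; no such state or message is confirmed to exist in Mesos.

```cpp
#include <map>
#include <string>

// KILL_PENDING is a hypothetical intermediate state used only in this sketch.
enum class State { RUNNING, KILL_PENDING, KILLED };

struct Task {
  std::string framework;
  double cpus;
  State state;
};

// Toy master tracking the tasks of a single agent.
struct Master {
  double agentCpus;
  std::map<std::string, Task> tasks;  // Keyed by task id.

  // Step 1: the framework is removed. The shutdown message would be sent to
  // the agent here (omitted); tasks are only marked as pending.
  void removeFramework(const std::string& framework) {
    for (auto& entry : tasks) {
      Task& task = entry.second;
      if (task.framework == framework && task.state == State::RUNNING) {
        task.state = State::KILL_PENDING;
      }
    }
  }

  // Step 2: hypothetical notification that the agent finished shutting the
  // framework down. Only now do the tasks become terminal.
  void frameworkShutdownCompleted(const std::string& framework) {
    for (auto& entry : tasks) {
      Task& task = entry.second;
      if (task.framework == framework && task.state == State::KILL_PENDING) {
        task.state = State::KILLED;
      }
    }
  }

  // Resources of pending tasks still count as used, so the master does not
  // offer them while the agent may still be holding them.
  double availableCpus() const {
    double used = 0.0;
    for (const auto& entry : tasks) {
      if (entry.second.state != State::KILLED) {
        used += entry.second.cpus;
      }
    }
    return agentCpus - used;
  }
};
```

One trade-off of this approach is that a permanently stuck agent would keep the
tasks pending indefinitely, so some timeout or operator intervention would
probably still be needed.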