[ https://issues.apache.org/jira/browse/MESOS-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16415612#comment-16415612 ]
Benno Evers commented on MESOS-1466:
If I understand the issue correctly, this race seems to have been eliminated as
a side effect of introducing the `launch_executor` flag in Mesos 1.5. When the
master sends the `RunTaskMessage` to the agent, it believes the specified
executor is still running there, so it sets `launch_executor = false`:
{noformat}
// src/master/master.cpp:3841
bool Master::isLaunchExecutor(
    const ExecutorID& executorId,
    Framework* framework,
    Slave* slave) const
{
  CHECK_NOTNULL(framework);
  CHECK_NOTNULL(slave);

  if (!slave->hasExecutor(framework->id(), executorId)) {
    CHECK(!framework->hasExecutor(slave->id, executorId))
      << "Executor '" << executorId
      << "' known to the framework " << *framework
      << " but unknown to the agent " << *slave;

    return true;
  }

  return false;
}
{noformat}
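For context, the result of this check is what ends up on the wire to the agent.
Below is a rough sketch of the call site, assuming the usual generated protobuf
setters for the `launch_executor` field; the actual code in the master does
considerably more bookkeeping around this:
{noformat}
// Sketch (assumed shape, not a verbatim excerpt) of how the result of
// isLaunchExecutor() reaches the agent via the RunTaskMessage.
RunTaskMessage message;
message.mutable_framework()->CopyFrom(framework->info);
message.mutable_task()->CopyFrom(task);
message.set_launch_executor(
    isLaunchExecutor(task.executor().executor_id(), framework, slave));

send(slave->pid, message);
{noformat}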
On the agent side, when the executor no longer exists, the task is dropped with
reason `REASON_EXECUTOR_TERMINATED`:
{noformat}
// src/slave/slave.cpp:2881
// Master does not want to launch executor.
if (executor == nullptr) {
  // Master wants no new executor launched and there is none running on
  // the agent. This could happen if the task expects some previous
  // tasks to launch the executor. However, the earlier task got killed
  // or dropped hence did not launch the executor but the master doesn't
  // know about it yet because the `ExitedExecutorMessage` is still in
  // flight. In this case, we will drop the task.
  //
  // We report TASK_DROPPED to the framework because the task was
  // never launched. For non-partition-aware frameworks, we report
  // TASK_LOST for backward compatibility.
  mesos::TaskState taskState = TASK_DROPPED;
  if (!protobuf::frameworkHasCapability(
          frameworkInfo, FrameworkInfo::Capability::PARTITION_AWARE)) {
    taskState = TASK_LOST;
  }

  foreach (const TaskInfo& _task, tasks) {
    const StatusUpdate update = protobuf::createStatusUpdate(
        frameworkId,
        info.id(),
        _task.task_id(),
        taskState,
        TaskStatus::SOURCE_SLAVE,
        id::UUID::random(),
        "No executor is expected to launch and there is none running",
        TaskStatus::REASON_EXECUTOR_TERMINATED,
        executorId);

    statusUpdate(update, UPID());
  }

  // We do not send `ExitedExecutorMessage` here because the expectation
  // is that there is already one on the fly to master. If the message
  // gets dropped, we will hopefully reconcile with the master later.
  return;
}
{noformat}
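Taken together with the surrounding branches, the agent's handling of the flag
boils down to a small decision table. The following is a hypothetical
condensation, not Mesos code: `AgentAction` and `resolve()` are invented names,
and only the `launch_executor == false` / no-executor case corresponds to the
excerpt above:
{noformat}
// Hypothetical condensation of the agent-side decision, for illustration.
enum class AgentAction
{
  DELIVER_TO_EXECUTOR, // hand the task to the already-running executor
  LAUNCH_EXECUTOR,     // start a fresh executor for the task
  DROP_TASK            // the TASK_DROPPED path quoted above
};

AgentAction resolve(bool launchExecutor, bool executorRunning)
{
  if (!launchExecutor) {
    // Master believes an executor is already running on the agent.
    return executorRunning
      ? AgentAction::DELIVER_TO_EXECUTOR
      : AgentAction::DROP_TASK; // the excerpt above: executor == nullptr
  }

  // Master asked for a new executor; the agent proceeds to launch one
  // (additional consistency checks in the real code are omitted here).
  return AgentAction::LAUNCH_EXECUTOR;
}
{noformat}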
> Race between executor exited event and launch task can cause overcommit of
> resources
>
>
> Key: MESOS-1466
> URL: https://issues.apache.org/jira/browse/MESOS-1466
> Project: Mesos
> Issue Type: Bug
> Components: allocation, master
> Reporter: Vinod Kone
> Priority: Major
> Labels: reliability, twitter
>
> The following sequence of events can cause an overcommit
> --> Launch task is called for a task whose executor is already running
> --> Executor's resources are not accounted for on the master
> --> Executor exits and the event is enqueued behind launch tasks on the master
> --> Master sends the task to the slave, which needs to commit resources
> for the task and the (new) executor.
> --> Master processes the executor exited event and re-offers the executor's
> resources, causing an overcommit of resources.
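To make the accounting in this sequence concrete, here is a schematic
illustration with made-up numbers; plain integers stand in for resources, and
none of this is Mesos code:
{noformat}
// Schematic illustration of the pre-1.5 race described above.
#include <iostream>

int main()
{
  int allocated = 4; // master's view: one running executor using 4 cpus

  // (1) Task launched on the (assumed running) executor: only the
  //     task's resources are charged, not the executor's.
  allocated += 2;    // task needs 2 cpus -> allocated == 6

  // (2) The ExitedExecutorMessage queued behind the launch is processed:
  //     the master recovers and re-offers the executor's 4 cpus.
  allocated -= 4;    // allocated == 2

  // (3) Meanwhile the agent, finding no executor for the task, starts a
  //     new one, so the real footprint is task + new executor.
  const int realUsage = 2 + 4;

  std::cout << "overcommitted cpus: " << realUsage - allocated << std::endl;
  return 0;
}
{noformat}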