[jira] [Commented] (MESOS-1466) Race between executor exited event and launch task can cause overcommit of resources

2018-03-27 Thread Benno Evers (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16415612#comment-16415612
 ] 

Benno Evers commented on MESOS-1466:


If I understand the issue correctly, this race seems to have been eliminated as 
a side-effect of introducing the `launch_executor` flag in Mesos 1.5:

When the master sends the `RunTaskMessage` to the agent, it thinks that the 
specified executor is still running on the agent, so it will set 
`launch_executor = false`:
{noformat}
// src/master/master.cpp:3841
bool Master::isLaunchExecutor(
    const ExecutorID& executorId,
    Framework* framework,
    Slave* slave) const
{
  CHECK_NOTNULL(framework);
  CHECK_NOTNULL(slave);

  if (!slave->hasExecutor(framework->id(), executorId)) {
    CHECK(!framework->hasExecutor(slave->id, executorId))
      << "Executor '" << executorId
      << "' known to the framework " << *framework
      << " but unknown to the agent " << *slave;

    return true;
  }

  return false;
}{noformat}
On the slave, when the executor doesn't exist anymore, the task is dropped with 
reason `REASON_EXECUTOR_TERMINATED`:
{noformat}
// src/slave/slave.cpp:2881

    // Master does not want to launch executor.
    if (executor == nullptr) {
      // Master wants no new executor launched and there is none running on
      // the agent. This could happen if the task expects some previous
      // tasks to launch the executor. However, the earlier task got killed
      // or dropped hence did not launch the executor but the master doesn't
      // know about it yet because the `ExitedExecutorMessage` is still in
      // flight. In this case, we will drop the task.
      //
      // We report TASK_DROPPED to the framework because the task was
      // never launched. For non-partition-aware frameworks, we report
      // TASK_LOST for backward compatibility.
      mesos::TaskState taskState = TASK_DROPPED;
      if (!protobuf::frameworkHasCapability(
              frameworkInfo, FrameworkInfo::Capability::PARTITION_AWARE)) {
        taskState = TASK_LOST;
      }

      foreach (const TaskInfo& _task, tasks) {
        const StatusUpdate update = protobuf::createStatusUpdate(
            frameworkId,
            info.id(),
            _task.task_id(),
            taskState,
            TaskStatus::SOURCE_SLAVE,
            id::UUID::random(),
            "No executor is expected to launch and there is none running",
            TaskStatus::REASON_EXECUTOR_TERMINATED,
            executorId);

        statusUpdate(update, UPID());
      }

      // We do not send `ExitedExecutorMessage` here because the expectation
      // is that there is already one on the fly to master. If the message
      // gets dropped, we will hopefully reconcile with the master later.

      return;
    }{noformat}
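
To make the interaction of the two snippets concrete, here is a minimal, self-contained model of both decisions. It is only a sketch, not Mesos code: `MasterView`, `AgentState`, `decideLaunchExecutor`, `handleRunTask` and `Outcome` are hypothetical names, and executors are reduced to plain string IDs.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical, simplified stand-ins for the master's and the agent's
// bookkeeping; real Mesos tracks far more state than a set of IDs.
struct MasterView {
  std::set<std::string> executors;  // Executors the master believes are running.
};

struct AgentState {
  std::set<std::string> executors;  // Executors actually running on the agent.
};

enum class Outcome { LAUNCHED_ON_EXISTING, LAUNCHED_NEW_EXECUTOR, DROPPED };

// Mirrors the idea of `Master::isLaunchExecutor`: request a new executor
// only if the master does not already know of one.
bool decideLaunchExecutor(const MasterView& master, const std::string& id) {
  return master.executors.count(id) == 0;
}

// Mirrors the agent-side branch quoted above: without permission to launch
// an executor and without a running one, the task is dropped.
Outcome handleRunTask(
    AgentState& agent, const std::string& id, bool launchExecutor) {
  const bool running = agent.executors.count(id) > 0;

  if (launchExecutor) {
    // Out of scope for this sketch: the master asks for a new executor
    // while the agent still has one running.
    assert(!running);
    agent.executors.insert(id);
    return Outcome::LAUNCHED_NEW_EXECUTOR;
  }

  if (!running) {
    return Outcome::DROPPED;
  }

  return Outcome::LAUNCHED_ON_EXISTING;
}
```

In the race from the issue description, the executor has already exited on the agent while the master's view is stale. `decideLaunchExecutor` then returns `false`, and `handleRunTask` returns `Outcome::DROPPED` instead of starting a second executor, so in this model no executor resources are committed behind the master's back.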

> Race between executor exited event and launch task can cause overcommit of 
> resources
> 
>
>                 Key: MESOS-1466
>                 URL: https://issues.apache.org/jira/browse/MESOS-1466
>             Project: Mesos
>          Issue Type: Bug
>          Components: allocation, master
>            Reporter: Vinod Kone
>            Priority: Major
>              Labels: reliability, twitter
>
> The following sequence of events can cause an overcommit
> --> Launch task is called for a task whose executor is already running
> --> Executor's resources are not accounted for on the master
> --> Executor exits and the event is enqueued behind launch tasks on the master
> --> Master sends the task to the slave which needs to commit for resources 
> for task and the (new) executor.
> --> Master processes the executor exited event and re-offers the executor's 
> resources causing an overcommit of resources.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-1466) Race between executor exited event and launch task can cause overcommit of resources

2014-08-18 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14101757#comment-14101757
 ] 

Benjamin Mahler commented on MESOS-1466:


We're going to proceed with a mitigation of this by rejecting tasks once the 
slave is overcommitted:
https://issues.apache.org/jira/browse/MESOS-1721

However, we would also like to ensure that this kind of race is not possible. 
One solution is to use master acknowledgments for executor exits:

(1) When an executor terminates (or the executor could not be launched: 
MESOS-1720), we send an exited executor message.
(2) The master acknowledges these messages.
(3) The slave will not accept tasks for unacknowledged terminal executors (this 
must include those executors that could not be launched, per MESOS-1720).

The result of this is that a new executor cannot be launched until the master 
is aware of the old executor exiting.
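
The three steps above can be sketched as agent-side bookkeeping. This is a hedged illustration of the proposal only, not an existing Mesos interface: the class `ExecutorExitTracker` and its method names are hypothetical, and executors are identified by plain strings.

```cpp
#include <set>
#include <string>

// Hypothetical agent-side bookkeeping for the proposed protocol.
class ExecutorExitTracker {
 public:
  // Step (1): the executor terminated (or could not be launched); the
  // exit is reported to the master and remembered as unacknowledged.
  void executorExited(const std::string& executorId) {
    unacknowledged_.insert(executorId);
  }

  // Step (2): the master acknowledged the exit. Returns false if there
  // was no pending exit for this executor.
  bool acknowledged(const std::string& executorId) {
    return unacknowledged_.erase(executorId) > 0;
  }

  // Step (3): tasks for executors with an unacknowledged exit are rejected.
  bool acceptsTaskFor(const std::string& executorId) const {
    return unacknowledged_.count(executorId) == 0;
  }

 private:
  std::set<std::string> unacknowledged_;
};
```

With this bookkeeping, `acceptsTaskFor` stays `false` for an executor from the moment it exits until the acknowledgment arrives, which is exactly the window in which the master's view is stale.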

> Race between executor exited event and launch task can cause overcommit of
> resources
>
>
>                 Key: MESOS-1466
>                 URL: https://issues.apache.org/jira/browse/MESOS-1466
>             Project: Mesos
>          Issue Type: Bug
>          Components: allocation, master
>            Reporter: Vinod Kone
>            Assignee: Benjamin Mahler
>              Labels: reliability
>
> The following sequence of events can cause an overcommit
> --> Launch task is called for a task whose executor is already running
> --> Executor's resources are not accounted for on the master
> --> Executor exits and the event is enqueued behind launch tasks on the master
> --> Master sends the task to the slave which needs to commit for resources
> for task and the (new) executor.
> --> Master processes the executor exited event and re-offers the executor's
> resources causing an overcommit of resources.


