[ 
https://issues.apache.org/jira/browse/MESOS-5380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-5380:
--------------------------
    Description: 
We observed this in our testing environment. Here is the sequence of events:

1) A command task is queued; the executor has not registered yet.
2) The framework issues a killTask.
3) Since the executor is in the REGISTERING state, the agent calls 
`statusUpdate(TASK_KILLED, UPID())`.
4) `statusUpdate` now calls `containerizer->status()` before calling 
`executor->terminateTask(status.task_id(), status);`, which removes the 
queued task (introduced in this patch: https://reviews.apache.org/r/43258).
5) Since the above is asynchronous, the task may still be in the queued 
tasks when we check whether we need to kill the unregistered executor in 
`killTask`:
{code}
      // TODO(jieyu): Here, we kill the executor if it no longer has
      // any task to run and has not yet registered. This is a
      // workaround for those single task executors that do not have a
      // proper self terminating logic when they haven't received the
      // task within a timeout.
      if (executor->queuedTasks.empty()) {
        CHECK(executor->launchedTasks.empty())
            << " Unregistered executor '" << executor->id
            << "' has launched tasks";

        LOG(WARNING) << "Killing the unregistered executor " << *executor
                     << " because it has no tasks";

        executor->state = Executor::TERMINATING;

        containerizer->destroy(executor->containerId);
      }    
{code}

6) Because the check sees a non-empty queue, the executor is not destroyed, 
and it will never be terminated by Mesos after that.
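
To make the ordering problem in steps 4 to 6 concrete, here is a minimal, self-contained sketch. It is not Mesos code: `Executor`, `EventQueue`, and `killQueuedTask` are hypothetical stand-ins, and the event queue only imitates the effect of deferring `terminateTask` behind the asynchronous `containerizer->status()` call.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <set>
#include <string>
#include <utility>

// Hypothetical stand-in for the agent's per-executor bookkeeping.
struct Executor {
  std::set<std::string> queuedTasks;
  bool destroyed = false;
};

// Hypothetical single-threaded event queue that imitates deferred callbacks.
struct EventQueue {
  std::deque<std::function<void()>> pending;

  void defer(std::function<void()> f) { pending.push_back(std::move(f)); }

  // Runs every deferred callback in FIFO order.
  void drain() {
    while (!pending.empty()) {
      std::function<void()> f = std::move(pending.front());
      pending.pop_front();
      f();
    }
  }
};

// Models killing a queued task of an unregistered executor.
// When asyncStatus is true, removal of the queued task is deferred, so the
// "no tasks left" check below runs before the task has been removed.
// Returns whether the executor ended up destroyed.
bool killQueuedTask(bool asyncStatus) {
  Executor executor;
  executor.queuedTasks.insert("task-1");
  EventQueue queue;

  auto terminateTask = [&executor] { executor.queuedTasks.erase("task-1"); };

  if (asyncStatus) {
    queue.defer(terminateTask);  // Task removal happens later.
  } else {
    terminateTask();             // Task removal happens immediately.
  }

  // The check from killTask: destroy the executor only if nothing is queued.
  if (executor.queuedTasks.empty()) {
    executor.destroyed = true;
  }

  queue.drain();
  assert(executor.queuedTasks.empty());  // The task is removed either way.
  return executor.destroyed;
}
```

In this sketch, `killQueuedTask(false)` destroys the executor, while `killQueuedTask(true)` leaves it running with no tasks, which mirrors the behavior described above.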

  was:
We observed that in our testing environment. So here is the sequence of events:

1) A command task is queued, the executor is not registered yet
2) The framework issues a killTask
3) Since executor is in REGISTERING state, agent calls 
`statusUpdate(TASK_KILLED, UPID())`
4) `statusUpdate` now will call `containerizer->status()` before calling 
`executor->terminateTask(status.task_id(), status);` which will remove the 
queued task. (introduced in this patch https://reviews.apache.org/r/43258).
5) Since the above is async, it's possible that the task is still in queued 
task when we trying to see if we need to kill unregistered executor in 
`killTask`:
```
      // TODO(jieyu): Here, we kill the executor if it no longer has
      // any task to run and has not yet registered. This is a
      // workaround for those single task executors that do not have a
      // proper self terminating logic when they haven't received the
      // task within a timeout.
      if (executor->queuedTasks.empty()) {
        CHECK(executor->launchedTasks.empty())
            << " Unregistered executor '" << executor->id
            << "' has launched tasks";

        LOG(WARNING) << "Killing the unregistered executor " << *executor
                     << " because it has no tasks";

        executor->state = Executor::TERMINATING;

        containerizer->destroy(executor->containerId);
      }    
```
6) The executor will never be terminated by Mesos after that.


> Killing a queued task can cause the corresponding command executor to never 
> terminate.
> ------------------------------------------------------------------------------------
>
>                 Key: MESOS-5380
>                 URL: https://issues.apache.org/jira/browse/MESOS-5380
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 0.28.0, 0.28.1
>            Reporter: Jie Yu
>            Assignee: Vinod Kone
>            Priority: Blocker
>             Fix For: 0.29.0, 0.28.2


