[ https://issues.apache.org/jira/browse/MESOS-5380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jie Yu updated MESOS-5380:
--------------------------
    Labels: mesosphere  (was: )

> Killing a queued task can cause the corresponding command executor to never terminate.
> --------------------------------------------------------------------------------------
>
>                 Key: MESOS-5380
>                 URL: https://issues.apache.org/jira/browse/MESOS-5380
>             Project: Mesos
>          Issue Type: Bug
>          Components: slave
>    Affects Versions: 0.28.0, 0.28.1
>            Reporter: Jie Yu
>            Assignee: Vinod Kone
>            Priority: Blocker
>              Labels: mesosphere
>             Fix For: 0.29.0, 0.28.2
>
>
> We observed this in our testing environment. Sequence of events:
> 1) A command task is queued since the executor has not registered yet.
> 2) The framework issues a killTask.
> 3) Since the executor is in REGISTERING state, the agent calls `statusUpdate(TASK_KILLED, UPID())`.
> 4) `statusUpdate` now calls `containerizer->status()` before calling `executor->terminateTask(status.task_id(), status);`, which removes the queued task. (Introduced in this patch: https://reviews.apache.org/r/43258).
> 5) Since the above is async, it's possible that the task is still in the queued tasks when we check whether we need to kill the unregistered executor in `killTask`:
> {code}
> // TODO(jieyu): Here, we kill the executor if it no longer has
> // any task to run and has not yet registered. This is a
> // workaround for those single task executors that do not have a
> // proper self terminating logic when they haven't received the
> // task within a timeout.
> if (executor->queuedTasks.empty()) {
>   CHECK(executor->launchedTasks.empty())
>     << " Unregistered executor '" << executor->id
>     << "' has launched tasks";
>   LOG(WARNING) << "Killing the unregistered executor " << *executor
>                << " because it has no tasks";
>   executor->state = Executor::TERMINATING;
>   containerizer->destroy(executor->containerId);
> }
> {code}
> 6) Consequently, the `queuedTasks.empty()` check fails, the block above is skipped, and the executor will never be terminated by Mesos.
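
The ordering problem in steps 3 to 5 can be reproduced in isolation. The following is a minimal sketch, not Mesos code: `Executor`, `pending`, `statusUpdate`, `killTask`, and `simulateKillOfQueuedTask` are hypothetical stand-ins, and a plain callback queue is assumed in place of the agent's asynchronous `containerizer->status()` call. It only illustrates why a check that runs before the deferred removal sees a non-empty queue.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <set>
#include <string>

// Hypothetical stand-in for the agent's per-executor bookkeeping.
struct Executor {
  enum State { REGISTERING, TERMINATING };
  State state = REGISTERING;
  std::set<std::string> queuedTasks;
};

// Hypothetical stand-in for an event loop: callbacks run later, in order.
static std::queue<std::function<void()>> pending;

// Models steps 3 and 4: the queued task is only removed once the
// deferred callback runs, not at the time of the call.
void statusUpdate(Executor& executor, const std::string& taskId) {
  pending.push([&executor, taskId]() { executor.queuedTasks.erase(taskId); });
}

// Models step 5: the emptiness check runs immediately after the update
// was requested, so it still sees the queued task.
void killTask(Executor& executor, const std::string& taskId) {
  statusUpdate(executor, taskId);
  if (executor.queuedTasks.empty()) {
    executor.state = Executor::TERMINATING;  // Never reached in this ordering.
  }
}

// Returns true if the executor ends up with no tasks but was never
// moved to TERMINATING, i.e. the situation described in step 6.
bool simulateKillOfQueuedTask() {
  Executor executor;
  executor.queuedTasks.insert("task-1");  // Step 1: task is queued.

  killTask(executor, "task-1");           // Step 2: framework kills it.

  // Drain the deferred work, as the event loop eventually would.
  while (!pending.empty()) {
    pending.front()();
    pending.pop();
  }

  return executor.queuedTasks.empty() &&
         executor.state == Executor::REGISTERING;
}
```

In this sketch `simulateKillOfQueuedTask()` returns true: the task is gone once the deferred callback runs, but nothing re-evaluates the check afterwards, so the executor stays in REGISTERING.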