[ https://issues.apache.org/jira/browse/MESOS-5380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jie Yu updated MESOS-5380:
--------------------------
    Description:
We observed that in our testing environment. So here is the sequence of events:

1) A command task is queued, the executor is not registered yet
2) The framework issues a killTask
3) Since executor is in REGISTERING state, agent calls `statusUpdate(TASK_KILLED, UPID())`
4) `statusUpdate` now will call `containerizer->status()` before calling `executor->terminateTask(status.task_id(), status);` which will remove the queued task. (introduced in this patch https://reviews.apache.org/r/43258).
5) Since the above is async, it's possible that the task is still in queued task when we trying to see if we need to kill unregistered executor in `killTask`:
{code}
      // TODO(jieyu): Here, we kill the executor if it no longer has
      // any task to run and has not yet registered. This is a
      // workaround for those single task executors that do not have a
      // proper self terminating logic when they haven't received the
      // task within a timeout.
      if (executor->queuedTasks.empty()) {
        CHECK(executor->launchedTasks.empty())
            << " Unregistered executor '" << executor->id
            << "' has launched tasks";

        LOG(WARNING) << "Killing the unregistered executor " << *executor
                     << " because it has no tasks";

        executor->state = Executor::TERMINATING;

        containerizer->destroy(executor->containerId);
      }
{code}
6) The executor will never be terminated by Mesos after that.

  was:
We observed that in our testing environment. So here is the sequence of events:

1) A command task is queued, the executor is not registered yet
2) The framework issues a killTask
3) Since executor is in REGISTERING state, agent calls `statusUpdate(TASK_KILLED, UPID())`
4) `statusUpdate` now will call `containerizer->status()` before calling `executor->terminateTask(status.task_id(), status);` which will remove the queued task. (introduced in this patch https://reviews.apache.org/r/43258).
5) Since the above is async, it's possible that the task is still in queued task when we trying to see if we need to kill unregistered executor in `killTask`:
```
      // TODO(jieyu): Here, we kill the executor if it no longer has
      // any task to run and has not yet registered. This is a
      // workaround for those single task executors that do not have a
      // proper self terminating logic when they haven't received the
      // task within a timeout.
      if (executor->queuedTasks.empty()) {
        CHECK(executor->launchedTasks.empty())
            << " Unregistered executor '" << executor->id
            << "' has launched tasks";

        LOG(WARNING) << "Killing the unregistered executor " << *executor
                     << " because it has no tasks";

        executor->state = Executor::TERMINATING;

        containerizer->destroy(executor->containerId);
      }
```
6) The executor will never be terminated by Mesos after that.


> Killing a queued task can cause the corresponding command executor never
> terminates.
> ------------------------------------------------------------------------------------
>
>                 Key: MESOS-5380
>                 URL: https://issues.apache.org/jira/browse/MESOS-5380
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 0.28.0, 0.28.1
>            Reporter: Jie Yu
>            Assignee: Vinod Kone
>            Priority: Blocker
>             Fix For: 0.29.0, 0.28.2
>
>
> We observed that in our testing environment. So here is the sequence of
> events:
> 1) A command task is queued, the executor is not registered yet
> 2) The framework issues a killTask
> 3) Since executor is in REGISTERING state, agent calls
> `statusUpdate(TASK_KILLED, UPID())`
> 4) `statusUpdate` now will call `containerizer->status()` before calling
> `executor->terminateTask(status.task_id(), status);` which will remove the
> queued task. (introduced in this patch https://reviews.apache.org/r/43258).
> 5) Since the above is async, it's possible that the task is still in queued
> task when we trying to see if we need to kill unregistered executor in
> `killTask`:
> {code}
>       // TODO(jieyu): Here, we kill the executor if it no longer has
>       // any task to run and has not yet registered. This is a
>       // workaround for those single task executors that do not have a
>       // proper self terminating logic when they haven't received the
>       // task within a timeout.
>       if (executor->queuedTasks.empty()) {
>         CHECK(executor->launchedTasks.empty())
>             << " Unregistered executor '" << executor->id
>             << "' has launched tasks";
>         LOG(WARNING) << "Killing the unregistered executor " << *executor
>                      << " because it has no tasks";
>         executor->state = Executor::TERMINATING;
>         containerizer->destroy(executor->containerId);
>       }
> {code}
> 6) The executor will never be terminated by Mesos after that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
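
The ordering problem in steps 3) to 6) can be reproduced in a small standalone model. The following is only a sketch, not Mesos code: `EventQueue`, the simplified `Executor`, `removeQueuedTask`, and `executorLeaksAfterKill` are hypothetical stand-ins, and the deferred callback merely approximates the asynchronous `containerizer->status()` call completing later.

```cpp
#include <algorithm>
#include <cassert>
#include <deque>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for the agent's executor bookkeeping.
struct Executor {
  std::vector<std::string> queuedTasks;
  bool terminating = false;
};

// Hypothetical single-threaded event queue; it only models "runs later".
struct EventQueue {
  std::deque<std::function<void()>> pending;

  void defer(std::function<void()> f) { pending.push_back(std::move(f)); }

  void drain() {
    while (!pending.empty()) {
      std::function<void()> f = std::move(pending.front());
      pending.pop_front();
      f();
    }
  }
};

// Models the effect of terminateTask on a queued task: remove it.
void removeQueuedTask(Executor& e, const std::string& taskId) {
  e.queuedTasks.erase(
      std::remove(e.queuedTasks.begin(), e.queuedTasks.end(), taskId),
      e.queuedTasks.end());
}

// Models step 4: the queued task is removed only after an asynchronous
// call completes, so the removal is deferred rather than done inline.
void statusUpdate(EventQueue& q, Executor& e, const std::string& taskId) {
  q.defer([&e, taskId]() { removeQueuedTask(e, taskId); });
}

// Models steps 3 and 5: the emptiness check runs right after statusUpdate
// returns, before the deferred removal has had a chance to run.
void killTask(EventQueue& q, Executor& e, const std::string& taskId) {
  statusUpdate(q, e, taskId);
  if (e.queuedTasks.empty()) {
    e.terminating = true;  // Stands in for containerizer->destroy(...).
  }
}

// Returns true when the executor ends up with no tasks yet was never
// marked for termination, which is the outcome described in step 6.
bool executorLeaksAfterKill() {
  EventQueue q;
  Executor e;
  e.queuedTasks = {"task-1"};
  killTask(q, e, "task-1");
  q.drain();
  return e.queuedTasks.empty() && !e.terminating;
}
```

In this model, calling `executorLeaksAfterKill()` returns `true`: the check in `killTask` sees a non-empty queue, the deferred removal then empties it, and nothing revisits the decision to destroy the executor.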