[jira] [Comment Edited] (MESOS-5380) Killing a queued task can cause the corresponding command executor to never terminate.
[ https://issues.apache.org/jira/browse/MESOS-5380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15299539#comment-15299539 ]

Vinod Kone edited comment on MESOS-5380 at 5/25/16 6:30 AM:

Backported the fixes to 0.28.x.

commit b52a41df090fdf15c65e01805adbb6795bc34e78
Author: Vinod Kone
Date:   Sun May 15 12:31:31 2016 -0700

    Fixed agent to properly handle killTask during agent restart.

    ***Modified for 0.28.2***

    If the agent restarts after handling killTask but before sending the
    shutdown message to the executor, we ensure the executor terminates.

    Review: https://reviews.apache.org/r/47402

commit 8f73932f851096a2d1fdcd72b239be1e2e53cc58
Author: Vinod Kone
Date:   Fri May 13 16:06:49 2016 -0700

    Fixed agent to properly handle killTask of unregistered executor.

    ***Modified for 0.28.2***

    The agent now shuts down the executor during registration if it does
    not have any queued tasks (e.g., the framework sent a killTask before
    registration). Note that if the executor doesn't register at all, it
    will be cleaned up anyway after the registration timeout. Also note
    that this doesn't handle the case where the agent restarts after
    processing the killTask but before cleaning up the executor.

    Review: https://reviews.apache.org/r/47381

was (Author: vinodkone): Backported the fix to 0.28.x.

> Killing a queued task can cause the corresponding command executor to never
> terminate.
> --
>
>                 Key: MESOS-5380
>                 URL: https://issues.apache.org/jira/browse/MESOS-5380
>             Project: Mesos
>          Issue Type: Bug
>          Components: slave
>    Affects Versions: 0.28.0, 0.28.1
>            Reporter: Jie Yu
>            Assignee: Vinod Kone
>            Priority: Blocker
>              Labels: mesosphere
>             Fix For: 0.29.0, 0.28.2
>
>
> We observed this in our testing environment. Sequence of events:
> 1) A command task is queued since the executor has not registered yet.
> 2) The framework issues a killTask.
> 3) Since the executor is in REGISTERING state, the agent calls
> `statusUpdate(TASK_KILLED, UPID())`.
> 4) `statusUpdate` now calls `containerizer->status()` before calling
> `executor->terminateTask(status.task_id(), status);`, which removes the
> queued task. (Introduced in this patch: https://reviews.apache.org/r/43258.)
> 5) Since the above is asynchronous, it's possible that the task is still in
> `queuedTasks` when we check whether we need to kill the unregistered
> executor in `killTask`:
> {code}
> // TODO(jieyu): Here, we kill the executor if it no longer has
> // any task to run and has not yet registered. This is a
> // workaround for those single task executors that do not have a
> // proper self terminating logic when they haven't received the
> // task within a timeout.
> if (executor->queuedTasks.empty()) {
>   CHECK(executor->launchedTasks.empty())
>     << " Unregistered executor '" << executor->id
>     << "' has launched tasks";
>
>   LOG(WARNING) << "Killing the unregistered executor " << *executor
>                << " because it has no tasks";
>
>   executor->state = Executor::TERMINATING;
>
>   containerizer->destroy(executor->containerId);
> }
> {code}
> 6) Consequently, the executor will never be terminated by Mesos.
> Attaching the relevant agent log:
> {noformat}
> May 13 15:36:13 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: I0513 15:36:13.640527 1342 slave.cpp:1361] Got assigned task mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6 for framework a3ad8418-cb77-4705-b353-4b514ceca52c-
> May 13 15:36:13 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: I0513 15:36:13.641034 1342 slave.cpp:1480] Launching task mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6 for framework a3ad8418-cb77-4705-b353-4b514ceca52c-
> May 13 15:36:13 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: I0513 15:36:13.641440 1342 paths.cpp:528] Trying to chown '/var/lib/mesos/slave/slaves/a3ad8418-cb77-4705-b353-4b514ceca52c-S0/frameworks/a3ad8418-cb77-4705-b353-4b514ceca52c-/executors/mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6/runs/24762d43-2134-475e-b724-caa72110497a' to user 'root'
> May 13 15:36:13 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: I0513 15:36:13.644664 1342 slave.cpp:5389] Launching executor mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6 of framework a3ad8418-cb77-4705-b353-4b514ceca52c- with resources cpus(*):0.1; mem(*):32 in work directory '/var/lib/mesos/slave/slaves/a3ad8418-cb77-4705-b353-4b514ceca52c-S0/frameworks/a3ad8418-cb77-4705-b353-4b514ceca52c-/executors/mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6/runs/24762d43-2134-475e-b724-caa72110497a'
> May 13 15:36:13 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: I0513 15:36:13.645195 1342 slave.cpp:1698] Queuing task 'mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6' for executor 'mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6' of framework a3ad8418-cb77-4705-b353-4b514ceca52c-
> May 13 15:36:13 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: I0513 15:36:13.645491 1338 containerizer.cpp:671] Starting container '24762d43-2134-475e-b724-caa72110497a' for executor 'mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6' of framework 'a3ad8418-cb77-4705-b353-4b514ceca52c-'
> May 13 15:36:13 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: I0513 15:36:13.647897 1345 cpushare.cpp:389] Updated 'cpu.shares' to 1126 (cpus 1.1) for container 24762d43-2134-475e-b724-caa72110497a
> May 13 15:36:13 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: I0513 15:36:13.648619 1345 cpushare.cpp:411] Updated 'cpu.cfs_period_us' to 100ms and 'cpu.cfs_quota_us' to 110ms
> {noformat}
[jira] [Comment Edited] (MESOS-5380) Killing a queued task can cause the corresponding command executor to never terminate.
[ https://issues.apache.org/jira/browse/MESOS-5380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15283944#comment-15283944 ]

Vinod Kone edited comment on MESOS-5380 at 5/15/16 8:08 PM:

Phase 2: https://reviews.apache.org/r/47402/

was (Author: vinodkone): https://reviews.apache.org/r/47402/