[ https://issues.apache.org/jira/browse/MESOS-5380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15299446#comment-15299446 ]
Vinod Kone commented on MESOS-5380:
-----------------------------------

Committed phase 2.

commit 8be9b5b5decd9ec2bcad547b1dff29b777cbc438
Author: Vinod Kone <vinodk...@gmail.com>
Date:   Sun May 15 12:31:31 2016 -0700

    Fixed agent to properly handle killTask during agent restart.

    If the agent restarts after handling killTask but before sending the
    shutdown message to the executor, we ensure the executor terminates.

    Review: https://reviews.apache.org/r/47402


> Killing a queued task can cause the corresponding command executor to never
> terminate.
> --------------------------------------------------------------------------------------
>
>                 Key: MESOS-5380
>                 URL: https://issues.apache.org/jira/browse/MESOS-5380
>             Project: Mesos
>          Issue Type: Bug
>          Components: slave
>    Affects Versions: 0.28.0, 0.28.1
>            Reporter: Jie Yu
>            Assignee: Vinod Kone
>            Priority: Blocker
>              Labels: mesosphere
>             Fix For: 0.29.0, 0.28.2
>
>
> We observed this in our testing environment. Sequence of events:
> 1) A command task is queued since the executor has not registered yet.
> 2) The framework issues a killTask.
> 3) Since the executor is in REGISTERING state, the agent calls
> `statusUpdate(TASK_KILLED, UPID())`.
> 4) `statusUpdate` now calls `containerizer->status()` before calling
> `executor->terminateTask(status.task_id(), status);`, which removes the
> queued task. (Introduced in this patch: https://reviews.apache.org/r/43258.)
> 5) Since the above is asynchronous, it's possible that the task is still in
> `queuedTasks` when we try to see if we need to kill the unregistered
> executor in `killTask`:
> {code}
>   // TODO(jieyu): Here, we kill the executor if it no longer has
>   // any task to run and has not yet registered. This is a
>   // workaround for those single task executors that do not have a
>   // proper self terminating logic when they haven't received the
>   // task within a timeout.
>   if (executor->queuedTasks.empty()) {
>     CHECK(executor->launchedTasks.empty())
>       << " Unregistered executor '" << executor->id
>       << "' has launched tasks";
>
>     LOG(WARNING) << "Killing the unregistered executor " << *executor
>                  << " because it has no tasks";
>
>     executor->state = Executor::TERMINATING;
>
>     containerizer->destroy(executor->containerId);
>   }
> {code}
> 6) Consequently, the executor will never be terminated by Mesos.
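
For illustration, a minimal self-contained sketch of the race in steps 3)-5)
(hypothetical names, with a hand-rolled callback queue standing in for the
future returned by `containerizer->status()`; this is not the actual agent
code):

{code}
#include <deque>
#include <functional>
#include <iostream>
#include <map>
#include <string>

// Hypothetical stand-in for the agent's per-executor bookkeeping.
struct Executor
{
  enum State { REGISTERING, TERMINATING } state = REGISTERING;
  std::map<std::string, std::string> queuedTasks; // Task ID -> task info.
};

int main()
{
  // Callbacks queued here model asynchronous continuations: they only run
  // after the current call chain has returned.
  std::deque<std::function<void()>> pending;

  Executor executor;
  executor.queuedTasks["task-1"] = "queued command task";

  bool containerDestroyed = false;

  // killTask path for a task whose executor has not registered yet:
  // statusUpdate(TASK_KILLED, ...) chains the removal of the queued task
  // behind the asynchronous containerizer call, so it does not happen now.
  pending.push_back([&executor]() {
    executor.queuedTasks.erase("task-1"); // terminateTask(), but too late.
  });

  // Still inside killTask: decide whether to destroy the unregistered
  // executor. The queued task has not been removed yet, so the check fails
  // and the destroy is skipped.
  if (executor.queuedTasks.empty()) {
    executor.state = Executor::TERMINATING;
    containerDestroyed = true; // Stands in for containerizer->destroy(...).
  }

  // The asynchronous continuation finally runs and removes the queued task,
  // but nothing revisits the destroy decision above.
  while (!pending.empty()) {
    pending.front()();
    pending.pop_front();
  }

  std::cout << "queuedTasks empty: " << executor.queuedTasks.empty()
            << ", container destroyed: " << containerDestroyed << std::endl;
  // Prints "queuedTasks empty: 1, container destroyed: 0": the task is gone
  // but the executor's container was never destroyed.
  return 0;
}
{code}

Because the empty-queue check runs before the continuation, the destroy path
is skipped and never revisited, which is step 6): a fix has to either remove
the queued task before the asynchronous call or re-evaluate the check after
the continuation completes.
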
> Attaching the relevant agent log: > {noformat} > May 13 15:36:13 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: > I0513 15:36:13.640527 1342 slave.cpp:1361] Got assigned task > mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6 for framework > a3ad8418-cb77-4705-b353-4b514ceca52c-0000 > May 13 15:36:13 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: > I0513 15:36:13.641034 1342 slave.cpp:1480] Launching task > mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6 for framework > a3ad8418-cb77-4705-b353-4b514ceca52c-0000 > May 13 15:36:13 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: > I0513 15:36:13.641440 1342 paths.cpp:528] Trying to chown > '/var/lib/mesos/slave/slaves/a3ad8418-cb77-4705-b353-4b514ceca52c-S0/frameworks/a3ad8418-cb77-4705-b353-4b514ceca52c-0000/executors/mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6/runs/24762d43-2134-475e-b724-caa72110497a' > to user 'root' > May 13 15:36:13 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: > I0513 15:36:13.644664 1342 slave.cpp:5389] Launching executor > mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6 of framework > a3ad8418-cb77-4705-b353-4b514ceca52c-0000 with resources cpus(*):0.1; > mem(*):32 in work directory > '/var/lib/mesos/slave/slaves/a3ad8418-cb77-4705-b353-4b514ceca52c-S0/frameworks/a3ad8418-cb77-4705-b353-4b514ceca52c-0000/executors/mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6/runs/24762d43-2134-475e-b724-caa72110497a' > May 13 15:36:13 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: > I0513 15:36:13.645195 1342 slave.cpp:1698] Queuing task > 'mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6' for executor > 'mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6' of framework > a3ad8418-cb77-4705-b353-4b514ceca52c-0000 > May 13 15:36:13 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: > I0513 15:36:13.645491 1338 containerizer.cpp:671] Starting container > '24762d43-2134-475e-b724-caa72110497a' for executor > 'mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6' of framework > 'a3ad8418-cb77-4705-b353-4b514ceca52c-0000' > May 13 15:36:13 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: > I0513 15:36:13.647897 1345 cpushare.cpp:389] Updated 'cpu.shares' to 1126 > (cpus 1.1) for container 24762d43-2134-475e-b724-caa72110497a > May 13 15:36:13 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: > I0513 15:36:13.648619 1345 cpushare.cpp:411] Updated 'cpu.cfs_period_us' to > 100ms and 'cpu.cfs_quota_us' to 110ms (cpus 1.1) for container > 24762d43-2134-475e-b724-caa72110497a > May 13 15:36:13 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: > I0513 15:36:13.650180 1341 mem.cpp:602] Started listening for OOM events for > container 24762d43-2134-475e-b724-caa72110497a > May 13 15:36:13 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: > I0513 15:36:13.650718 1341 mem.cpp:722] Started listening on low memory > pressure events for container 24762d43-2134-475e-b724-caa72110497a > May 13 15:36:13 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: > I0513 15:36:13.651147 1341 mem.cpp:722] Started listening on medium memory > pressure events for container 24762d43-2134-475e-b724-caa72110497a > May 13 15:36:13 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: > I0513 15:36:13.651599 1341 mem.cpp:722] Started listening on critical memory > pressure events for container 24762d43-2134-475e-b724-caa72110497a > May 13 15:36:13 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: > I0513 15:36:13.652015 1341 mem.cpp:353] Updated 
'memory.soft_limit_in_bytes' > to 160MB for container 24762d43-2134-475e-b724-caa72110497a > May 13 15:36:13 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: > I0513 15:36:13.652719 1341 mem.cpp:388] Updated 'memory.limit_in_bytes' to > 160MB for container 24762d43-2134-475e-b724-caa72110497a > May 13 15:36:25 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: > I0513 15:36:25.508930 1342 slave.cpp:1891] Asked to kill task > mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6 of framework > a3ad8418-cb77-4705-b353-4b514ceca52c-0000 > May 13 15:36:25 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: > I0513 15:36:25.509063 1342 slave.cpp:3048] Handling status update > TASK_KILLED (UUID: f9d15955-6c9a-4a73-98c3-97c0128510ba) for task > mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6 of framework > a3ad8418-cb77-4705-b353-4b514ceca52c-0000 from @0.0.0.0:0 > May 13 15:36:25 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: > I0513 15:36:25.509702 1340 disk.cpp:169] Updating the disk resources for > container 24762d43-2134-475e-b724-caa72110497a to cpus(*):0.1; mem(*):32 > May 13 15:36:25 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: > I0513 15:36:25.510298 1343 mem.cpp:353] Updated 'memory.soft_limit_in_bytes' > to 32MB for container 24762d43-2134-475e-b724-caa72110497a > May 13 15:36:25 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: > I0513 15:36:25.510349 1341 cpushare.cpp:389] Updated 'cpu.shares' to 102 > (cpus 0.1) for container 24762d43-2134-475e-b724-caa72110497a > May 13 15:36:25 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: > I0513 15:36:25.511102 1343 mem.cpp:388] Updated 'memory.limit_in_bytes' to > 32MB for container 24762d43-2134-475e-b724-caa72110497a > May 13 15:36:25 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: > I0513 15:36:25.511495 1341 cpushare.cpp:411] Updated 'cpu.cfs_period_us' to > 100ms and 'cpu.cfs_quota_us' to 10ms (cpus 0.1) for container > 24762d43-2134-475e-b724-caa72110497a > May 13 15:36:25 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: > I0513 15:36:25.511715 1341 status_update_manager.cpp:320] Received status > update TASK_KILLED (UUID: f9d15955-6c9a-4a73-98c3-97c0128510ba) for task > mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6 of framework > a3ad8418-cb77-4705-b353-4b514ceca52c-0000 > May 13 15:36:25 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: > I0513 15:36:25.512032 1341 status_update_manager.cpp:824] Checkpointing > UPDATE for status update TASK_KILLED (UUID: > f9d15955-6c9a-4a73-98c3-97c0128510ba) for task > mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6 of framework > a3ad8418-cb77-4705-b353-4b514ceca52c-0000 > May 13 15:36:25 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: > I0513 15:36:25.513849 1343 slave.cpp:3446] Forwarding the update TASK_KILLED > (UUID: f9d15955-6c9a-4a73-98c3-97c0128510ba) for task > mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6 of framework > a3ad8418-cb77-4705-b353-4b514ceca52c-0000 to master@10.0.5.79:5050 > May 13 15:36:25 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: > I0513 15:36:25.528929 1344 status_update_manager.cpp:392] Received status > update acknowledgement (UUID: f9d15955-6c9a-4a73-98c3-97c0128510ba) for task > mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6 of framework > a3ad8418-cb77-4705-b353-4b514ceca52c-0000 > May 13 15:36:25 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: > I0513 15:36:25.529002 1344 status_update_manager.cpp:824] Checkpointing ACK > for status update 
TASK_KILLED (UUID: f9d15955-6c9a-4a73-98c3-97c0128510ba) > for task mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6 of framework > a3ad8418-cb77-4705-b353-4b514ceca52c-0000 > May 13 15:36:28 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: > I0513 15:36:28.199105 1345 isolator.cpp:469] Mounting docker volume mount > point '//var/lib/rexray/volumes/jdef-test-125/data' to > '/var/lib/mesos/slave/slaves/a3ad8418-cb77-4705-b353-4b514ceca52c-S0/frameworks/a3ad8418-cb77-4705-b353-4b514ceca52c-0000/executors/mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6/runs/24762d43-2134-475e-b724-caa72110497a/data' > for container 24762d43-2134-475e-b724-caa72110497a > May 13 15:36:28 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: > I0513 15:36:28.207062 1338 containerizer.cpp:1184] Checkpointing executor's > forked pid 5810 to > '/var/lib/mesos/slave/meta/slaves/a3ad8418-cb77-4705-b353-4b514ceca52c-S0/frameworks/a3ad8418-cb77-4705-b353-4b514ceca52c-0000/executors/mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6/runs/24762d43-2134-475e-b724-caa72110497a/pids/forked.pid' > May 13 15:36:28 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: > I0513 15:36:28.832330 1338 slave.cpp:2689] Got registration for executor > 'mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6' of framework > a3ad8418-cb77-4705-b353-4b514ceca52c-0000 from executor(1)@10.0.2.74:46154 > May 13 15:36:28 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: > I0513 15:36:28.833149 1345 disk.cpp:169] Updating the disk resources for > container 24762d43-2134-475e-b724-caa72110497a to cpus(*):0.1; mem(*):32 > May 13 15:36:28 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: > I0513 15:36:28.833804 1342 mem.cpp:353] Updated 'memory.soft_limit_in_bytes' > to 32MB for container 24762d43-2134-475e-b724-caa72110497a > May 13 15:36:28 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: > I0513 15:36:28.833871 1340 cpushare.cpp:389] Updated 'cpu.shares' to 102 > (cpus 0.1) for container 24762d43-2134-475e-b724-caa72110497a > May 13 15:36:28 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: > I0513 15:36:28.835160 1340 cpushare.cpp:411] Updated 'cpu.cfs_period_us' to > 100ms and 'cpu.cfs_quota_us' to 10ms (cpus 0.1) for container > 24762d43-2134-475e-b724-caa72110497a > May 13 16:58:30 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[30985]: > 5804 'mesos-logrotate-logger --help=false > --log_filename=/var/lib/mesos/slave/slaves/a3ad8418-cb77-4705-b353-4b514ceca52c-S0/frameworks/a3ad8418-cb77-4705-b353-4b514ceca52c-0000/executors/mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6/runs/24762d43-2134-475e-b724-caa72110497a/stdout > --logrotate_options=rotate 9 --logrotate_path=logrotate --max_size=2MB ' > May 13 16:58:30 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[30985]: > 5809 'mesos-logrotate-logger --help=false > --log_filename=/var/lib/mesos/slave/slaves/a3ad8418-cb77-4705-b353-4b514ceca52c-S0/frameworks/a3ad8418-cb77-4705-b353-4b514ceca52c-0000/executors/mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6/runs/24762d43-2134-475e-b724-caa72110497a/stderr > --logrotate_options=rotate 9 --logrotate_path=logrotate --max_size=2MB ' > May 13 16:58:30 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[30985]: > 5804 'mesos-logrotate-logger --help=false > --log_filename=/var/lib/mesos/slave/slaves/a3ad8418-cb77-4705-b353-4b514ceca52c-S0/frameworks/a3ad8418-cb77-4705-b353-4b514ceca52c-0000/executors/mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6/runs/24762d43-2134-475e-b724-caa72110497a/stdout > 
--logrotate_options=rotate 9 --logrotate_path=logrotate --max_size=2MB ' > May 13 16:58:30 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[30985]: > 5809 'mesos-logrotate-logger --help=false > --log_filename=/var/lib/mesos/slave/slaves/a3ad8418-cb77-4705-b353-4b514ceca52c-S0/frameworks/a3ad8418-cb77-4705-b353-4b514ceca52c-0000/executors/mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6/runs/24762d43-2134-475e-b724-caa72110497a/stderr > --logrotate_options=rotate 9 --logrotate_path=logrotate --max_size=2MB ' > May 13 16:58:30 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[30985]: > I0513 16:58:30.374567 30993 slave.cpp:5498] Recovering executor > 'mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6' of framework > a3ad8418-cb77-4705-b353-4b514ceca52c-0000 > May 13 16:58:30 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[30985]: > I0513 16:58:30.420411 30990 status_update_manager.cpp:208] Recovering > executor 'mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6' of framework > a3ad8418-cb77-4705-b353-4b514ceca52c-0000 > May 13 16:58:30 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[30985]: > I0513 16:58:30.513164 30994 containerizer.cpp:467] Recovering container > '24762d43-2134-475e-b724-caa72110497a' for executor > 'mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6' of framework > a3ad8418-cb77-4705-b353-4b514ceca52c-0000 > May 13 16:58:30 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[30985]: > I0513 16:58:30.533478 30988 mem.cpp:602] Started listening for OOM events for > container 24762d43-2134-475e-b724-caa72110497a > May 13 16:58:30 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[30985]: > I0513 16:58:30.534553 30988 mem.cpp:722] Started listening on low memory > pressure events for container 24762d43-2134-475e-b724-caa72110497a > May 13 16:58:30 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[30985]: > I0513 16:58:30.535269 30988 mem.cpp:722] Started listening on medium memory > pressure events for container 24762d43-2134-475e-b724-caa72110497a > May 13 16:58:30 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[30985]: > I0513 16:58:30.536198 30988 mem.cpp:722] Started listening on critical memory > pressure events for container 24762d43-2134-475e-b724-caa72110497a > May 13 16:58:30 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[30985]: > I0513 16:58:30.579385 30988 docker.cpp:859] Skipping recovery of executor > 'mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6' of framework > 'a3ad8418-cb77-4705-b353-4b514ceca52c-0000' because it was not launched from > docker containerizer > May 13 16:58:30 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[30985]: > I0513 16:58:30.587158 30989 slave.cpp:4527] Sending reconnect request to > executor 'mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6' of framework > a3ad8418-cb77-4705-b353-4b514ceca52c-0000 at executor(1)@10.0.2.74:46154 > May 13 16:58:30 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[30985]: > I0513 16:58:30.588287 30990 slave.cpp:2838] Re-registering executor > 'mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6' of framework > a3ad8418-cb77-4705-b353-4b514ceca52c-0000 > May 13 16:58:30 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[30985]: > I0513 16:58:30.589736 30988 disk.cpp:169] Updating the disk resources for > container 24762d43-2134-475e-b724-caa72110497a to cpus(*):0.1; mem(*):32 > May 13 16:58:30 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[30985]: > I0513 16:58:30.590117 30990 cpushare.cpp:389] Updated 'cpu.shares' to 102 > (cpus 0.1) for container 24762d43-2134-475e-b724-caa72110497a > May 13 
16:58:30 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[30985]: > I0513 16:58:30.591284 30990 cpushare.cpp:411] Updated 'cpu.cfs_period_us' to > 100ms and 'cpu.cfs_quota_us' to 10ms (cpus 0.1) for container > 24762d43-2134-475e-b724-caa72110497a > May 13 16:58:30 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[30985]: > I0513 16:58:30.595403 30992 mem.cpp:353] Updated 'memory.soft_limit_in_bytes' > to 32MB for container 24762d43-2134-475e-b724-caa72110497a > May 13 16:58:30 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[30985]: > I0513 16:58:30.596102 30992 mem.cpp:388] Updated 'memory.limit_in_bytes' to > 32MB for container 24762d43-2134-475e-b724-caa72110497a > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)