[ https://issues.apache.org/jira/browse/MESOS-5380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jie Yu updated MESOS-5380: -------------------------- Fix Version/s: 0.28.2 > Killing a queued task can cause the corresponding command executor to never > terminate. > -------------------------------------------------------------------------------------- > > Key: MESOS-5380 > URL: https://issues.apache.org/jira/browse/MESOS-5380 > Project: Mesos > Issue Type: Bug > Components: slave > Affects Versions: 0.28.0, 0.28.1 > Reporter: Jie Yu > Assignee: Vinod Kone > Priority: Blocker > Labels: mesosphere > Fix For: 1.0.0, 0.28.2 > > > We observed this in our testing environment. Sequence of events: > 1) A command task is queued since the executor has not registered yet. > 2) The framework issues a killTask. > 3) Since executor is in REGISTERING state, agent calls > `statusUpdate(TASK_KILLED, UPID())` > 4) `statusUpdate` now will call `containerizer->status()` before calling > `executor->terminateTask(status.task_id(), status);` which will remove the > queued task. (Introduced in this patch: https://reviews.apache.org/r/43258). > 5) Since the above is async, it's possible that the task is still in queued > task when we trying to see if we need to kill unregistered executor in > `killTask`: > {code} > // TODO(jieyu): Here, we kill the executor if it no longer has > // any task to run and has not yet registered. This is a > // workaround for those single task executors that do not have a > // proper self terminating logic when they haven't received the > // task within a timeout. > if (executor->queuedTasks.empty()) { > CHECK(executor->launchedTasks.empty()) > << " Unregistered executor '" << executor->id > << "' has launched tasks"; > LOG(WARNING) << "Killing the unregistered executor " << *executor > << " because it has no tasks"; > executor->state = Executor::TERMINATING; > containerizer->destroy(executor->containerId); > } > {code} > 6) Consequently, the executor will never be terminated by Mesos. > Attaching the relevant agent log: > {noformat} > May 13 15:36:13 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: > I0513 15:36:13.640527 1342 slave.cpp:1361] Got assigned task > mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6 for framework > a3ad8418-cb77-4705-b353-4b514ceca52c-0000 > May 13 15:36:13 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: > I0513 15:36:13.641034 1342 slave.cpp:1480] Launching task > mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6 for framework > a3ad8418-cb77-4705-b353-4b514ceca52c-0000 > May 13 15:36:13 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: > I0513 15:36:13.641440 1342 paths.cpp:528] Trying to chown > '/var/lib/mesos/slave/slaves/a3ad8418-cb77-4705-b353-4b514ceca52c-S0/frameworks/a3ad8418-cb77-4705-b353-4b514ceca52c-0000/executors/mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6/runs/24762d43-2134-475e-b724-caa72110497a' > to user 'root' > May 13 15:36:13 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: > I0513 15:36:13.644664 1342 slave.cpp:5389] Launching executor > mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6 of framework > a3ad8418-cb77-4705-b353-4b514ceca52c-0000 with resources cpus(*):0.1; > mem(*):32 in work directory > '/var/lib/mesos/slave/slaves/a3ad8418-cb77-4705-b353-4b514ceca52c-S0/frameworks/a3ad8418-cb77-4705-b353-4b514ceca52c-0000/executors/mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6/runs/24762d43-2134-475e-b724-caa72110497a' > May 13 15:36:13 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: > I0513 15:36:13.645195 1342 slave.cpp:1698] Queuing task > 'mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6' for executor > 'mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6' of framework > a3ad8418-cb77-4705-b353-4b514ceca52c-0000 > May 13 15:36:13 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: > I0513 15:36:13.645491 1338 containerizer.cpp:671] Starting container > '24762d43-2134-475e-b724-caa72110497a' for executor > 'mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6' of framework > 'a3ad8418-cb77-4705-b353-4b514ceca52c-0000' > May 13 15:36:13 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: > I0513 15:36:13.647897 1345 cpushare.cpp:389] Updated 'cpu.shares' to 1126 > (cpus 1.1) for container 24762d43-2134-475e-b724-caa72110497a > May 13 15:36:13 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: > I0513 15:36:13.648619 1345 cpushare.cpp:411] Updated 'cpu.cfs_period_us' to > 100ms and 'cpu.cfs_quota_us' to 110ms (cpus 1.1) for container > 24762d43-2134-475e-b724-caa72110497a > May 13 15:36:13 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: > I0513 15:36:13.650180 1341 mem.cpp:602] Started listening for OOM events for > container 24762d43-2134-475e-b724-caa72110497a > May 13 15:36:13 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: > I0513 15:36:13.650718 1341 mem.cpp:722] Started listening on low memory > pressure events for container 24762d43-2134-475e-b724-caa72110497a > May 13 15:36:13 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: > I0513 15:36:13.651147 1341 mem.cpp:722] Started listening on medium memory > pressure events for container 24762d43-2134-475e-b724-caa72110497a > May 13 15:36:13 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: > I0513 15:36:13.651599 1341 mem.cpp:722] Started listening on critical memory > pressure events for container 24762d43-2134-475e-b724-caa72110497a > May 13 15:36:13 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: > I0513 15:36:13.652015 1341 mem.cpp:353] Updated 'memory.soft_limit_in_bytes' > to 160MB for container 24762d43-2134-475e-b724-caa72110497a > May 13 15:36:13 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: > I0513 15:36:13.652719 1341 mem.cpp:388] Updated 'memory.limit_in_bytes' to > 160MB for container 24762d43-2134-475e-b724-caa72110497a > May 13 15:36:25 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: > I0513 15:36:25.508930 1342 slave.cpp:1891] Asked to kill task > mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6 of framework > a3ad8418-cb77-4705-b353-4b514ceca52c-0000 > May 13 15:36:25 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: > I0513 15:36:25.509063 1342 slave.cpp:3048] Handling status update > TASK_KILLED (UUID: f9d15955-6c9a-4a73-98c3-97c0128510ba) for task > mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6 of framework > a3ad8418-cb77-4705-b353-4b514ceca52c-0000 from @0.0.0.0:0 > May 13 15:36:25 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: > I0513 15:36:25.509702 1340 disk.cpp:169] Updating the disk resources for > container 24762d43-2134-475e-b724-caa72110497a to cpus(*):0.1; mem(*):32 > May 13 15:36:25 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: > I0513 15:36:25.510298 1343 mem.cpp:353] Updated 'memory.soft_limit_in_bytes' > to 32MB for container 24762d43-2134-475e-b724-caa72110497a > May 13 15:36:25 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: > I0513 15:36:25.510349 1341 cpushare.cpp:389] Updated 'cpu.shares' to 102 > (cpus 0.1) for container 24762d43-2134-475e-b724-caa72110497a > May 13 15:36:25 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: > I0513 15:36:25.511102 1343 mem.cpp:388] Updated 'memory.limit_in_bytes' to > 32MB for container 24762d43-2134-475e-b724-caa72110497a > May 13 15:36:25 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: > I0513 15:36:25.511495 1341 cpushare.cpp:411] Updated 'cpu.cfs_period_us' to > 100ms and 'cpu.cfs_quota_us' to 10ms (cpus 0.1) for container > 24762d43-2134-475e-b724-caa72110497a > May 13 15:36:25 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: > I0513 15:36:25.511715 1341 status_update_manager.cpp:320] Received status > update TASK_KILLED (UUID: f9d15955-6c9a-4a73-98c3-97c0128510ba) for task > mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6 of framework > a3ad8418-cb77-4705-b353-4b514ceca52c-0000 > May 13 15:36:25 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: > I0513 15:36:25.512032 1341 status_update_manager.cpp:824] Checkpointing > UPDATE for status update TASK_KILLED (UUID: > f9d15955-6c9a-4a73-98c3-97c0128510ba) for task > mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6 of framework > a3ad8418-cb77-4705-b353-4b514ceca52c-0000 > May 13 15:36:25 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: > I0513 15:36:25.513849 1343 slave.cpp:3446] Forwarding the update TASK_KILLED > (UUID: f9d15955-6c9a-4a73-98c3-97c0128510ba) for task > mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6 of framework > a3ad8418-cb77-4705-b353-4b514ceca52c-0000 to master@10.0.5.79:5050 > May 13 15:36:25 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: > I0513 15:36:25.528929 1344 status_update_manager.cpp:392] Received status > update acknowledgement (UUID: f9d15955-6c9a-4a73-98c3-97c0128510ba) for task > mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6 of framework > a3ad8418-cb77-4705-b353-4b514ceca52c-0000 > May 13 15:36:25 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: > I0513 15:36:25.529002 1344 status_update_manager.cpp:824] Checkpointing ACK > for status update TASK_KILLED (UUID: f9d15955-6c9a-4a73-98c3-97c0128510ba) > for task mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6 of framework > a3ad8418-cb77-4705-b353-4b514ceca52c-0000 > May 13 15:36:28 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: > I0513 15:36:28.199105 1345 isolator.cpp:469] Mounting docker volume mount > point '//var/lib/rexray/volumes/jdef-test-125/data' to > '/var/lib/mesos/slave/slaves/a3ad8418-cb77-4705-b353-4b514ceca52c-S0/frameworks/a3ad8418-cb77-4705-b353-4b514ceca52c-0000/executors/mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6/runs/24762d43-2134-475e-b724-caa72110497a/data' > for container 24762d43-2134-475e-b724-caa72110497a > May 13 15:36:28 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: > I0513 15:36:28.207062 1338 containerizer.cpp:1184] Checkpointing executor's > forked pid 5810 to > '/var/lib/mesos/slave/meta/slaves/a3ad8418-cb77-4705-b353-4b514ceca52c-S0/frameworks/a3ad8418-cb77-4705-b353-4b514ceca52c-0000/executors/mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6/runs/24762d43-2134-475e-b724-caa72110497a/pids/forked.pid' > May 13 15:36:28 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: > I0513 15:36:28.832330 1338 slave.cpp:2689] Got registration for executor > 'mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6' of framework > a3ad8418-cb77-4705-b353-4b514ceca52c-0000 from executor(1)@10.0.2.74:46154 > May 13 15:36:28 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: > I0513 15:36:28.833149 1345 disk.cpp:169] Updating the disk resources for > container 24762d43-2134-475e-b724-caa72110497a to cpus(*):0.1; mem(*):32 > May 13 15:36:28 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: > I0513 15:36:28.833804 1342 mem.cpp:353] Updated 'memory.soft_limit_in_bytes' > to 32MB for container 24762d43-2134-475e-b724-caa72110497a > May 13 15:36:28 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: > I0513 15:36:28.833871 1340 cpushare.cpp:389] Updated 'cpu.shares' to 102 > (cpus 0.1) for container 24762d43-2134-475e-b724-caa72110497a > May 13 15:36:28 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: > I0513 15:36:28.835160 1340 cpushare.cpp:411] Updated 'cpu.cfs_period_us' to > 100ms and 'cpu.cfs_quota_us' to 10ms (cpus 0.1) for container > 24762d43-2134-475e-b724-caa72110497a > May 13 16:58:30 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[30985]: > 5804 'mesos-logrotate-logger --help=false > --log_filename=/var/lib/mesos/slave/slaves/a3ad8418-cb77-4705-b353-4b514ceca52c-S0/frameworks/a3ad8418-cb77-4705-b353-4b514ceca52c-0000/executors/mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6/runs/24762d43-2134-475e-b724-caa72110497a/stdout > --logrotate_options=rotate 9 --logrotate_path=logrotate --max_size=2MB ' > May 13 16:58:30 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[30985]: > 5809 'mesos-logrotate-logger --help=false > --log_filename=/var/lib/mesos/slave/slaves/a3ad8418-cb77-4705-b353-4b514ceca52c-S0/frameworks/a3ad8418-cb77-4705-b353-4b514ceca52c-0000/executors/mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6/runs/24762d43-2134-475e-b724-caa72110497a/stderr > --logrotate_options=rotate 9 --logrotate_path=logrotate --max_size=2MB ' > May 13 16:58:30 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[30985]: > 5804 'mesos-logrotate-logger --help=false > --log_filename=/var/lib/mesos/slave/slaves/a3ad8418-cb77-4705-b353-4b514ceca52c-S0/frameworks/a3ad8418-cb77-4705-b353-4b514ceca52c-0000/executors/mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6/runs/24762d43-2134-475e-b724-caa72110497a/stdout > --logrotate_options=rotate 9 --logrotate_path=logrotate --max_size=2MB ' > May 13 16:58:30 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[30985]: > 5809 'mesos-logrotate-logger --help=false > --log_filename=/var/lib/mesos/slave/slaves/a3ad8418-cb77-4705-b353-4b514ceca52c-S0/frameworks/a3ad8418-cb77-4705-b353-4b514ceca52c-0000/executors/mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6/runs/24762d43-2134-475e-b724-caa72110497a/stderr > --logrotate_options=rotate 9 --logrotate_path=logrotate --max_size=2MB ' > May 13 16:58:30 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[30985]: > I0513 16:58:30.374567 30993 slave.cpp:5498] Recovering executor > 'mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6' of framework > a3ad8418-cb77-4705-b353-4b514ceca52c-0000 > May 13 16:58:30 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[30985]: > I0513 16:58:30.420411 30990 status_update_manager.cpp:208] Recovering > executor 'mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6' of framework > a3ad8418-cb77-4705-b353-4b514ceca52c-0000 > May 13 16:58:30 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[30985]: > I0513 16:58:30.513164 30994 containerizer.cpp:467] Recovering container > '24762d43-2134-475e-b724-caa72110497a' for executor > 'mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6' of framework > a3ad8418-cb77-4705-b353-4b514ceca52c-0000 > May 13 16:58:30 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[30985]: > I0513 16:58:30.533478 30988 mem.cpp:602] Started listening for OOM events for > container 24762d43-2134-475e-b724-caa72110497a > May 13 16:58:30 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[30985]: > I0513 16:58:30.534553 30988 mem.cpp:722] Started listening on low memory > pressure events for container 24762d43-2134-475e-b724-caa72110497a > May 13 16:58:30 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[30985]: > I0513 16:58:30.535269 30988 mem.cpp:722] Started listening on medium memory > pressure events for container 24762d43-2134-475e-b724-caa72110497a > May 13 16:58:30 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[30985]: > I0513 16:58:30.536198 30988 mem.cpp:722] Started listening on critical memory > pressure events for container 24762d43-2134-475e-b724-caa72110497a > May 13 16:58:30 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[30985]: > I0513 16:58:30.579385 30988 docker.cpp:859] Skipping recovery of executor > 'mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6' of framework > 'a3ad8418-cb77-4705-b353-4b514ceca52c-0000' because it was not launched from > docker containerizer > May 13 16:58:30 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[30985]: > I0513 16:58:30.587158 30989 slave.cpp:4527] Sending reconnect request to > executor 'mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6' of framework > a3ad8418-cb77-4705-b353-4b514ceca52c-0000 at executor(1)@10.0.2.74:46154 > May 13 16:58:30 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[30985]: > I0513 16:58:30.588287 30990 slave.cpp:2838] Re-registering executor > 'mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6' of framework > a3ad8418-cb77-4705-b353-4b514ceca52c-0000 > May 13 16:58:30 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[30985]: > I0513 16:58:30.589736 30988 disk.cpp:169] Updating the disk resources for > container 24762d43-2134-475e-b724-caa72110497a to cpus(*):0.1; mem(*):32 > May 13 16:58:30 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[30985]: > I0513 16:58:30.590117 30990 cpushare.cpp:389] Updated 'cpu.shares' to 102 > (cpus 0.1) for container 24762d43-2134-475e-b724-caa72110497a > May 13 16:58:30 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[30985]: > I0513 16:58:30.591284 30990 cpushare.cpp:411] Updated 'cpu.cfs_period_us' to > 100ms and 'cpu.cfs_quota_us' to 10ms (cpus 0.1) for container > 24762d43-2134-475e-b724-caa72110497a > May 13 16:58:30 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[30985]: > I0513 16:58:30.595403 30992 mem.cpp:353] Updated 'memory.soft_limit_in_bytes' > to 32MB for container 24762d43-2134-475e-b724-caa72110497a > May 13 16:58:30 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[30985]: > I0513 16:58:30.596102 30992 mem.cpp:388] Updated 'memory.limit_in_bytes' to > 32MB for container 24762d43-2134-475e-b724-caa72110497a > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)