Greg Mann created MESOS-9709:
--------------------------------
Summary: Docker executor can become stuck terminating
Key: MESOS-9709
URL: https://issues.apache.org/jira/browse/MESOS-9709
Project: Mesos
Issue Type: Bug
Affects Versions: 1.8.0
Reporter: Greg Mann
Attachments: docker-executor-stuck.txt
See the attached agent log. The executor container ID is
{{d2bfec33-f6bd-44ee-9345-b5710780bb59}}, and the executor ID contains the
string {{819f7ef7-4f42-11e9-a566-72ec67496045}}.
After the agent launches the executor container, we see:
{code}
Mar 29 18:23:36 int-mountvolumeagent9-soak113s.testing.mesosphe.re mesos-agent[10238]: I0329 18:23:36.967316 10257 slave.cpp:3550] Launching container d2bfec33-f6bd-44ee-9345-b5710780bb59 for executor 'datastax-dse.instance-819f7ef7-4f42-11e9-a566-72ec67496045._app.339' of framework a221eeb3-b9c0-4e92-ae20-1e1d4af25321-0000
Mar 29 18:23:36 int-mountvolumeagent9-soak113s.testing.mesosphe.re mesos-agent[10238]: I0329 18:23:36.968968 10253 docker.cpp:1161] No container info found, skipping launch
{code}
It is not clear why the container info was not set. Once the executor
reregistration timeout elapses, the agent attempts to terminate the executor,
but the termination never appears to complete. The scheduler keeps trying to
kill the task, but we repeatedly see:
{code}
Mar 29 18:35:19 int-mountvolumeagent9-soak113s.testing.mesosphe.re mesos-agent[10238]: W0329 18:35:19.855063 10253 slave.cpp:3823] Ignoring kill task datastax-dse.instance-819f7ef7-4f42-11e9-a566-72ec67496045._app.339 because the executor 'datastax-dse.instance-819f7ef7-4f42-11e9-a566-72ec67496045._app.339' of framework a221eeb3-b9c0-4e92-ae20-1e1d4af25321-0000 is terminating
{code}