Greg Mann created MESOS-9306:
--------------------------------

             Summary: Mesos containerizer can get stuck during cgroup cleanup
                 Key: MESOS-9306
                 URL: https://issues.apache.org/jira/browse/MESOS-9306
             Project: Mesos
          Issue Type: Bug
          Components: agent, containerization
    Affects Versions: 1.7.0
            Reporter: Greg Mann


I observed a task group's executor container that was not completely destroyed 
after its associated tasks were killed. The following excerpt from the agent 
log is filtered to include only lines containing the container ID, 
{{d463b9fe-970d-4077-bab9-558464889a9e}}:
{code}
2018-10-10 14:20:50: I1010 14:20:50.204756  6799 containerizer.cpp:2963] Container d463b9fe-970d-4077-bab9-558464889a9e has exited
2018-10-10 14:20:50: I1010 14:20:50.204839  6799 containerizer.cpp:2457] Destroying container d463b9fe-970d-4077-bab9-558464889a9e in RUNNING state
2018-10-10 14:20:50: I1010 14:20:50.204859  6799 containerizer.cpp:3124] Transitioning the state of container d463b9fe-970d-4077-bab9-558464889a9e from RUNNING to DESTROYING
2018-10-10 14:20:50: I1010 14:20:50.204960  6799 linux_launcher.cpp:580] Asked to destroy container d463b9fe-970d-4077-bab9-558464889a9e
2018-10-10 14:20:50: I1010 14:20:50.204993  6799 linux_launcher.cpp:622] Destroying cgroup '/sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e'
2018-10-10 14:20:50: I1010 14:20:50.205417  6806 cgroups.cpp:2838] Freezing cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e/mesos
2018-10-10 14:20:50: I1010 14:20:50.205477  6810 cgroups.cpp:2838] Freezing cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e
2018-10-10 14:20:50: I1010 14:20:50.205708  6808 cgroups.cpp:1229] Successfully froze cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e/mesos after 203008ns
2018-10-10 14:20:50: I1010 14:20:50.205878  6800 cgroups.cpp:1229] Successfully froze cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e after 339200ns
2018-10-10 14:20:50: I1010 14:20:50.206185  6799 cgroups.cpp:2856] Thawing cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e/mesos
2018-10-10 14:20:50: I1010 14:20:50.206226  6808 cgroups.cpp:2856] Thawing cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e
2018-10-10 14:20:50: I1010 14:20:50.206455  6808 cgroups.cpp:1258] Successfully thawed cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e after 83968ns
2018-10-10 14:20:50: I1010 14:20:50.306803  6810 cgroups.cpp:1258] Successfully thawed cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e/mesos after 100.50816ms
2018-10-10 14:20:50: I1010 14:20:50.307531  6805 linux_launcher.cpp:654] Destroying cgroup '/sys/fs/cgroup/systemd/mesos/d463b9fe-970d-4077-bab9-558464889a9e'
2018-10-10 14:21:40: W1010 14:21:40.032855  6809 containerizer.cpp:2401] Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: Container does not exist
2018-10-10 14:22:40: W1010 14:22:40.031224  6800 containerizer.cpp:2401] Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: Container does not exist
2018-10-10 14:23:40: W1010 14:23:40.031946  6799 containerizer.cpp:2401] Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: Container does not exist
2018-10-10 14:24:40: W1010 14:24:40.032979  6804 containerizer.cpp:2401] Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: Container does not exist
2018-10-10 14:25:40: W1010 14:25:40.030784  6808 containerizer.cpp:2401] Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: Container does not exist
2018-10-10 14:26:40: W1010 14:26:40.032526  6810 containerizer.cpp:2401] Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: Container does not exist
2018-10-10 14:27:40: W1010 14:27:40.029932  6801 containerizer.cpp:2401] Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: Container does not exist
{code}
The last log line from the containerizer's destroy path is:
{code}
14:20:50.307531  6805 linux_launcher.cpp:654] Destroying cgroup '/sys/fs/cgroup/systemd/mesos/d463b9fe-970d-4077-bab9-558464889a9e'
{code}
(This is the second such log line, emitted from 
{{LinuxLauncherProcess::_destroy}}.) After that, we see only
{code}
containerizer.cpp:2401] Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: Container does not exist
{code}
repeatedly, because the agent's {{GET_CONTAINERS}} call is being polled once 
per minute. This suggests that the container in question is still in the 
agent's {{containers_}} map.
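
The two observations above (the last destroy-path line and the one-minute 
cadence of the status warnings) can be checked mechanically. The following is 
a minimal Python sketch, not part of Mesos: the helper names {{parse}} and 
{{summarize}} are hypothetical, the regular expression only assumes the glog 
prefix as it appears in this excerpt, and the input is assumed to have one log 
entry per line, all from the same day.

```python
import re
from datetime import datetime

# glog prefix as seen in the excerpt: severity letter + MMDD, time with
# microseconds, thread id, "file:line]", then the message.
GLOG = re.compile(
    r"(?P<sev>[IWEF])(?P<mmdd>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<tid>\d+) (?P<src>[\w.]+:\d+)\] (?P<msg>.*)"
)

CONTAINER_ID = "d463b9fe-970d-4077-bab9-558464889a9e"

# Three entries copied from the excerpt above, each joined onto one line.
SAMPLE = [
    "2018-10-10 14:20:50: I1010 14:20:50.307531  6805 linux_launcher.cpp:654] "
    "Destroying cgroup '/sys/fs/cgroup/systemd/mesos/" + CONTAINER_ID + "'",
    "2018-10-10 14:21:40: W1010 14:21:40.032855  6809 containerizer.cpp:2401] "
    "Skipping status for container " + CONTAINER_ID
    + " because: Container does not exist",
    "2018-10-10 14:22:40: W1010 14:22:40.031224  6800 containerizer.cpp:2401] "
    "Skipping status for container " + CONTAINER_ID
    + " because: Container does not exist",
]


def parse(line):
    """Return a dict for one glog entry, or None if the line does not match."""
    m = GLOG.search(line)
    if m is None:
        return None
    # Only the time of day is parsed, so entries must not cross midnight.
    ts = datetime.strptime(m["time"], "%H:%M:%S.%f")
    return {"sev": m["sev"], "ts": ts, "src": m["src"], "msg": m["msg"]}


def summarize(lines, container_id):
    """Return (last non-status entry, gaps in seconds between status warnings)."""
    entries = [e for e in map(parse, lines) if e and container_id in e["msg"]]
    destroy = [e for e in entries if "Skipping status" not in e["msg"]]
    polls = [e for e in entries if "Skipping status" in e["msg"]]
    gaps = [
        round((b["ts"] - a["ts"]).total_seconds())
        for a, b in zip(polls, polls[1:])
    ]
    return (destroy[-1] if destroy else None), gaps


if __name__ == "__main__":
    last, gaps = summarize(SAMPLE, CONTAINER_ID)
    print(last["src"], gaps)
```

Run against the full excerpt, the same function should report the 
{{linux_launcher.cpp:654}} entry as the last line that is not a status 
warning, with the warnings roughly 60 seconds apart.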

So it seems that container destruction is stuck either in the Linux launcher's 
{{destroy()}} code path or in the containerizer's {{destroy()}} code path.
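
One way to narrow down which step hangs is to put a timeout around each stage 
of the chain and report the first one that does not finish. The sketch below 
illustrates the idea with Python's {{asyncio}}. It is an analogy only, not 
Mesos code (Mesos is written in C++). The stage names and the choice of which 
stage hangs are assumptions made for illustration; the log above does not 
establish which step is actually stuck.

```python
import asyncio


async def run_stages(stages, timeout):
    """Run stages in order; return (completed names, name of stuck stage or None)."""
    completed = []
    for name, make_coro in stages:
        try:
            # wait_for cancels the stage and raises if it exceeds the timeout.
            await asyncio.wait_for(make_coro(), timeout)
        except asyncio.TimeoutError:
            return completed, name
        completed.append(name)
    return completed, None


async def finishes():
    # Stands in for a stage that completes promptly.
    await asyncio.sleep(0)


async def hangs():
    # Stands in for a stage whose completion is never signalled.
    await asyncio.Event().wait()


def demo():
    # Hypothetical stage names; which one hangs is chosen arbitrarily here.
    stages = [
        ("freezer cgroup destroy", finishes),
        ("systemd cgroup destroy", hangs),
        ("containerizer cleanup", finishes),
    ]
    return asyncio.run(run_stages(stages, timeout=0.05))


if __name__ == "__main__":
    completed, stuck = demo()
    print("completed:", completed)
    print("stuck at:", stuck)
```

Without the per-stage timeout, the chain would simply never complete and 
nothing would be logged after the stage that hangs, which matches the 
symptom described here.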



