[ 
https://issues.apache.org/jira/browse/MESOS-9306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16835580#comment-16835580
 ] 

Andrei Budnik commented on MESOS-9306:
--------------------------------------

I've reproduced the timeout case for `cgroups::destroy` by adding the following 
code
{code}
// Return a future that never leaves the PENDING state, so the caller's
// cleanup chain gets stuck waiting on it.
Owned<Promise<Nothing>> promise(new Promise<Nothing>());
return promise->future();
{code}
to the beginning of the 
[destroy()|https://github.com/apache/mesos/blob/db7ce35dc155c2de7e66ec051ee0f6bcf784b4e1/src/linux/cgroups.cpp#L1548]
 function. It turns out that 
[`__destroy`|https://github.com/apache/mesos/blob/db7ce35dc155c2de7e66ec051ee0f6bcf784b4e1/src/linux/cgroups.cpp#L1590-L1602]
 is never invoked due to a missing `onDiscard` handler. We only subscribe the 
[`onAny`|https://github.com/apache/mesos/blob/db7ce35dc155c2de7e66ec051ee0f6bcf784b4e1/src/linux/cgroups.cpp#L1613]
 callback, which is never invoked after `future.discard()` is called.
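
For illustration, here is a minimal, self-contained sketch of that libprocess 
behavior (not taken from the Mesos sources; it only assumes the standard 
`Future`/`Promise` API): calling `discard()` on a pending future merely 
requests a discard, so `onAny` subscribers never fire unless the producer 
honors the request, typically from an `onDiscard` handler that discards the 
underlying promise.
{code}
#include <process/future.hpp>
#include <process/owned.hpp>

#include <stout/nothing.hpp>

using process::Future;
using process::Owned;
using process::Promise;

int main()
{
  Owned<Promise<Nothing>> promise(new Promise<Nothing>());
  Future<Nothing> future = promise->future();

  future.onAny([](const Future<Nothing>&) {
    // Does NOT run on `future.discard()` alone: the future stays PENDING
    // until the promise is set, failed, or discarded by its producer.
  });

  // The usual producer-side pattern: honor a discard request by discarding
  // the promise, which transitions the future to DISCARDED and only then
  // triggers the `onAny` callbacks. A handler along these lines is what the
  // `destroy()` path above lacks.
  future.onDiscard([=]() {
    promise->discard();
  });

  future.discard(); // Requests a discard; `onDiscard` fires, then `onAny`.

  return 0;
}
{code}
With only `onAny` subscribed, as in the current `destroy()` chain, the discard 
request is recorded but the future never completes, which matches the observed 
hang.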

The reason `cgroups::destroy` hangs for the systemd hierarchy is unknown; it 
might be related to a kernel issue.

> Mesos containerizer can get stuck during cgroup cleanup
> -------------------------------------------------------
>
>                 Key: MESOS-9306
>                 URL: https://issues.apache.org/jira/browse/MESOS-9306
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent, containerization
>    Affects Versions: 1.7.0
>            Reporter: Greg Mann
>            Assignee: Andrei Budnik
>            Priority: Critical
>              Labels: containerizer, mesosphere
>
> I observed a task group's executor container which failed to be completely 
> destroyed after its associated tasks were killed. The following is an excerpt 
> from the agent log which is filtered to include only lines with the container 
> ID, {{d463b9fe-970d-4077-bab9-558464889a9e}}:
> {code}
> 2018-10-10 14:20:50: I1010 14:20:50.204756  6799 containerizer.cpp:2963] 
> Container d463b9fe-970d-4077-bab9-558464889a9e has exited
> 2018-10-10 14:20:50: I1010 14:20:50.204839  6799 containerizer.cpp:2457] 
> Destroying container d463b9fe-970d-4077-bab9-558464889a9e in RUNNING state
> 2018-10-10 14:20:50: I1010 14:20:50.204859  6799 containerizer.cpp:3124] 
> Transitioning the state of container d463b9fe-970d-4077-bab9-558464889a9e 
> from RUNNING to DESTROYING
> 2018-10-10 14:20:50: I1010 14:20:50.204960  6799 linux_launcher.cpp:580] 
> Asked to destroy container d463b9fe-970d-4077-bab9-558464889a9e
> 2018-10-10 14:20:50: I1010 14:20:50.204993  6799 linux_launcher.cpp:622] 
> Destroying cgroup 
> '/sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e'
> 2018-10-10 14:20:50: I1010 14:20:50.205417  6806 cgroups.cpp:2838] Freezing 
> cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e/mesos
> 2018-10-10 14:20:50: I1010 14:20:50.205477  6810 cgroups.cpp:2838] Freezing 
> cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e
> 2018-10-10 14:20:50: I1010 14:20:50.205708  6808 cgroups.cpp:1229] 
> Successfully froze cgroup 
> /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e/mesos after 
> 203008ns
> 2018-10-10 14:20:50: I1010 14:20:50.205878  6800 cgroups.cpp:1229] 
> Successfully froze cgroup 
> /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e after 
> 339200ns
> 2018-10-10 14:20:50: I1010 14:20:50.206185  6799 cgroups.cpp:2856] Thawing 
> cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e/mesos
> 2018-10-10 14:20:50: I1010 14:20:50.206226  6808 cgroups.cpp:2856] Thawing 
> cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e
> 2018-10-10 14:20:50: I1010 14:20:50.206455  6808 cgroups.cpp:1258] 
> Successfully thawed cgroup 
> /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e after 
> 83968ns
> 2018-10-10 14:20:50: I1010 14:20:50.306803  6810 cgroups.cpp:1258] 
> Successfully thawed cgroup 
> /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e/mesos after 
> 100.50816ms
> 2018-10-10 14:20:50: I1010 14:20:50.307531  6805 linux_launcher.cpp:654] 
> Destroying cgroup 
> '/sys/fs/cgroup/systemd/mesos/d463b9fe-970d-4077-bab9-558464889a9e'
> 2018-10-10 14:21:40: W1010 14:21:40.032855  6809 containerizer.cpp:2401] 
> Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: 
> Container does not exist
> 2018-10-10 14:22:40: W1010 14:22:40.031224  6800 containerizer.cpp:2401] 
> Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: 
> Container does not exist
> 2018-10-10 14:23:40: W1010 14:23:40.031946  6799 containerizer.cpp:2401] 
> Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: 
> Container does not exist
> 2018-10-10 14:24:40: W1010 14:24:40.032979  6804 containerizer.cpp:2401] 
> Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: 
> Container does not exist
> 2018-10-10 14:25:40: W1010 14:25:40.030784  6808 containerizer.cpp:2401] 
> Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: 
> Container does not exist
> 2018-10-10 14:26:40: W1010 14:26:40.032526  6810 containerizer.cpp:2401] 
> Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: 
> Container does not exist
> 2018-10-10 14:27:40: W1010 14:27:40.029932  6801 containerizer.cpp:2401] 
> Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: 
> Container does not exist
> {code}
> The last log line from the containerizer's destroy path is:
> {code}
> 14:20:50.307531  6805 linux_launcher.cpp:654] Destroying cgroup 
> '/sys/fs/cgroup/systemd/mesos/d463b9fe-970d-4077-bab9-558464889a9e'
> {code}
> (that is the second such log line, from {{LinuxLauncherProcess::_destroy}})
> Then we just see
> {code}
> containerizer.cpp:2401] Skipping status for container 
> d463b9fe-970d-4077-bab9-558464889a9e because: Container does not exist
> {code}
> repeatedly, which occurs because the agent's {{GET_CONTAINERS}} call is being 
> polled once per minute. This seems to indicate that the container in question 
> is still in the agent's {{containers_}} map.
> So, it seems that the containerizer is stuck either in the Linux launcher's 
> {{destroy()}} code path or in the containerizer's {{destroy()}} code path.



