[
https://issues.apache.org/jira/browse/MESOS-9306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16821636#comment-16821636
]
Qian Zhang commented on MESOS-9306:
-----------------------------------
It seems the isolator's `cleanup` was not called for this container. In the Mesos
agent log, I see the following CNI-related messages for another container,
`2a31c63b-370c-4e59-b0f4-bb8320017c3f`:
{code:java}
2018-10-10 14:14:24: I1010 14:14:24.274730 6805 cni.cpp:960] Bind mounted
'/proc/26188/ns/net' to
'/run/mesos/isolators/network/cni/2a31c63b-370c-4e59-b0f4-bb8320017c3f/ns' for
container 2a31c63b-370c-4e59-b0f4-bb8320017c3f
2018-10-10 14:14:24: I1010 14:14:24.505472 6804 cni.cpp:1394] Got assigned IPv4
address '9.0.1.5/25' from CNI network 'dcos' for container
2a31c63b-370c-4e59-b0f4-bb8320017c3f
2018-10-10 14:14:24: I1010 14:14:24.505879 6810 cni.cpp:1100] Unable to find
DNS nameservers for container 2a31c63b-370c-4e59-b0f4-bb8320017c3f, using host
'/etc/resolv.conf'
2018-10-10 14:20:50: I1010 14:20:50.317416 6808 cni.cpp:1670] Unmounted the
network namespace handle
'/run/mesos/isolators/network/cni/2a31c63b-370c-4e59-b0f4-bb8320017c3f/ns' for
container 2a31c63b-370c-4e59-b0f4-bb8320017c3f
2018-10-10 14:20:50: I1010 14:20:50.317649 6808 cni.cpp:1682] Removed the
container directory
'/run/mesos/isolators/network/cni/2a31c63b-370c-4e59-b0f4-bb8320017c3f'
{code}
The first three messages come from the CNI isolator's `isolate` method, and the
last two come from its `cleanup` method. This is the correct behavior.
But for container `d463b9fe-970d-4077-bab9-558464889a9e`, I only see:
{code:java}
2018-10-10 14:17:16: I1010 14:17:16.536072 6804 cni.cpp:960] Bind mounted
'/proc/30145/ns/net' to
'/run/mesos/isolators/network/cni/d463b9fe-970d-4077-bab9-558464889a9e/ns' for
container d463b9fe-970d-4077-bab9-558464889a9e
2018-10-10 14:17:16: I1010 14:17:16.743510 6802 cni.cpp:1394] Got assigned IPv4
address '9.0.1.7/25' from CNI network 'dcos' for container
d463b9fe-970d-4077-bab9-558464889a9e
2018-10-10 14:17:16: I1010 14:17:16.743850 6802 cni.cpp:1100] Unable to find
DNS nameservers for container d463b9fe-970d-4077-bab9-558464889a9e, using host
'/etc/resolv.conf'
{code}
So it seems the CNI isolator's `cleanup` method was never called for this
container. The CNI isolator should be the first one called in the containerizer's
destroy code path after the Linux launcher's destroy is done, so it seems
`MesosContainerizerProcess::cleanupIsolators` was not called for this container
for some reason.
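The comparison above can be automated for a whole agent log. The following is only a rough sketch, not part of Mesos: it assumes each log message sits on a single line in the real log (the excerpts here are wrapped by the mail client), that container IDs are UUID-shaped as in these excerpts, and that the "Bind mounted" and "Removed the container directory" messages keep the wording shown above. The function name `containers_missing_cni_cleanup` is made up for illustration.

```python
import re

# Assumption: container IDs look like the UUIDs in the excerpts above.
CONTAINER_ID = r"[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}"

# Message seen from the CNI isolator's `isolate` path in the excerpts.
ISOLATE_RE = re.compile(
    r"cni\.cpp:\d+\] Bind mounted .* for container (" + CONTAINER_ID + r")"
)

# Message seen from the CNI isolator's `cleanup` path in the excerpts.
CLEANUP_RE = re.compile(
    r"cni\.cpp:\d+\] Removed the container directory '.*/(" + CONTAINER_ID + r")'"
)


def containers_missing_cni_cleanup(lines):
    """Return IDs that have an `isolate` message but no `cleanup` message.

    `lines` is any iterable of log lines, one message per line.
    IDs are returned in the order their `isolate` message first appears.
    """
    isolated = []    # IDs in first-seen order
    cleaned = set()  # IDs whose container directory was removed

    for line in lines:
        match = ISOLATE_RE.search(line)
        if match:
            if match.group(1) not in isolated:
                isolated.append(match.group(1))
            continue

        match = CLEANUP_RE.search(line)
        if match:
            cleaned.add(match.group(1))

    return [cid for cid in isolated if cid not in cleaned]
```

Passing an open log file to the function (for example `containers_missing_cni_cleanup(open(path))`, where `path` is wherever the agent log lives) would list every container in the same state as `d463b9fe-970d-4077-bab9-558464889a9e`. A container that is still running also shows up, so the result needs to be cross-checked against the containerizer's "Destroying container" messages.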
> Mesos containerizer can get stuck during cgroup cleanup
> -------------------------------------------------------
>
> Key: MESOS-9306
> URL: https://issues.apache.org/jira/browse/MESOS-9306
> Project: Mesos
> Issue Type: Bug
> Components: agent, containerization
> Affects Versions: 1.7.0
> Reporter: Greg Mann
> Priority: Critical
> Labels: containerizer, mesosphere
>
> I observed a task group's executor container which failed to be completely
> destroyed after its associated tasks were killed. The following is an excerpt
> from the agent log which is filtered to include only lines with the container
> ID, {{d463b9fe-970d-4077-bab9-558464889a9e}}:
> {code}
> 2018-10-10 14:20:50: I1010 14:20:50.204756 6799 containerizer.cpp:2963]
> Container d463b9fe-970d-4077-bab9-558464889a9e has exited
> 2018-10-10 14:20:50: I1010 14:20:50.204839 6799 containerizer.cpp:2457]
> Destroying container d463b9fe-970d-4077-bab9-558464889a9e in RUNNING state
> 2018-10-10 14:20:50: I1010 14:20:50.204859 6799 containerizer.cpp:3124]
> Transitioning the state of container d463b9fe-970d-4077-bab9-558464889a9e
> from RUNNING to DESTROYING
> 2018-10-10 14:20:50: I1010 14:20:50.204960 6799 linux_launcher.cpp:580]
> Asked to destroy container d463b9fe-970d-4077-bab9-558464889a9e
> 2018-10-10 14:20:50: I1010 14:20:50.204993 6799 linux_launcher.cpp:622]
> Destroying cgroup
> '/sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e'
> 2018-10-10 14:20:50: I1010 14:20:50.205417 6806 cgroups.cpp:2838] Freezing
> cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e/mesos
> 2018-10-10 14:20:50: I1010 14:20:50.205477 6810 cgroups.cpp:2838] Freezing
> cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e
> 2018-10-10 14:20:50: I1010 14:20:50.205708 6808 cgroups.cpp:1229]
> Successfully froze cgroup
> /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e/mesos after
> 203008ns
> 2018-10-10 14:20:50: I1010 14:20:50.205878 6800 cgroups.cpp:1229]
> Successfully froze cgroup
> /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e after
> 339200ns
> 2018-10-10 14:20:50: I1010 14:20:50.206185 6799 cgroups.cpp:2856] Thawing
> cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e/mesos
> 2018-10-10 14:20:50: I1010 14:20:50.206226 6808 cgroups.cpp:2856] Thawing
> cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e
> 2018-10-10 14:20:50: I1010 14:20:50.206455 6808 cgroups.cpp:1258]
> Successfully thawed cgroup
> /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e after
> 83968ns
> 2018-10-10 14:20:50: I1010 14:20:50.306803 6810 cgroups.cpp:1258]
> Successfully thawed cgroup
> /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e/mesos after
> 100.50816ms
> 2018-10-10 14:20:50: I1010 14:20:50.307531 6805 linux_launcher.cpp:654]
> Destroying cgroup
> '/sys/fs/cgroup/systemd/mesos/d463b9fe-970d-4077-bab9-558464889a9e'
> 2018-10-10 14:21:40: W1010 14:21:40.032855 6809 containerizer.cpp:2401]
> Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because:
> Container does not exist
> 2018-10-10 14:22:40: W1010 14:22:40.031224 6800 containerizer.cpp:2401]
> Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because:
> Container does not exist
> 2018-10-10 14:23:40: W1010 14:23:40.031946 6799 containerizer.cpp:2401]
> Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because:
> Container does not exist
> 2018-10-10 14:24:40: W1010 14:24:40.032979 6804 containerizer.cpp:2401]
> Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because:
> Container does not exist
> 2018-10-10 14:25:40: W1010 14:25:40.030784 6808 containerizer.cpp:2401]
> Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because:
> Container does not exist
> 2018-10-10 14:26:40: W1010 14:26:40.032526 6810 containerizer.cpp:2401]
> Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because:
> Container does not exist
> 2018-10-10 14:27:40: W1010 14:27:40.029932 6801 containerizer.cpp:2401]
> Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because:
> Container does not exist
> {code}
> The last log line from the containerizer's destroy path is:
> {code}
> 14:20:50.307531 6805 linux_launcher.cpp:654] Destroying cgroup
> '/sys/fs/cgroup/systemd/mesos/d463b9fe-970d-4077-bab9-558464889a9e'
> {code}
> (that is the second such log line, from {{LinuxLauncherProcess::_destroy}})
> Then we just see
> {code}
> containerizer.cpp:2401] Skipping status for container
> d463b9fe-970d-4077-bab9-558464889a9e because: Container does not exist
> {code}
> repeatedly, which occurs because the agent's {{GET_CONTAINERS}} call is being
> polled once per minute. This seems to indicate that the container in question
> is still in the agent's {{containers_}} map.
> So, it seems that the containerizer is stuck either in the Linux launcher's
> {{destroy()}} code path, or the containerizer's {{destroy()}} code path.