[
https://issues.apache.org/jira/browse/MESOS-9306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Gilbert Song reassigned MESOS-9306:
-----------------------------------
Assignee: Andrei Budnik
> Mesos containerizer can get stuck during cgroup cleanup
> -------------------------------------------------------
>
> Key: MESOS-9306
> URL: https://issues.apache.org/jira/browse/MESOS-9306
> Project: Mesos
> Issue Type: Bug
> Components: agent, containerization
> Affects Versions: 1.7.0
> Reporter: Greg Mann
> Assignee: Andrei Budnik
> Priority: Critical
> Labels: containerizer, mesosphere
>
> I observed a task group's executor container that was not completely
> destroyed after its associated tasks were killed. The following excerpt from
> the agent log is filtered to include only lines containing the container ID,
> {{d463b9fe-970d-4077-bab9-558464889a9e}}:
> {code}
> 2018-10-10 14:20:50: I1010 14:20:50.204756 6799 containerizer.cpp:2963]
> Container d463b9fe-970d-4077-bab9-558464889a9e has exited
> 2018-10-10 14:20:50: I1010 14:20:50.204839 6799 containerizer.cpp:2457]
> Destroying container d463b9fe-970d-4077-bab9-558464889a9e in RUNNING state
> 2018-10-10 14:20:50: I1010 14:20:50.204859 6799 containerizer.cpp:3124]
> Transitioning the state of container d463b9fe-970d-4077-bab9-558464889a9e
> from RUNNING to DESTROYING
> 2018-10-10 14:20:50: I1010 14:20:50.204960 6799 linux_launcher.cpp:580]
> Asked to destroy container d463b9fe-970d-4077-bab9-558464889a9e
> 2018-10-10 14:20:50: I1010 14:20:50.204993 6799 linux_launcher.cpp:622]
> Destroying cgroup
> '/sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e'
> 2018-10-10 14:20:50: I1010 14:20:50.205417 6806 cgroups.cpp:2838] Freezing
> cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e/mesos
> 2018-10-10 14:20:50: I1010 14:20:50.205477 6810 cgroups.cpp:2838] Freezing
> cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e
> 2018-10-10 14:20:50: I1010 14:20:50.205708 6808 cgroups.cpp:1229]
> Successfully froze cgroup
> /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e/mesos after
> 203008ns
> 2018-10-10 14:20:50: I1010 14:20:50.205878 6800 cgroups.cpp:1229]
> Successfully froze cgroup
> /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e after
> 339200ns
> 2018-10-10 14:20:50: I1010 14:20:50.206185 6799 cgroups.cpp:2856] Thawing
> cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e/mesos
> 2018-10-10 14:20:50: I1010 14:20:50.206226 6808 cgroups.cpp:2856] Thawing
> cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e
> 2018-10-10 14:20:50: I1010 14:20:50.206455 6808 cgroups.cpp:1258]
> Successfully thawed cgroup
> /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e after
> 83968ns
> 2018-10-10 14:20:50: I1010 14:20:50.306803 6810 cgroups.cpp:1258]
> Successfully thawed cgroup
> /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e/mesos after
> 100.50816ms
> 2018-10-10 14:20:50: I1010 14:20:50.307531 6805 linux_launcher.cpp:654]
> Destroying cgroup
> '/sys/fs/cgroup/systemd/mesos/d463b9fe-970d-4077-bab9-558464889a9e'
> 2018-10-10 14:21:40: W1010 14:21:40.032855 6809 containerizer.cpp:2401]
> Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because:
> Container does not exist
> 2018-10-10 14:22:40: W1010 14:22:40.031224 6800 containerizer.cpp:2401]
> Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because:
> Container does not exist
> 2018-10-10 14:23:40: W1010 14:23:40.031946 6799 containerizer.cpp:2401]
> Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because:
> Container does not exist
> 2018-10-10 14:24:40: W1010 14:24:40.032979 6804 containerizer.cpp:2401]
> Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because:
> Container does not exist
> 2018-10-10 14:25:40: W1010 14:25:40.030784 6808 containerizer.cpp:2401]
> Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because:
> Container does not exist
> 2018-10-10 14:26:40: W1010 14:26:40.032526 6810 containerizer.cpp:2401]
> Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because:
> Container does not exist
> 2018-10-10 14:27:40: W1010 14:27:40.029932 6801 containerizer.cpp:2401]
> Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because:
> Container does not exist
> {code}
> The last log line from the containerizer's destroy path is:
> {code}
> 14:20:50.307531 6805 linux_launcher.cpp:654] Destroying cgroup
> '/sys/fs/cgroup/systemd/mesos/d463b9fe-970d-4077-bab9-558464889a9e'
> {code}
> (This is the second such log line; it comes from
> {{LinuxLauncherProcess::_destroy}}.)
> After that, we see only
> {code}
> containerizer.cpp:2401] Skipping status for container
> d463b9fe-970d-4077-bab9-558464889a9e because: Container does not exist
> {code}
> repeatedly, because the agent's {{GET_CONTAINERS}} call is being polled once
> per minute. This seems to indicate that the container in question is still
> in the agent's {{containers_}} map.
> So it seems that the containerizer is stuck in either the Linux launcher's
> {{destroy()}} code path or the containerizer's own {{destroy()}} code path.
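
The log filtering described in the report can be reproduced with a small script. The following is only a sketch, not part of Mesos: the function names are hypothetical, the regular expression assumes the glog-style prefix seen in the excerpt (level letter, MMDD, time, thread ID, file:line), and it assumes each log entry sits on a single line in the agent log (the excerpt above is hard-wrapped by the mail). Treating the last INFO-level line as the last destroy-path step is a heuristic that happens to fit this excerpt, where the status-polling lines are warnings.

```python
import re

# glog-style prefix as it appears in the excerpt, e.g.
# "I1010 14:20:50.307531 6805 linux_launcher.cpp:654] Destroying cgroup ..."
GLOG_RE = re.compile(
    r"(?P<level>[IWEF])(?P<date>\d{4}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<thread>\d+) (?P<file>[\w.]+):(?P<line>\d+)\] (?P<msg>.*)"
)


def filter_container(lines, container_id):
    """Return parsed records for the log lines that mention container_id."""
    records = []
    for raw in lines:
        if container_id not in raw:
            continue
        match = GLOG_RE.search(raw)
        if match:
            records.append(match.groupdict())
    return records


def last_info_record(records):
    """Return the last INFO-level record, or None if there is none.

    Heuristic (an assumption): in the excerpt, destroy-path lines are logged
    at INFO ("I") while the status-polling lines are warnings ("W").
    """
    info = [r for r in records if r["level"] == "I"]
    return info[-1] if info else None
```

Passing the open agent log file and the container ID to `filter_container`, then the result to `last_info_record`, should point at the `linux_launcher.cpp:654` line for the container in this report, assuming the log matches the excerpt.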