[ https://issues.apache.org/jira/browse/MESOS-9836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907981#comment-16907981 ]
Qian Zhang edited comment on MESOS-9836 at 8/15/19 10:28 AM:
-------------------------------------------------------------

{quote}A fix would be after obtaining the pid through docker inspect, we look into the proc filesystem and get the cgroup right away and cache it, as well as other properties we originally retrieve through pid if any.
{quote}
[~chhsia0] Yeah, we could get and cache the cgroup right away (see the first sketch below), but we would also need to recover it after the agent restarts, which might be complicated.
{quote}Probably we should leave out all cgroups not containing "docker" substring instead of (or in addition to) filtering [the system root cgroup|https://github.com/apache/mesos/blob/0c431dd60ae39138cc7e8b099d41ad794c02c9a9/src/slave/containerizer/docker.cpp#L1783-L1788]. It's ugly, hacky, and introduces a dependency on Docker's runtime.
{quote}
[~abudnik] I think a typical cgroup for a Docker container looks like:
{code:java}
/sys/fs/cgroup/<subsystem>/docker/<Docker-container-ID>
{code}
Are you saying that in `DockerContainerizerProcess::__update` we should not update anything if the cgroup does not contain the "docker" substring (i.e., it is not a Docker container)? Yeah, that should work (see the second sketch below), but as you mentioned it introduces a dependency on Docker, which is unfortunate.

I am actually wondering why we need to update a Docker container's CPU & memory in the Docker containerizer at all. For a task launched via the Docker containerizer, the method `DockerContainerizerProcess::update` will be called twice:
# When the Docker executor registers with the agent (see [here|https://github.com/apache/mesos/blob/1.8.1/src/slave/slave.cpp#L5210] for details): the resources passed into `DockerContainerizerProcess::update` are the same as the container's resources, so we do not update anything (see [here|https://github.com/apache/mesos/blob/1.8.1/src/slave/containerizer/docker.cpp#L1675:L1680] for details).
# When the agent receives a terminal status update for the task (see [here|https://github.com/apache/mesos/blob/1.8.1/src/slave/slave.cpp#L5841:L5846] for details): since the container has terminated, what `docker inspect` returns will not contain a pid, so we do not update anything either (see [here|https://github.com/apache/mesos/blob/1.8.1/src/slave/containerizer/docker.cpp#L1754:L1756] for details).

The reason we hit the issue described in this ticket is that between #1 and #2 (i.e., while the Docker container is still running), `DockerContainerizerProcess::usage` is called (e.g., when the agent's /monitor/statistics endpoint is hit). Internally it calls `docker inspect`, which returns a valid pid, and we cache that pid (see [here|https://github.com/apache/mesos/blob/1.8.1/src/slave/containerizer/docker.cpp#L2025] for details). After that, but before #2, the pid may be reused by a process that is a child of the agent process, so in #2 we overwrite the agent's cgroup.

It is the Docker executor that launches the task as a Docker container, so it is a bit odd for the Docker containerizer to directly update the Docker container's resources in cgroups; that should be the job of the Docker executor. And I do not see why we need to support updating a Docker container's resources at runtime. So what is the purpose of the Docker containerizer's `update` method? To me, it should just do nothing.
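For illustration, here is a minimal standalone sketch (plain C++, not actual Mesos code) of the idea quoted above from [~chhsia0]: read `/proc/<pid>/cgroup` immediately after `docker inspect` returns the pid, and cache the result so that later updates never need to resolve a possibly reused pid again. The helper `readCgroup`, the map `cachedCgroups`, and the hard-coded pid are hypothetical and exist only for this example.
{code:cpp}
#include <sys/types.h>

#include <fstream>
#include <iostream>
#include <map>
#include <optional>
#include <sstream>
#include <string>

// Returns the cgroup (v1) path of `pid` for the given subsystem, e.g.
// "/docker/<container-id>", by parsing lines of the form
// "<hierarchy-id>:<subsystems>:<path>" in /proc/<pid>/cgroup.
std::optional<std::string> readCgroup(pid_t pid, const std::string& subsystem)
{
  std::ifstream file("/proc/" + std::to_string(pid) + "/cgroup");
  if (!file.is_open()) {
    return std::nullopt; // The process is already gone.
  }

  std::string line;
  while (std::getline(file, line)) {
    std::istringstream fields(line);
    std::string id, subsystems, path;
    if (std::getline(fields, id, ':') &&
        std::getline(fields, subsystems, ':') &&
        std::getline(fields, path)) {
      // The second field may list several subsystems, e.g. "cpu,cpuacct".
      if (("," + subsystems + ",").find("," + subsystem + ",") !=
          std::string::npos) {
        return path;
      }
    }
  }

  return std::nullopt;
}

int main()
{
  // Hypothetical pid obtained from `docker inspect` right after launch.
  pid_t pid = 13716;

  // Cache the cgroups immediately, while the pid still refers to the
  // container's init process; later updates would use these cached values
  // instead of resolving the (possibly reused) pid again.
  std::map<std::string, std::optional<std::string>> cachedCgroups;
  cachedCgroups["cpu"] = readCgroup(pid, "cpu");
  cachedCgroups["memory"] = readCgroup(pid, "memory");

  for (const auto& [subsystem, cgroup] : cachedCgroups) {
    std::cout << subsystem << ": "
              << cgroup.value_or("<not found>") << std::endl;
  }

  return 0;
}
{code}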
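And a similarly minimal, hypothetical sketch of the guard suggested by [~abudnik] for `DockerContainerizerProcess::__update`: skip the update entirely when the resolved cgroup does not look like a Docker container cgroup. The helper `shouldUpdateCgroup` is made up for the example; the hard-coded "docker" substring is exactly the dependency on Docker's runtime discussed above.
{code:cpp}
#include <iostream>
#include <string>

// Hypothetical guard: only write cgroup controls if the path looks like a
// Docker container cgroup (".../docker/<container-id>").
bool shouldUpdateCgroup(const std::string& cgroup)
{
  // NOTE: This hard-codes knowledge of Docker's cgroup layout, which is the
  // Docker-runtime dependency the discussion calls unfortunate.
  return cgroup.find("docker") != std::string::npos;
}

int main()
{
  // The first path is what we expect for a live Docker container; the second
  // is what we might resolve if the cached pid was reused by a child of the
  // agent, as described in this ticket.
  for (const std::string& cgroup :
       {std::string("/docker/f69c8a8c-eba4-4494-a305-0956a44a6ad2"),
        std::string("/mesos/slave")}) {
    std::cout << cgroup << " -> "
              << (shouldUpdateCgroup(cgroup) ? "update" : "skip update")
              << std::endl;
  }

  return 0;
}
{code}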
> Docker containerizer overwrites `/mesos/slave` cgroups.
> --------------------------------------------------------
>
>                 Key: MESOS-9836
>                 URL: https://issues.apache.org/jira/browse/MESOS-9836
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization
>            Reporter: Chun-Hung Hsiao
>            Priority: Critical
>              Labels: docker, mesosphere
>
> The following bug was observed on our internal testing cluster.
> The docker containerizer launched a container on an agent:
> {noformat}
> I0523 06:00:53.888579 21815 docker.cpp:1195] Starting container 'f69c8a8c-eba4-4494-a305-0956a44a6ad2' for task 'apps_docker-sleep-app.1fda5b8e-7d20-11e9-9717-7aa030269ee1' (and executor 'apps_docker-sleep-app.1fda5b8e-7d20-11e9-9717-7aa030269ee1') of framework 415284b7-2967-407d-b66f-f445e93f064e-0011
> I0523 06:00:54.524171 21815 docker.cpp:783] Checkpointing pid 13716 to '/var/lib/mesos/slave/meta/slaves/60c42ab7-eb1a-4cec-b03d-ea06bff00c3f-S2/frameworks/415284b7-2967-407d-b66f-f445e93f064e-0011/executors/apps_docker-sleep-app.1fda5b8e-7d20-11e9-9717-7aa030269ee1/runs/f69c8a8c-eba4-4494-a305-0956a44a6ad2/pids/forked.pid'
> {noformat}
> After the container was launched, the docker containerizer did a {{docker inspect}} on the container and cached the pid:
>
> [https://github.com/apache/mesos/blob/0c431dd60ae39138cc7e8b099d41ad794c02c9a9/src/slave/containerizer/docker.cpp#L1764]
>
> The pid should be slightly greater than 13716.
> The docker executor sent a {{TASK_FINISHED}} status update around 16 minutes later:
> {noformat}
> I0523 06:16:17.287595 21809 slave.cpp:5566] Handling status update TASK_FINISHED (Status UUID: 4e00b786-b773-46cd-8327-c7deb08f1de9) for task apps_docker-sleep-app.1fda5b8e-7d20-11e9-9717-7aa030269ee1 of framework 415284b7-2967-407d-b66f-f445e93f064e-0011 from executor(1)@172.31.1.7:36244
> {noformat}
> After receiving the terminal status update, the agent asked the docker containerizer to update {{cpu.cfs_period_us}}, {{cpu.cfs_quota_us}} and {{memory.soft_limit_in_bytes}} of the container through the cached pid:
>
> [https://github.com/apache/mesos/blob/0c431dd60ae39138cc7e8b099d41ad794c02c9a9/src/slave/containerizer/docker.cpp#L1696]
>
> {noformat}
> I0523 06:16:17.290447 21815 docker.cpp:1868] Updated 'cpu.shares' to 102 at /sys/fs/cgroup/cpu,cpuacct/mesos/slave for container f69c8a8c-eba4-4494-a305-0956a44a6ad2
> I0523 06:16:17.290660 21815 docker.cpp:1895] Updated 'cpu.cfs_period_us' to 100ms and 'cpu.cfs_quota_us' to 10ms (cpus 0.1) for container f69c8a8c-eba4-4494-a305-0956a44a6ad2
> I0523 06:16:17.889816 21815 docker.cpp:1937] Updated 'memory.soft_limit_in_bytes' to 32MB for container f69c8a8c-eba4-4494-a305-0956a44a6ad2
> {noformat}
> Note that the cgroup of {{cpu.shares}} was {{/mesos/slave}}. This was possibly because over the 16 minutes the pid got reused:
> {noformat}
> # zgrep 'systemd.cpp:98\]' /var/log/mesos/archive/mesos-agent.log.12.gz
> ...
> I0523 06:00:54.525178 21815 systemd.cpp:98] Assigned child process '13716' to 'mesos_executors.slice'
> I0523 06:00:55.078546 21808 systemd.cpp:98] Assigned child process '13798' to 'mesos_executors.slice'
> I0523 06:00:55.134096 21808 systemd.cpp:98] Assigned child process '13799' to 'mesos_executors.slice'
> ...
> I0523 06:06:30.997439 21808 systemd.cpp:98] Assigned child process '32689' to 'mesos_executors.slice'
> I0523 06:06:31.050976 21808 systemd.cpp:98] Assigned child process '32690' to 'mesos_executors.slice'
> I0523 06:06:31.110514 21815 systemd.cpp:98] Assigned child process '32692' to 'mesos_executors.slice'
> I0523 06:06:33.143726 21818 systemd.cpp:98] Assigned child process '446' to 'mesos_executors.slice'
> I0523 06:06:33.196251 21818 systemd.cpp:98] Assigned child process '447' to 'mesos_executors.slice'
> I0523 06:06:33.266332 21816 systemd.cpp:98] Assigned child process '449' to 'mesos_executors.slice'
> ...
> I0523 06:09:34.870056 21808 systemd.cpp:98] Assigned child process '13717' to 'mesos_executors.slice'
> I0523 06:09:34.937762 21813 systemd.cpp:98] Assigned child process '13744' to 'mesos_executors.slice'
> I0523 06:09:35.073971 21817 systemd.cpp:98] Assigned child process '13754' to 'mesos_executors.slice'
> ...
> {noformat}
> It was highly likely that the container itself exited around 06:09:35, way before the docker executor detected and reported the terminal status update, and then its pid was reused by another forked child of the agent, and thus {{cpu.cfs_period_us}}, {{cpu.cfs_quota_us}} and {{memory.soft_limit_in_bytes}} of the {{/mesos/slave}} cgroup were mistakenly overwritten.

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)