[ https://issues.apache.org/jira/browse/MESOS-9268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16629150#comment-16629150 ]
Andrei Budnik edited comment on MESOS-9268 at 9/26/18 5:12 PM:
---------------------------------------------------------------

I launched 239 Docker containers, each running the `sleep 1000` command, on a Fedora 25 machine (16 CPUs, 64 GB of memory):
{code:java}
time curl `hostname`:5051/containers
...
real 0m21.586s
user 0m0.003s
sys 0m0.008s
{code}
Hitting the `/containers` endpoint took ~21 seconds. Note that this machine was not under heavy load.

After I removed the code that [accesses Sysfs|https://github.com/apache/mesos/blob/1.7.x/src/slave/containerizer/docker.cpp#L2010-L2029] and then launched 210 Docker containers, the `/containers` endpoint started to respond very quickly:
{code:java}
time curl `hostname`:5051/containers
...
real 0m0.082s
user 0m0.004s
sys 0m0.004s
{code}
`DockerContainerizerProcess::usage()` is called for every container when the `/containers` endpoint is hit. It calls `docker inspect` and then collects some cgroup statistics (by accessing Sysfs). In the experiment above, `DockerContainerizerProcess::usage()` spent most of its time collecting cgroup statistics.

> Hitting agent's `/containers` endpoint might backlog Docker containerizer process.
> ----------------------------------------------------------------------------------
>
>                 Key: MESOS-9268
>                 URL: https://issues.apache.org/jira/browse/MESOS-9268
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent, containerization, docker
>            Reporter: Andrei Budnik
>            Priority: Major
>
> When the agent's `/containers` endpoint is hit, the agent calls the
> `DockerContainerizerProcess::usage()` method for every running Docker
> container. If there are hundreds of running Docker containers and the
> system is under heavy load, each `DockerContainerizerProcess::usage()` call
> takes a long time to process. Hence, when the Docker containerizer's
> mailbox is full of `DockerContainerizerProcess::usage()` calls waiting to
> be processed, an attempt to launch a new Docker container puts
> `DockerContainerizer::launch()` at the end of a long queue. This is an example
> from the logs:
> {code:java}
> 2018-09-07 00:06:03: I0907 00:06:03.031744 65329 slave.cpp:2857] Launching container 320547f2-9ab9-4ade-bddb-bfd6f5ed0834 for executor 'example_test2519.ceeefb8a-b231-11e8-8d62-8e820bbfbfd1' of framework 51b197db-56ae-47dc-a1bd-3aaf2306bd1a-0001
> ...
> 2018-09-07 00:36:20: I0907 00:36:20.472048 65307 docker.cpp:1179] Starting container '320547f2-9ab9-4ade-bddb-bfd6f5ed0834' for task 'example_test2519.ceeefb8a-b231-11e8-8d62-8e820bbfbfd1' (and executor 'middleware_test2519.ceeefb8a-b231-11e8-8d62-8e820bbfbfd1') of framework 51b197db-56ae-47dc-a1bd-3aaf2306bd1a-0001
> {code}
> After disabling the `/containers` endpoint, the issue with a backlogged Docker
> containerizer disappears.
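
The backlog described in this issue can be illustrated with a toy model of a FIFO mailbox. This is only a sketch, not Mesos code: the function `launch_delay` and its parameters are hypothetical, and it assumes that every `usage()` call is processed serially and costs the same amount of time. The per-call cost is derived from the measurement above (21.586 s for 239 containers).

```python
from collections import deque


def launch_delay(num_containers, usage_cost_s, polls_queued):
    """Return the seconds a launch() waits behind queued usage() calls.

    Toy model (assumption): one FIFO mailbox, processed strictly in order,
    with a fixed cost per usage() call. Real Mesos behaviour is more complex.
    """
    mailbox = deque()

    # Each hit on /containers enqueues one usage() call per running container.
    for _ in range(polls_queued):
        for _ in range(num_containers):
            mailbox.append(("usage", usage_cost_s))

    # The launch request arrives after all of the usage() calls.
    mailbox.append(("launch", 0.0))

    clock = 0.0
    while mailbox:
        name, cost = mailbox.popleft()
        if name == "launch":
            return clock  # time spent waiting before launch() is reached
        clock += cost
    raise RuntimeError("launch was never dequeued")


# Average cost per container from the measurement: roughly 0.09 seconds.
per_call_s = 21.586 / 239

one_poll = launch_delay(239, per_call_s, polls_queued=1)
ten_polls = launch_delay(239, per_call_s, polls_queued=10)
print(f"launch waits {one_poll:.1f}s behind one poll, {ten_polls:.1f}s behind ten")
```

Under these assumptions, a single queued poll already delays a launch by about 21.6 seconds, and the delay grows linearly with the number of polls that arrive before the mailbox drains.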