[jira] [Commented] (MESOS-9183) IntervalSet up bound is one off
[ https://issues.apache.org/jira/browse/MESOS-9183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16592304#comment-16592304 ] Jie Yu commented on MESOS-9183: --- This is probably because we convert everything to `boost::icl::interval_bounds::static_right_open`, causing an overflow in this case. > IntervalSet up bound is one off > - > > Key: MESOS-9183 > URL: https://issues.apache.org/jira/browse/MESOS-9183 > Project: Mesos > Issue Type: Bug > Components: stout >Reporter: Xudong Ni >Priority: Minor > > The unsigned 16-bit integer range is [0, 65535]; if we try to set this range, the > set will be "{}". > Example code: > {quote}IntervalSet set; > set += (Bound::closed(0), Bound::closed(65535)); > Results: "{}"; Expected: "[0, 65535]" > {quote} > If we decrease the upper bound by 1 to 65534, it works normally. > {quote}IntervalSet set; > set += (Bound::closed(0), Bound::closed(65534)); > Results: "[0, 65535)"; Expected: "[0, 65535)" > {quote} > It appears that the upper bound is one off; since IntervalSet is a template, > other types may have the same issue. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
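The suspected overflow can be demonstrated without boost::icl at all. A minimal sketch in plain C++ (hypothetical helper names, not Mesos/stout code): converting the closed upper bound to a right-open bound computes upper + 1, which wraps to 0 for the maximum value of a 16-bit type, so the resulting interval [0, 0) is empty.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical illustration: representing a closed upper bound as a
// right-open bound requires upper + 1, which wraps around (modulo 2^16)
// for the maximum value of the underlying type.
inline uint16_t toRightOpenUpper(uint16_t closedUpper) {
  // For closedUpper == 65535 this wraps to 0.
  return static_cast<uint16_t>(closedUpper + 1u);
}

// An interval [lower, rightOpenUpper) is empty when
// rightOpenUpper <= lower under ordinary unsigned comparison.
inline bool isEmptyRightOpen(uint16_t lower, uint16_t rightOpenUpper) {
  return rightOpenUpper <= lower;
}
```

This matches the observed behavior: a closed bound of 65534 converts to the right-open bound 65535 and works, while a closed bound of 65535 wraps and yields the empty set.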
[jira] [Created] (MESOS-9183) IntervalSet up bound is one off
Xudong Ni created MESOS-9183: Summary: IntervalSet up bound is one off Key: MESOS-9183 URL: https://issues.apache.org/jira/browse/MESOS-9183 Project: Mesos Issue Type: Bug Components: stout Reporter: Xudong Ni The unsigned 16-bit integer range is [0, 65535]; if we try to set this range, the set will be "{}". Example code: {quote}IntervalSet set; set += (Bound::closed(0), Bound::closed(65535)); Results: "{}"; Expected: "[0, 65535]" {quote} If we decrease the upper bound by 1 to 65534, it works normally. {quote}IntervalSet set; set += (Bound::closed(0), Bound::closed(65534)); Results: "[0, 65535)"; Expected: "[0, 65535)" {quote} It appears that the upper bound is one off; since IntervalSet is a template, other types may have the same issue.
[jira] [Created] (MESOS-9182) Improve `class Slave` in the allocator.
Meng Zhu created MESOS-9182: --- Summary: Improve `class Slave` in the allocator. Key: MESOS-9182 URL: https://issues.apache.org/jira/browse/MESOS-9182 Project: Mesos Issue Type: Improvement Components: allocation Reporter: Meng Zhu Assignee: Meng Zhu Currently, there are several issues with the `Slave` class in the allocator: (1) Resources on an agent are characterized by two variables: total and allocated. However, these two related fields are currently mutated separately by different member functions, leading to temporary inconsistencies. This is fragile and has produced several odd logic flows. (2) While we track aggregate allocated resources on the agent, we do not know which frameworks those resources are allocated to. This lack of information makes several things difficult, for example the odd agent removal logic described in MESOS-621. Also, we currently cannot update the framework sorter by simply looking at the `Slave` class, which leads to convoluted logic for updating and tracking (un)allocated resources.
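As a rough illustration of points (1) and (2), here is a hypothetical sketch (names and the scalar resource model are invented, not the actual allocator code): total and allocated live behind a single mutation path that preserves the invariant allocated <= total, and allocations are tracked per framework so the sorter could be updated from this class alone.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical sketch, with resources reduced to a single integer.
class Slave {
public:
  explicit Slave(long total) : total_(total), allocated_(0) {}

  // Allocation is rejected or applied atomically, so total_ and
  // allocated_ never go through a temporarily inconsistent state.
  bool allocate(const std::string& frameworkId, long amount) {
    if (allocated_ + amount > total_) {
      return false;  // Would over-allocate the agent.
    }
    allocated_ += amount;
    frameworkAllocations_[frameworkId] += amount;
    return true;
  }

  bool deallocate(const std::string& frameworkId, long amount) {
    auto it = frameworkAllocations_.find(frameworkId);
    if (it == frameworkAllocations_.end() || it->second < amount) {
      return false;  // Framework does not hold that much here.
    }
    it->second -= amount;
    allocated_ -= amount;
    if (it->second == 0) {
      frameworkAllocations_.erase(it);
    }
    return true;
  }

  long total() const { return total_; }
  long allocated() const { return allocated_; }

  // Per-framework tracking answers "which frameworks hold resources
  // on this agent?" directly, addressing point (2).
  long allocatedTo(const std::string& frameworkId) const {
    auto it = frameworkAllocations_.find(frameworkId);
    return it == frameworkAllocations_.end() ? 0 : it->second;
  }

private:
  long total_;
  long allocated_;
  std::map<std::string, long> frameworkAllocations_;
};
```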
[jira] [Commented] (MESOS-8270) Add an agent endpoint to list all active resource providers
[ https://issues.apache.org/jira/browse/MESOS-8270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16592022#comment-16592022 ] Chun-Hung Hsiao commented on MESOS-8270: The following documentation patch has been committed and backported to 1.7.0: {noformat} commit 71137e87e2ffd8b7455373b6b02c5e662b244cfa (HEAD -> upstream/master) Author: Benjamin Bannier Date: Fri Aug 24 11:01:57 2018 -0700 Documented the `GET_RESOURCE_PROVIDERS` agent API call. Review: https://reviews.apache.org/r/68504/{noformat} {noformat} commit 3cac4ba05bd76bdb4a3100d34b8151a85592701b (HEAD -> upstream/1.7.x) Author: Benjamin Bannier Date: Fri Aug 24 11:01:57 2018 -0700 Documented the `GET_RESOURCE_PROVIDERS` agent API call. Review: https://reviews.apache.org/r/68504/{noformat} > Add an agent endpoint to list all active resource providers > --- > > Key: MESOS-8270 > URL: https://issues.apache.org/jira/browse/MESOS-8270 > Project: Mesos > Issue Type: Task > Components: agent >Reporter: Jan Schlicht >Assignee: Jan Schlicht >Priority: Major > Labels: mesosphere > Fix For: 1.5.0 > > > Operators/Frameworks might need information about all resource providers > currently running on an agent. An API endpoint should provide that > information and include resource provider name and type.
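For context, the documented call is made by POSTing a `GET_RESOURCE_PROVIDERS` message to the agent's v1 operator API endpoint. A minimal sketch of the request shape as plain strings (a real client would use an HTTP library, and the body may also be protobuf-encoded):

```cpp
#include <cassert>
#include <string>

// Sketch of the v1 agent API call being documented: an HTTP POST of a
// JSON body to the agent's /api/v1 endpoint.
inline std::string resourceProvidersCallBody() {
  return "{\"type\": \"GET_RESOURCE_PROVIDERS\"}";
}

inline std::string agentApiPath() {
  return "/api/v1";
}
```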
[jira] [Commented] (MESOS-8158) Mesos Agent in docker neglects to retry discovering Task docker containers
[ https://issues.apache.org/jira/browse/MESOS-8158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16591768#comment-16591768 ] Sjoerd Mulder commented on MESOS-8158: -- I'm also experiencing this with a similar setup (mesos-agent inside Docker with the docker_mesos_image flag), using Mesos 1.6.1. > Mesos Agent in docker neglects to retry discovering Task docker containers > -- > > Key: MESOS-8158 > URL: https://issues.apache.org/jira/browse/MESOS-8158 > Project: Mesos > Issue Type: Bug > Components: agent, containerization, docker, executor >Affects Versions: 1.4.0 > Environment: Windows 10 with Docker version 17.09.0-ce, build afdb6d4 >Reporter: Charles R Allen >Priority: Major > > I have attempted to launch Mesos agents inside of a docker container in such > a way that the agent container can be replaced and recovered. Unfortunately I > hit a major snag in the way the Mesos docker launching works. > To test simple functionality, a marathon app is set up that simply has the > following command: {{date && python -m SimpleHTTPServer $PORT0}} > That way the HTTP port can be accessed to assure things are being assigned > correctly, and the date is printed out in the log. > When I attempt to start this marathon app, the mesos agent (inside a docker > container) properly launches an executor which properly creates a second task > that launches the python code. 
Here's the output from the executor logs (this > looks correct): > {code} > I1101 20:34:03.420210 68270 exec.cpp:162] Version: 1.4.0 > I1101 20:34:03.427455 68281 exec.cpp:237] Executor registered on agent > d9bb6e96-ee26-43c2-977e-0c404fdd4e81-S0 > I1101 20:34:03.428414 68283 executor.cpp:120] Registered docker executor on > 10.0.75.2 > I1101 20:34:03.428680 68281 executor.cpp:160] Starting task > testapp.fe35282f-bf43-11e7-a24b-0242ac110002 > I1101 20:34:03.428941 68281 docker.cpp:1080] Running docker -H > unix:///var/run/docker.sock run --cpu-shares 1024 --memory 134217728 -e > HOST=10.0.75.2 -e MARATHON_APP_DOCKER_IMAGE=python:2 -e > MARATHON_APP_ID=/testapp -e MARATHON_APP_LABELS= -e MARATHON_APP_RESOURCE_CPUS > =1.0 -e MARATHON_APP_RESOURCE_DISK=0.0 -e MARATHON_APP_RESOURCE_GPUS=0 -e > MARATHON_APP_RESOURCE_MEM=128.0 -e > MARATHON_APP_VERSION=2017-11-01T20:33:44.869Z -e > MESOS_CONTAINER_NAME=mesos-84f9ae30-9d4c-484a-860c-ca7845b7ec75 -e > MESOS_SANDBOX=/mnt/mesos/sandbox -e MESOS_TA > SK_ID=testapp.fe35282f-bf43-11e7-a24b-0242ac110002 -e PORT=31464 -e > PORT0=31464 -e PORTS=31464 -e PORT_1=31464 -e PORT_HTTP=31464 -v > /var/run/mesos/slaves/d9bb6e96-ee26-43c2-977e-0c404fdd4e81-S0/frameworks/a5eb6da1-f8ac-4642-8d66-cdd2e5b14d45-0001/executors/testapp > .fe35282f-bf43-11e7-a24b-0242ac110002/runs/84f9ae30-9d4c-484a-860c-ca7845b7ec75:/mnt/mesos/sandbox > --net host --entrypoint /bin/sh --name > mesos-84f9ae30-9d4c-484a-860c-ca7845b7ec75 > --label=MESOS_TASK_ID=testapp.fe35282f-bf43-11e7-a24b-0242ac110002 python:2 > -c date && p > ython -m SimpleHTTPServer $PORT0 > I1101 20:34:03.430402 68281 docker.cpp:1243] Running docker -H > unix:///var/run/docker.sock inspect mesos-84f9ae30-9d4c-484a-860c-ca7845b7ec75 > I1101 20:34:03.520303 68286 docker.cpp:1290] Retrying inspect with non-zero > status code. 
cmd: 'docker -H unix:///var/run/docker.sock inspect > mesos-84f9ae30-9d4c-484a-860c-ca7845b7ec75', interval: 500ms > I1101 20:34:04.021216 68288 docker.cpp:1243] Running docker -H > unix:///var/run/docker.sock inspect mesos-84f9ae30-9d4c-484a-860c-ca7845b7ec75 > I1101 20:34:04.124490 68281 docker.cpp:1290] Retrying inspect with non-zero > status code. cmd: 'docker -H unix:///var/run/docker.sock inspect > mesos-84f9ae30-9d4c-484a-860c-ca7845b7ec75', interval: 500ms > I1101 20:34:04.624964 68288 docker.cpp:1243] Running docker -H > unix:///var/run/docker.sock inspect mesos-84f9ae30-9d4c-484a-860c-ca7845b7ec75 > I1101 20:34:04.934087 68286 docker.cpp:1345] Retrying inspect since container > not yet started. cmd: 'docker -H unix:///var/run/docker.sock inspect > mesos-84f9ae30-9d4c-484a-860c-ca7845b7ec75', interval: 500ms > I1101 20:34:05.435145 68288 docker.cpp:1243] Running docker -H > unix:///var/run/docker.sock inspect mesos-84f9ae30-9d4c-484a-860c-ca7845b7ec75 > Wed Nov 1 20:34:06 UTC 2017 > {code} > But, somehow there is a TASK_FAILED message sent to marathon. > Upon further investigation, the following snippet can be found in the agent > logs (running in a docker container) > {code} > I1101 20:34:00.949129 9 slave.cpp:1736] Got assigned task > 'testapp.fe35282f-bf43-11e7-a24b-0242ac110002' for framework > a5eb6da1-f8ac-4642-8d66-cdd2e5b14d45-0001 > I1101 20:34:00.950150 9 gc.cpp:93] Unscheduling > '/var/run/mesos/slaves/d9bb6e96-ee26-43c2-977e-0c404fdd4e81-S0
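The "Retrying inspect ... interval: 500ms" pattern in the executor log amounts to retrying `docker inspect` until it succeeds. A hypothetical sketch of that bounded retry loop (synchronous here for simplicity; the real executor schedules each retry asynchronously after the interval):

```cpp
#include <cassert>
#include <functional>

// Hypothetical sketch of retry-until-success with a bounded number of
// attempts, mirroring the inspect retries in the executor log above.
template <typename T>
bool retryUntil(const std::function<bool(T*)>& attempt,
                int maxAttempts,
                T* result,
                int* attemptsUsed) {
  for (int i = 1; i <= maxAttempts; ++i) {
    if (attempt(result)) {
      if (attemptsUsed != nullptr) *attemptsUsed = i;
      return true;  // Inspect (or any probe) finally succeeded.
    }
    // In the real executor a 500ms delay would be inserted here.
  }
  if (attemptsUsed != nullptr) *attemptsUsed = maxAttempts;
  return false;  // Gave up after maxAttempts.
}
```

The bug report is essentially that this discovery loop is not resumed correctly when the agent itself runs inside Docker, so a task that eventually starts is still reported as failed.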
[jira] [Assigned] (MESOS-9131) Health checks launching nested containers while a container is being destroyed lead to unkillable tasks
[ https://issues.apache.org/jira/browse/MESOS-9131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrei Budnik reassigned MESOS-9131: Assignee: Andrei Budnik (was: Qian Zhang) > Health checks launching nested containers while a container is being > destroyed lead to unkillable tasks > --- > > Key: MESOS-9131 > URL: https://issues.apache.org/jira/browse/MESOS-9131 > Project: Mesos > Issue Type: Bug > Components: agent, containerization >Affects Versions: 1.5.1 >Reporter: Jan Schlicht >Assignee: Andrei Budnik >Priority: Blocker > Labels: container-stuck > > A container might get stuck in {{DESTROYING}} state if there's a command > health check that starts new nested containers while its parent container is > getting destroyed. > Here are some logs with unrelated lines removed. The > `REMOVE_NESTED_CONTAINER`/`LAUNCH_NESTED_CONTAINER_SESSION` calls keep looping > afterwards. > {noformat} > 2018-04-16 12:37:54: I0416 12:37:54.235877 3863 containerizer.cpp:2807] > Container > db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133 has > exited > 2018-04-16 12:37:54: I0416 12:37:54.235914 3863 containerizer.cpp:2354] > Destroying container > db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133 in > RUNNING state > 2018-04-16 12:37:54: I0416 12:37:54.235932 3863 containerizer.cpp:2968] > Transitioning the state of container > db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133 > from RUNNING to DESTROYING > 2018-04-16 12:37:54: I0416 12:37:54.236100 3852 linux_launcher.cpp:514] > Asked to destroy container > db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.e6e01854-40a0-4da3-b458-2b4cf52bbc11 > 2018-04-16 12:37:54: I0416 12:37:54.237671 3852 linux_launcher.cpp:560] > Using freezer to destroy cgroup > mesos/db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3/mesos/0e44d4d7-629f-41f1-80df-4aae9583d133/mesos/e6e01854-40a0-4da3-b458-2b4cf52bbc11 > 2018-04-16 12:37:54: I0416 
12:37:54.240327 3852 cgroups.cpp:3060] Freezing > cgroup > /sys/fs/cgroup/freezer/mesos/db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3/mesos/0e44d4d7-629f-41f1-80df-4aae9583d133/mesos/e6e01854-40a0-4da3-b458-2b4cf52bbc11 > 2018-04-16 12:37:54: I0416 12:37:54.244179 3852 cgroups.cpp:1415] > Successfully froze cgroup > /sys/fs/cgroup/freezer/mesos/db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3/mesos/0e44d4d7-629f-41f1-80df-4aae9583d133/mesos/e6e01854-40a0-4da3-b458-2b4cf52bbc11 > after 3.814144ms > 2018-04-16 12:37:54: I0416 12:37:54.250550 3853 cgroups.cpp:3078] Thawing > cgroup > /sys/fs/cgroup/freezer/mesos/db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3/mesos/0e44d4d7-629f-41f1-80df-4aae9583d133/mesos/e6e01854-40a0-4da3-b458-2b4cf52bbc11 > 2018-04-16 12:37:54: I0416 12:37:54.256599 3853 cgroups.cpp:1444] > Successfully thawed cgroup > /sys/fs/cgroup/freezer/mesos/db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3/mesos/0e44d4d7-629f-41f1-80df-4aae9583d133/mesos/e6e01854-40a0-4da3-b458-2b4cf52bbc11 > after 5.977856ms > ... > 2018-04-16 12:37:54: I0416 12:37:54.371117 3837 http.cpp:3502] Processing > LAUNCH_NESTED_CONTAINER_SESSION call for container > 'db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.2bfd8eed-b528-493b-8434-04311e453dcd' > 2018-04-16 12:37:54: W0416 12:37:54.371692 3842 http.cpp:2758] Failed to > launch container > db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.2bfd8eed-b528-493b-8434-04311e453dcd: > Parent container > db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133 is > in 'DESTROYING' state > 2018-04-16 12:37:54: W0416 12:37:54.371826 3840 containerizer.cpp:2337] > Attempted to destroy unknown container > db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.2bfd8eed-b528-493b-8434-04311e453dcd > ... 
> 2018-04-16 12:37:55: I0416 12:37:55.504456 3856 http.cpp:3078] Processing > REMOVE_NESTED_CONTAINER call for container > 'db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.check-f3a1238c-7f0f-4db3-bda4-c0ea951d46b6' > ... > 2018-04-16 12:37:55: I0416 12:37:55.556367 3857 http.cpp:3502] Processing > LAUNCH_NESTED_CONTAINER_SESSION call for container > 'db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.check-0db8bd89-6f19-48c6-a69f-40196b4bc211' > ... > 2018-04-16 12:37:55: W0416 12:37:55.582137 3850 http.cpp:2758] Failed to > launch container > db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.check-0db8bd89-6f19-48c6-a69f-40196b4bc211: > Parent container > db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133 is > in 'DESTROYING' state > ... > 2018-04-16 12:37:55: W0416 12:37:
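The failure mode above can be viewed as a state-machine guard: once the parent container enters DESTROYING, nested launches under it are rejected, which is exactly the "Parent container ... is in 'DESTROYING' state" warning that keeps repeating while the health checker retries. A hypothetical sketch of the guard (invented names, not Mesos code):

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical sketch of the containerizer's state guard.
enum class State { RUNNING, DESTROYING, TERMINATED };

class Containerizer {
public:
  void add(const std::string& id, State s) { states_[id] = s; }

  // Mirrors "Failed to launch container ...: Parent container ... is
  // in 'DESTROYING' state" from the log above: launching under a
  // parent that is unknown or not RUNNING is rejected.
  bool launchNested(const std::string& parentId,
                    const std::string& childId) {
    auto it = states_.find(parentId);
    if (it == states_.end() || it->second != State::RUNNING) {
      return false;
    }
    states_[parentId + "." + childId] = State::RUNNING;
    return true;
  }

  void destroy(const std::string& id) { states_[id] = State::DESTROYING; }

private:
  std::map<std::string, State> states_;
};
```

The bug is that the command health checker keeps issuing new `LAUNCH_NESTED_CONTAINER_SESSION`/`REMOVE_NESTED_CONTAINER` calls against this guard instead of giving up, leaving the task unkillable.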
[jira] [Comment Edited] (MESOS-8568) Command checks should always call `WAIT_NESTED_CONTAINER` before `REMOVE_NESTED_CONTAINER`
[ https://issues.apache.org/jira/browse/MESOS-8568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16591235#comment-16591235 ] Qian Zhang edited comment on MESOS-8568 at 8/24/18 7:05 AM: I ran exactly the same reproduction steps with the above patch applied and found this issue was gone; there is only one check container sandbox directory at any time. {code:java} $ ls -la /home/qzhang/opt/mesos/slaves/1eada535-3848-4c76-b8c5-0e9e0d6fa102-S0/frameworks/8a842ab3-8aba-4d64-a744-ae98bdcf6d59-/executors/default-executor/runs/9a369757-3a5e-47f9-9bfc-adcf3608d8dc/containers/06e7c625-596c-454c-b092-f17a81073349/containers | grep check | wc -l {code} Here is the agent log; we can see that `WAIT_NESTED_CONTAINER` was called before `REMOVE_NESTED_CONTAINER`. {code:java} I0823 19:46:39.269901 32604 http.cpp:3366] Processing LAUNCH_NESTED_CONTAINER_SESSION call for container '9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18' I0823 19:46:39.277669 32603 switchboard.cpp:316] Container logger module finished preparing container 9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18; IOSwitchboard server is required I0823 19:46:39.284180 32603 systemd.cpp:98] Assigned child process '34701' to 'mesos_executors.slice' I0823 19:46:39.284451 32603 switchboard.cpp:604] Created I/O switchboard server (pid: 34701) listening on socket file '/tmp/mesos-io-switchboard-12e8e4c7-268e-4184-881c-a16b61fa260c' for container 9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18 I0823 19:46:39.288053 32641 linux_launcher.cpp:492] Launching nested container 9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18 and cloning with namespaces W0823 19:46:39.302271 32636 http.cpp:2635] Failed to launch 
container 9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18: Collect failed: ==Fake error== I0823 19:46:39.304822 32639 linux_launcher.cpp:580] Asked to destroy container 9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18 I0823 19:46:39.305047 32639 linux_launcher.cpp:622] Destroying cgroup '/sys/fs/cgroup/freezer/mesos/9a369757-3a5e-47f9-9bfc-adcf3608d8dc/mesos/06e7c625-596c-454c-b092-f17a81073349/mesos/check-142ccb3b-9ba8-4a04-a79f-29147b921d18' I0823 19:46:39.306437 32646 cgroups.cpp:2838] Freezing cgroup /sys/fs/cgroup/freezer/mesos/9a369757-3a5e-47f9-9bfc-adcf3608d8dc/mesos/06e7c625-596c-454c-b092-f17a81073349/mesos/check-142ccb3b-9ba8-4a04-a79f-29147b921d18 I0823 19:46:39.307015 32614 cgroups.cpp:1229] Successfully froze cgroup /sys/fs/cgroup/freezer/mesos/9a369757-3a5e-47f9-9bfc-adcf3608d8dc/mesos/06e7c625-596c-454c-b092-f17a81073349/mesos/check-142ccb3b-9ba8-4a04-a79f-29147b921d18 after 419840ns I0823 19:46:39.307715 32641 http.cpp:1117] HTTP POST for /slave(1)/api/v1 from 10.0.49.2:42086 I0823 19:46:39.308198 32646 cgroups.cpp:2856] Thawing cgroup /sys/fs/cgroup/freezer/mesos/9a369757-3a5e-47f9-9bfc-adcf3608d8dc/mesos/06e7c625-596c-454c-b092-f17a81073349/mesos/check-142ccb3b-9ba8-4a04-a79f-29147b921d18 I0823 19:46:39.308298 32641 http.cpp:2685] Processing WAIT_NESTED_CONTAINER call for container '9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18' I0823 19:46:39.308583 32605 cgroups.cpp:1258] Successfully thawed cgroup /sys/fs/cgroup/freezer/mesos/9a369757-3a5e-47f9-9bfc-adcf3608d8dc/mesos/06e7c625-596c-454c-b092-f17a81073349/mesos/check-142ccb3b-9ba8-4a04-a79f-29147b921d18 after 265728ns I0823 19:46:39.373747 32616 linux_launcher.cpp:654] Destroying cgroup 
'/sys/fs/cgroup/systemd/mesos/9a369757-3a5e-47f9-9bfc-adcf3608d8dc/mesos/06e7c625-596c-454c-b092-f17a81073349/mesos/check-142ccb3b-9ba8-4a04-a79f-29147b921d18' I0823 19:46:44.375650 32647 switchboard.cpp:807] Sending SIGTERM to I/O switchboard server (pid: 34701) since container 9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18 is being destroyed I0823 19:46:44.403535 32637 switchboard.cpp:913] I/O switchboard server process for container 9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18 has terminated (status=0) I0823 19:46:47.420578 32622 http.cpp:1117] HTTP POST for /slave(1)/api/v1 from 10.0.49.2:42088 I0823 19:46:47.421331 32622 http.cpp:2971] Processing REMOVE_NESTED_CONTAINER call for container '9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18' I0823 19:46:47.42
[jira] [Commented] (MESOS-8568) Command checks should always call `WAIT_NESTED_CONTAINER` before `REMOVE_NESTED_CONTAINER`
[ https://issues.apache.org/jira/browse/MESOS-8568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16591235#comment-16591235 ] Qian Zhang commented on MESOS-8568: --- I ran exactly the same reproduction steps with the above patch applied and found this issue was gone; there is only one check container sandbox directory at any time. {code:java} $ ls -la /home/qzhang/opt/mesos/slaves/1eada535-3848-4c76-b8c5-0e9e0d6fa102-S0/frameworks/8a842ab3-8aba-4d64-a744-ae98bdcf6d59-/executors/default-executor/runs/9a369757-3a5e-47f9-9bfc-adcf3608d8dc/containers/06e7c625-596c-454c-b092-f17a81073349/containers | grep check | wc -l {code} Here is the agent log; we can see that `WAIT_NESTED_CONTAINER` was called before `REMOVE_NESTED_CONTAINER`. {code:java} I0823 19:46:39.269901 32604 http.cpp:3366] Processing LAUNCH_NESTED_CONTAINER_SESSION call for container '9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18' I0823 19:46:39.277669 32603 switchboard.cpp:316] Container logger module finished preparing container 9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18; IOSwitchboard server is required I0823 19:46:39.284180 32603 systemd.cpp:98] Assigned child process '34701' to 'mesos_executors.slice' I0823 19:46:39.284451 32603 switchboard.cpp:604] Created I/O switchboard server (pid: 34701) listening on socket file '/tmp/mesos-io-switchboard-12e8e4c7-268e-4184-881c-a16b61fa260c' for container 9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18 I0823 19:46:39.288053 32641 linux_launcher.cpp:492] Launching nested container 9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18 and cloning with namespaces W0823 19:46:39.302271 32636 http.cpp:2635] Failed to launch container 
9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18: Collect failed: ==Fake error== I0823 19:46:39.304822 32639 linux_launcher.cpp:580] Asked to destroy container 9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18 I0823 19:46:39.305047 32639 linux_launcher.cpp:622] Destroying cgroup '/sys/fs/cgroup/freezer/mesos/9a369757-3a5e-47f9-9bfc-adcf3608d8dc/mesos/06e7c625-596c-454c-b092-f17a81073349/mesos/check-142ccb3b-9ba8-4a04-a79f-29147b921d18' I0823 19:46:39.306437 32646 cgroups.cpp:2838] Freezing cgroup /sys/fs/cgroup/freezer/mesos/9a369757-3a5e-47f9-9bfc-adcf3608d8dc/mesos/06e7c625-596c-454c-b092-f17a81073349/mesos/check-142ccb3b-9ba8-4a04-a79f-29147b921d18 I0823 19:46:39.307015 32614 cgroups.cpp:1229] Successfully froze cgroup /sys/fs/cgroup/freezer/mesos/9a369757-3a5e-47f9-9bfc-adcf3608d8dc/mesos/06e7c625-596c-454c-b092-f17a81073349/mesos/check-142ccb3b-9ba8-4a04-a79f-29147b921d18 after 419840ns I0823 19:46:39.307715 32641 http.cpp:1117] HTTP POST for /slave(1)/api/v1 from 10.0.49.2:42086 I0823 19:46:39.308198 32646 cgroups.cpp:2856] Thawing cgroup /sys/fs/cgroup/freezer/mesos/9a369757-3a5e-47f9-9bfc-adcf3608d8dc/mesos/06e7c625-596c-454c-b092-f17a81073349/mesos/check-142ccb3b-9ba8-4a04-a79f-29147b921d18 I0823 19:46:39.308298 32641 http.cpp:2685] Processing WAIT_NESTED_CONTAINER call for container '9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18' I0823 19:46:39.308583 32605 cgroups.cpp:1258] Successfully thawed cgroup /sys/fs/cgroup/freezer/mesos/9a369757-3a5e-47f9-9bfc-adcf3608d8dc/mesos/06e7c625-596c-454c-b092-f17a81073349/mesos/check-142ccb3b-9ba8-4a04-a79f-29147b921d18 after 265728ns I0823 19:46:39.373747 32616 linux_launcher.cpp:654] Destroying cgroup 
'/sys/fs/cgroup/systemd/mesos/9a369757-3a5e-47f9-9bfc-adcf3608d8dc/mesos/06e7c625-596c-454c-b092-f17a81073349/mesos/check-142ccb3b-9ba8-4a04-a79f-29147b921d18' I0823 19:46:44.375650 32647 switchboard.cpp:807] Sending SIGTERM to I/O switchboard server (pid: 34701) since container 9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18 is being destroyed I0823 19:46:44.403535 32637 switchboard.cpp:913] I/O switchboard server process for container 9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18 has terminated (status=0) I0823 19:46:47.420578 32622 http.cpp:1117] HTTP POST for /slave(1)/api/v1 from 10.0.49.2:42088 I0823 19:46:47.421331 32622 http.cpp:2971] Processing REMOVE_NESTED_CONTAINER call for container '9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18' I0823 19:46:47.427382 32636 http.cpp:1117] HTTP POST for /slave
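The ordering requirement in the title can be sketched as: `REMOVE_NESTED_CONTAINER` is only legal after a completed `WAIT_NESTED_CONTAINER` has observed termination, so a sandbox is never removed while its container may still be running. A hypothetical sketch (invented names, not Mesos code):

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical sketch of the API-call ordering the title requires.
class Agent {
public:
  void launch(const std::string& id) { running_.insert(id); }

  void terminate(const std::string& id) { running_.erase(id); }

  // WAIT_NESTED_CONTAINER completes only once the container has
  // terminated.
  bool waitNestedContainer(const std::string& id) {
    if (running_.count(id) > 0) return false;  // Still running.
    waited_.insert(id);
    return true;
  }

  // REMOVE_NESTED_CONTAINER is rejected unless a successful WAIT was
  // observed first, so the sandbox of a live container is never removed.
  bool removeNestedContainer(const std::string& id) {
    return waited_.count(id) > 0;
  }

private:
  std::set<std::string> running_;
  std::set<std::string> waited_;
};
```

In the agent log above this is exactly what happens with the patch applied: the `WAIT_NESTED_CONTAINER` call at 19:46:39 precedes the `REMOVE_NESTED_CONTAINER` call at 19:46:47.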