[jira] [Commented] (MESOS-9183) IntervalSet upper bound is off by one

2018-08-24 Thread Jie Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16592304#comment-16592304
 ] 

Jie Yu commented on MESOS-9183:
---

This is probably because we convert everything to 
`boost::icl::interval_bounds::static_right_open`, causing an overflow in this 
case.
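
For illustration, here is a standalone sketch of the suspected overflow (plain 
C++, not the actual stout/boost::icl conversion code): representing the closed 
upper bound 65535 as a right-open bound requires 65536, which wraps around to 0 
in a uint16_t, turning the interval into the empty [0, 0).
{code:java}
// Standalone sketch of the suspected overflow; not the actual
// stout/boost::icl conversion code.
#include <stdint.h>

#include <iostream>

int main()
{
  uint16_t closedUpper = 65535;          // closed interval [0, 65535]
  uint16_t openUpper = closedUpper + 1;  // right-open form needs 65536...
  std::cout << openUpper << std::endl;   // ...but it wraps to 0: [0, 0) == {}
  return 0;
}
{code}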

> IntervalSet upper bound is off by one
> -
>
> Key: MESOS-9183
> URL: https://issues.apache.org/jira/browse/MESOS-9183
> Project: Mesos
>  Issue Type: Bug
>  Components: stout
>Reporter: Xudong Ni
>Priority: Minor
>
> The unsigned 16-bit integer range is [0, 65535]; if we try to set this range, 
> the set will be "{}".
> Example code:
> {quote}IntervalSet<uint16_t> set;
> set += (Bound<uint16_t>::closed(0), Bound<uint16_t>::closed(65535));
> Results: "{}"; Expected: "[0, 65535]"
> {quote}
> If we decrease the upper bound by 1 to 65534, it works normally.
> {quote}IntervalSet<uint16_t> set;
> set += (Bound<uint16_t>::closed(0), Bound<uint16_t>::closed(65534));
> Results: "[0, 65535)"; Expected: "[0, 65535)"
> {quote}
> It appears the upper bound is off by one. Since IntervalSet is a template, 
> other types may have the same issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9183) IntervalSet upper bound is off by one

2018-08-24 Thread Xudong Ni (JIRA)
Xudong Ni created MESOS-9183:


 Summary: IntervalSet upper bound is off by one
 Key: MESOS-9183
 URL: https://issues.apache.org/jira/browse/MESOS-9183
 Project: Mesos
  Issue Type: Bug
  Components: stout
Reporter: Xudong Ni


The unsigned 16-bit integer range is [0, 65535]; if we try to set this range, 
the set will be "{}".

Example code:
{quote}IntervalSet<uint16_t> set;

set += (Bound<uint16_t>::closed(0), Bound<uint16_t>::closed(65535));

Results: "{}"; Expected: "[0, 65535]"
{quote}
If we decrease the upper bound by 1 to 65534, it works normally.
{quote}IntervalSet<uint16_t> set;

set += (Bound<uint16_t>::closed(0), Bound<uint16_t>::closed(65534));

Results: "[0, 65535)"; Expected: "[0, 65535)"
{quote}
It appears the upper bound is off by one. Since IntervalSet is a template, 
other types may have the same issue.
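
For convenience, here is a self-contained version of the reproduction (a 
sketch, assuming Mesos's stout headers and their Boost dependency are on the 
include path):
{code:java}
// Sketch of the reproduction using stout's IntervalSet.
#include <stdint.h>

#include <iostream>

#include <stout/interval.hpp>

int main()
{
  // Full uint16_t range: currently prints "{}", but "[0, 65535]" is expected.
  IntervalSet<uint16_t> set;
  set += (Bound<uint16_t>::closed(0), Bound<uint16_t>::closed(65535));
  std::cout << set << std::endl;

  // Upper bound decreased by 1: prints "[0, 65535)", as expected.
  IntervalSet<uint16_t> ok;
  ok += (Bound<uint16_t>::closed(0), Bound<uint16_t>::closed(65534));
  std::cout << ok << std::endl;

  return 0;
}
{code}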



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9182) Improve `class Slave` in the allocator.

2018-08-24 Thread Meng Zhu (JIRA)
Meng Zhu created MESOS-9182:
---

 Summary: Improve `class Slave` in the allocator.
 Key: MESOS-9182
 URL: https://issues.apache.org/jira/browse/MESOS-9182
 Project: Mesos
  Issue Type: Improvement
  Components: allocation
Reporter: Meng Zhu
Assignee: Meng Zhu


Currently, there are several issues with the `Slave` class in the allocator:

(1) Resources on an agent are characterized by two fields: total and 
allocated. However, these two related fields are currently mutated separately 
by different member functions, leading to temporary inconsistencies. This is 
fragile and has produced several odd logic flows.

(2) While we track the aggregate allocated resources on the agent, we do not 
know which frameworks those resources are allocated to. This lack of 
information makes several things difficult: for example, it necessitates the 
odd agent removal logic described in MESOS-621, and we currently cannot update 
the framework sorter by simply looking at the `Slave` class, which leads to 
convoluted logic for updating and tracking (un)allocated resources.
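
A hypothetical sketch (stand-in types, not the actual allocator code) of one 
possible direction: keep the agent's total and the per-framework allocations in 
one structure and mutate them only through members that keep the two views 
consistent:
{code:java}
// Hypothetical sketch with stand-in types (double instead of
// mesos::Resources, std::string instead of FrameworkID); not the
// actual allocator code.
#include <cassert>
#include <map>
#include <string>

struct Slave
{
  double total = 0.0;                         // total resources on the agent
  std::map<std::string, double> allocations;  // frameworkId -> allocated

  // Derived, so it can never disagree with the per-framework breakdown.
  double allocated() const
  {
    double sum = 0.0;
    for (const auto& entry : allocations) {
      sum += entry.second;
    }
    return sum;
  }

  // Single entry point for allocation: the aggregate view and the
  // per-framework breakdown change together.
  void allocate(const std::string& frameworkId, double amount)
  {
    assert(allocated() + amount <= total);
    allocations[frameworkId] += amount;
  }

  void unallocate(const std::string& frameworkId, double amount)
  {
    auto it = allocations.find(frameworkId);
    assert(it != allocations.end() && it->second >= amount);
    it->second -= amount;
    if (it->second == 0.0) {
      allocations.erase(it);
    }
  }
};
{code}
With the per-framework breakdown available, the framework sorter could in 
principle be updated by looking at the `Slave` class alone, addressing (2).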



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8270) Add an agent endpoint to list all active resource providers

2018-08-24 Thread Chun-Hung Hsiao (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16592022#comment-16592022
 ] 

Chun-Hung Hsiao commented on MESOS-8270:


The following documentation patch has been committed and backported to 1.7.0:
{noformat}
commit 71137e87e2ffd8b7455373b6b02c5e662b244cfa (HEAD -> upstream/master)
Author: Benjamin Bannier 
Date: Fri Aug 24 11:01:57 2018 -0700

Documented the `GET_RESOURCE_PROVIDERS` agent API call.

Review: https://reviews.apache.org/r/68504/{noformat}
{noformat}
commit 3cac4ba05bd76bdb4a3100d34b8151a85592701b (HEAD -> upstream/1.7.x)
Author: Benjamin Bannier 
Date: Fri Aug 24 11:01:57 2018 -0700

Documented the `GET_RESOURCE_PROVIDERS` agent API call.

Review: https://reviews.apache.org/r/68504/{noformat}

> Add an agent endpoint to list all active resource providers
> ---
>
> Key: MESOS-8270
> URL: https://issues.apache.org/jira/browse/MESOS-8270
> Project: Mesos
>  Issue Type: Task
>  Components: agent
>Reporter: Jan Schlicht
>Assignee: Jan Schlicht
>Priority: Major
>  Labels: mesosphere
> Fix For: 1.5.0
>
>
> Operators/Frameworks might need information about all resource providers 
> currently running on an agent. An API endpoint should provide that 
> information and include resource provider name and type.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8158) Mesos Agent in docker neglects to retry discovering Task docker containers

2018-08-24 Thread Sjoerd Mulder (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16591768#comment-16591768
 ] 

Sjoerd Mulder commented on MESOS-8158:
--

I'm also experiencing this with a similar setup (mesos-agent inside Docker with 
the docker_mesos_image flag), using Mesos version 1.6.1.

> Mesos Agent in docker neglects to retry discovering Task docker containers
> --
>
> Key: MESOS-8158
> URL: https://issues.apache.org/jira/browse/MESOS-8158
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, containerization, docker, executor
>Affects Versions: 1.4.0
> Environment: Windows 10 with Docker version 17.09.0-ce, build afdb6d4
>Reporter: Charles R Allen
>Priority: Major
>
> I have attempted to launch Mesos agents inside a docker container in such 
> a way that the agent container can be replaced and recovered. Unfortunately, 
> I hit a major snag in the way the Mesos docker launching works.
> To test basic functionality, a Marathon app is set up with the 
> following command: {{date && python -m SimpleHTTPServer $PORT0}}
> That way the HTTP port can be accessed to verify that ports are being 
> assigned correctly, and the date is printed in the log.
> When I attempt to start this Marathon app, the Mesos agent (inside a docker 
> container) launches an executor, which in turn properly creates a second task 
> that launches the python code. Here's the output from the executor logs (this 
> looks correct):
> {code}
> I1101 20:34:03.420210 68270 exec.cpp:162] Version: 1.4.0
> I1101 20:34:03.427455 68281 exec.cpp:237] Executor registered on agent 
> d9bb6e96-ee26-43c2-977e-0c404fdd4e81-S0
> I1101 20:34:03.428414 68283 executor.cpp:120] Registered docker executor on 
> 10.0.75.2
> I1101 20:34:03.428680 68281 executor.cpp:160] Starting task 
> testapp.fe35282f-bf43-11e7-a24b-0242ac110002
> I1101 20:34:03.428941 68281 docker.cpp:1080] Running docker -H 
> unix:///var/run/docker.sock run --cpu-shares 1024 --memory 134217728 -e 
> HOST=10.0.75.2 -e MARATHON_APP_DOCKER_IMAGE=python:2 -e 
> MARATHON_APP_ID=/testapp -e MARATHON_APP_LABELS= -e MARATHON_APP_RESOURCE_CPUS
> =1.0 -e MARATHON_APP_RESOURCE_DISK=0.0 -e MARATHON_APP_RESOURCE_GPUS=0 -e 
> MARATHON_APP_RESOURCE_MEM=128.0 -e 
> MARATHON_APP_VERSION=2017-11-01T20:33:44.869Z -e 
> MESOS_CONTAINER_NAME=mesos-84f9ae30-9d4c-484a-860c-ca7845b7ec75 -e 
> MESOS_SANDBOX=/mnt/mesos/sandbox -e MESOS_TA
> SK_ID=testapp.fe35282f-bf43-11e7-a24b-0242ac110002 -e PORT=31464 -e 
> PORT0=31464 -e PORTS=31464 -e PORT_1=31464 -e PORT_HTTP=31464 -v 
> /var/run/mesos/slaves/d9bb6e96-ee26-43c2-977e-0c404fdd4e81-S0/frameworks/a5eb6da1-f8ac-4642-8d66-cdd2e5b14d45-0001/executors/testapp
> .fe35282f-bf43-11e7-a24b-0242ac110002/runs/84f9ae30-9d4c-484a-860c-ca7845b7ec75:/mnt/mesos/sandbox
>  --net host --entrypoint /bin/sh --name 
> mesos-84f9ae30-9d4c-484a-860c-ca7845b7ec75 
> --label=MESOS_TASK_ID=testapp.fe35282f-bf43-11e7-a24b-0242ac110002 python:2 
> -c date && p
> ython -m SimpleHTTPServer $PORT0
> I1101 20:34:03.430402 68281 docker.cpp:1243] Running docker -H 
> unix:///var/run/docker.sock inspect mesos-84f9ae30-9d4c-484a-860c-ca7845b7ec75
> I1101 20:34:03.520303 68286 docker.cpp:1290] Retrying inspect with non-zero 
> status code. cmd: 'docker -H unix:///var/run/docker.sock inspect 
> mesos-84f9ae30-9d4c-484a-860c-ca7845b7ec75', interval: 500ms
> I1101 20:34:04.021216 68288 docker.cpp:1243] Running docker -H 
> unix:///var/run/docker.sock inspect mesos-84f9ae30-9d4c-484a-860c-ca7845b7ec75
> I1101 20:34:04.124490 68281 docker.cpp:1290] Retrying inspect with non-zero 
> status code. cmd: 'docker -H unix:///var/run/docker.sock inspect 
> mesos-84f9ae30-9d4c-484a-860c-ca7845b7ec75', interval: 500ms
> I1101 20:34:04.624964 68288 docker.cpp:1243] Running docker -H 
> unix:///var/run/docker.sock inspect mesos-84f9ae30-9d4c-484a-860c-ca7845b7ec75
> I1101 20:34:04.934087 68286 docker.cpp:1345] Retrying inspect since container 
> not yet started. cmd: 'docker -H unix:///var/run/docker.sock inspect 
> mesos-84f9ae30-9d4c-484a-860c-ca7845b7ec75', interval: 500ms
> I1101 20:34:05.435145 68288 docker.cpp:1243] Running docker -H 
> unix:///var/run/docker.sock inspect mesos-84f9ae30-9d4c-484a-860c-ca7845b7ec75
> Wed Nov  1 20:34:06 UTC 2017
> {code}
> But somehow a TASK_FAILED message is sent to Marathon.
> Upon further investigation, the following snippet can be found in the agent 
> logs (running in a docker container):
> {code}
> I1101 20:34:00.949129 9 slave.cpp:1736] Got assigned task 
> 'testapp.fe35282f-bf43-11e7-a24b-0242ac110002' for framework 
> a5eb6da1-f8ac-4642-8d66-cdd2e5b14d45-0001
> I1101 20:34:00.950150 9 gc.cpp:93] Unscheduling 
> '/var/run/mesos/slaves/d9bb6e96-ee26-43c2-977e-0c404fdd4e81-S0

[jira] [Assigned] (MESOS-9131) Health checks launching nested containers while a container is being destroyed lead to unkillable tasks

2018-08-24 Thread Andrei Budnik (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik reassigned MESOS-9131:


Assignee: Andrei Budnik  (was: Qian Zhang)

> Health checks launching nested containers while a container is being 
> destroyed lead to unkillable tasks
> ---
>
> Key: MESOS-9131
> URL: https://issues.apache.org/jira/browse/MESOS-9131
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, containerization
>Affects Versions: 1.5.1
>Reporter: Jan Schlicht
>Assignee: Andrei Budnik
>Priority: Blocker
>  Labels: container-stuck
>
> A container might get stuck in {{DESTROYING}} state if there's a command 
> health check that starts new nested containers while its parent container is 
> getting destroyed.
> Here are some logs with unrelated lines removed. The 
> `REMOVE_NESTED_CONTAINER`/`LAUNCH_NESTED_CONTAINER_SESSION` cycle keeps 
> looping afterwards.
> {noformat}
> 2018-04-16 12:37:54: I0416 12:37:54.235877  3863 containerizer.cpp:2807] 
> Container 
> db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133 has 
> exited
> 2018-04-16 12:37:54: I0416 12:37:54.235914  3863 containerizer.cpp:2354] 
> Destroying container 
> db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133 in 
> RUNNING state
> 2018-04-16 12:37:54: I0416 12:37:54.235932  3863 containerizer.cpp:2968] 
> Transitioning the state of container 
> db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133 
> from RUNNING to DESTROYING
> 2018-04-16 12:37:54: I0416 12:37:54.236100  3852 linux_launcher.cpp:514] 
> Asked to destroy container 
> db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.e6e01854-40a0-4da3-b458-2b4cf52bbc11
> 2018-04-16 12:37:54: I0416 12:37:54.237671  3852 linux_launcher.cpp:560] 
> Using freezer to destroy cgroup 
> mesos/db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3/mesos/0e44d4d7-629f-41f1-80df-4aae9583d133/mesos/e6e01854-40a0-4da3-b458-2b4cf52bbc11
> 2018-04-16 12:37:54: I0416 12:37:54.240327  3852 cgroups.cpp:3060] Freezing 
> cgroup 
> /sys/fs/cgroup/freezer/mesos/db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3/mesos/0e44d4d7-629f-41f1-80df-4aae9583d133/mesos/e6e01854-40a0-4da3-b458-2b4cf52bbc11
> 2018-04-16 12:37:54: I0416 12:37:54.244179  3852 cgroups.cpp:1415] 
> Successfully froze cgroup 
> /sys/fs/cgroup/freezer/mesos/db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3/mesos/0e44d4d7-629f-41f1-80df-4aae9583d133/mesos/e6e01854-40a0-4da3-b458-2b4cf52bbc11
>  after 3.814144ms
> 2018-04-16 12:37:54: I0416 12:37:54.250550  3853 cgroups.cpp:3078] Thawing 
> cgroup 
> /sys/fs/cgroup/freezer/mesos/db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3/mesos/0e44d4d7-629f-41f1-80df-4aae9583d133/mesos/e6e01854-40a0-4da3-b458-2b4cf52bbc11
> 2018-04-16 12:37:54: I0416 12:37:54.256599  3853 cgroups.cpp:1444] 
> Successfully thawed cgroup 
> /sys/fs/cgroup/freezer/mesos/db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3/mesos/0e44d4d7-629f-41f1-80df-4aae9583d133/mesos/e6e01854-40a0-4da3-b458-2b4cf52bbc11
>  after 5.977856ms
> ...
> 2018-04-16 12:37:54: I0416 12:37:54.371117  3837 http.cpp:3502] Processing 
> LAUNCH_NESTED_CONTAINER_SESSION call for container 
> 'db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.2bfd8eed-b528-493b-8434-04311e453dcd'
> 2018-04-16 12:37:54: W0416 12:37:54.371692  3842 http.cpp:2758] Failed to 
> launch container 
> db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.2bfd8eed-b528-493b-8434-04311e453dcd:
>  Parent container 
> db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133 is 
> in 'DESTROYING' state
> 2018-04-16 12:37:54: W0416 12:37:54.371826  3840 containerizer.cpp:2337] 
> Attempted to destroy unknown container 
> db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.2bfd8eed-b528-493b-8434-04311e453dcd
> ...
> 2018-04-16 12:37:55: I0416 12:37:55.504456  3856 http.cpp:3078] Processing 
> REMOVE_NESTED_CONTAINER call for container 
> 'db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.check-f3a1238c-7f0f-4db3-bda4-c0ea951d46b6'
> ...
> 2018-04-16 12:37:55: I0416 12:37:55.556367  3857 http.cpp:3502] Processing 
> LAUNCH_NESTED_CONTAINER_SESSION call for container 
> 'db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.check-0db8bd89-6f19-48c6-a69f-40196b4bc211'
> ...
> 2018-04-16 12:37:55: W0416 12:37:55.582137  3850 http.cpp:2758] Failed to 
> launch container 
> db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.check-0db8bd89-6f19-48c6-a69f-40196b4bc211:
>  Parent container 
> db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133 is 
> in 'DESTROYING' state
> ...
> 2018-04-16 12:37:55: W0416 12:37:

[jira] [Comment Edited] (MESOS-8568) Command checks should always call `WAIT_NESTED_CONTAINER` before `REMOVE_NESTED_CONTAINER`

2018-08-24 Thread Qian Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16591235#comment-16591235
 ] 

Qian Zhang edited comment on MESOS-8568 at 8/24/18 7:05 AM:


I ran exactly the same reproduction steps with the above patch applied and 
found that this issue was gone; there is only one check container sandbox 
directory at any time.
{code:java}
$ ls -la 
/home/qzhang/opt/mesos/slaves/1eada535-3848-4c76-b8c5-0e9e0d6fa102-S0/frameworks/8a842ab3-8aba-4d64-a744-ae98bdcf6d59-/executors/default-executor/runs/9a369757-3a5e-47f9-9bfc-adcf3608d8dc/containers/06e7c625-596c-454c-b092-f17a81073349/containers
 | grep check | wc -l
{code}
Here is the agent log; we can see that `WAIT_NESTED_CONTAINER` was called 
before `REMOVE_NESTED_CONTAINER`.
{code:java}
I0823 19:46:39.269901 32604 http.cpp:3366] Processing 
LAUNCH_NESTED_CONTAINER_SESSION call for container 
'9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18'
I0823 19:46:39.277669 32603 switchboard.cpp:316] Container logger module 
finished preparing container 
9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18;
 IOSwitchboard server is required
I0823 19:46:39.284180 32603 systemd.cpp:98] Assigned child process '34701' to 
'mesos_executors.slice'
I0823 19:46:39.284451 32603 switchboard.cpp:604] Created I/O switchboard server 
(pid: 34701) listening on socket file 
'/tmp/mesos-io-switchboard-12e8e4c7-268e-4184-881c-a16b61fa260c' for container 
9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18
I0823 19:46:39.288053 32641 linux_launcher.cpp:492] Launching nested container 
9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18
 and cloning with namespaces 
W0823 19:46:39.302271 32636 http.cpp:2635] Failed to launch container 
9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18:
 Collect failed: ==Fake error==
I0823 19:46:39.304822 32639 linux_launcher.cpp:580] Asked to destroy container 
9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18
I0823 19:46:39.305047 32639 linux_launcher.cpp:622] Destroying cgroup 
'/sys/fs/cgroup/freezer/mesos/9a369757-3a5e-47f9-9bfc-adcf3608d8dc/mesos/06e7c625-596c-454c-b092-f17a81073349/mesos/check-142ccb3b-9ba8-4a04-a79f-29147b921d18'
I0823 19:46:39.306437 32646 cgroups.cpp:2838] Freezing cgroup 
/sys/fs/cgroup/freezer/mesos/9a369757-3a5e-47f9-9bfc-adcf3608d8dc/mesos/06e7c625-596c-454c-b092-f17a81073349/mesos/check-142ccb3b-9ba8-4a04-a79f-29147b921d18
I0823 19:46:39.307015 32614 cgroups.cpp:1229] Successfully froze cgroup 
/sys/fs/cgroup/freezer/mesos/9a369757-3a5e-47f9-9bfc-adcf3608d8dc/mesos/06e7c625-596c-454c-b092-f17a81073349/mesos/check-142ccb3b-9ba8-4a04-a79f-29147b921d18
 after 419840ns
I0823 19:46:39.307715 32641 http.cpp:1117] HTTP POST for /slave(1)/api/v1 from 
10.0.49.2:42086
I0823 19:46:39.308198 32646 cgroups.cpp:2856] Thawing cgroup 
/sys/fs/cgroup/freezer/mesos/9a369757-3a5e-47f9-9bfc-adcf3608d8dc/mesos/06e7c625-596c-454c-b092-f17a81073349/mesos/check-142ccb3b-9ba8-4a04-a79f-29147b921d18
I0823 19:46:39.308298 32641 http.cpp:2685] Processing WAIT_NESTED_CONTAINER 
call for container 
'9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18'
I0823 19:46:39.308583 32605 cgroups.cpp:1258] Successfully thawed cgroup 
/sys/fs/cgroup/freezer/mesos/9a369757-3a5e-47f9-9bfc-adcf3608d8dc/mesos/06e7c625-596c-454c-b092-f17a81073349/mesos/check-142ccb3b-9ba8-4a04-a79f-29147b921d18
 after 265728ns
I0823 19:46:39.373747 32616 linux_launcher.cpp:654] Destroying cgroup 
'/sys/fs/cgroup/systemd/mesos/9a369757-3a5e-47f9-9bfc-adcf3608d8dc/mesos/06e7c625-596c-454c-b092-f17a81073349/mesos/check-142ccb3b-9ba8-4a04-a79f-29147b921d18'
I0823 19:46:44.375650 32647 switchboard.cpp:807] Sending SIGTERM to I/O 
switchboard server (pid: 34701) since container 
9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18
 is being destroyed
I0823 19:46:44.403535 32637 switchboard.cpp:913] I/O switchboard server process 
for container 
9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18
 has terminated (status=0)
I0823 19:46:47.420578 32622 http.cpp:1117] HTTP POST for /slave(1)/api/v1 from 
10.0.49.2:42088
I0823 19:46:47.421331 32622 http.cpp:2971] Processing REMOVE_NESTED_CONTAINER 
call for container 
'9a369757-3a5e-47f9-9bfc-adcf3608d8dc.06e7c625-596c-454c-b092-f17a81073349.check-142ccb3b-9ba8-4a04-a79f-29147b921d18'
I0823 19:46:47.42
