[ https://issues.apache.org/jira/browse/MESOS-7506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16212605#comment-16212605 ]

Andrei Budnik edited comment on MESOS-7506 at 10/20/17 6:28 PM:
----------------------------------------------------------------

The bug has been reproduced with extra debug logs ({{SlaveTest.ShutdownUnregisteredExecutor}}):
{code}
I1020 12:07:20.266032  9274 containerizer.cpp:2220] Destroying container 7f9cb5a6-26c9-4010-ace9-b9cb3e065542 in RUNNING state
I1020 12:07:20.266042  9274 containerizer.cpp:2784] Transitioning the state of container 7f9cb5a6-26c9-4010-ace9-b9cb3e065542 from RUNNING to DESTROYING
I1020 12:07:20.266175  9274 linux_launcher.cpp:514] Asked to destroy container 7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.266717  9274 linux_launcher.cpp:560] Using freezer to destroy cgroup mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.268649  9274 cgroups.cpp:1562] TasksKiller::freeze: /sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.268756  9274 cgroups.cpp:3083] Freezing cgroup /sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.269533  9276 cgroups.cpp:1397] Freezer::freeze: /sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.270486  9276 cgroups.cpp:1422] Freezer::freeze 2: /sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542: FREEZING
I1020 12:07:20.270725  9272 cgroups.cpp:1397] Freezer::freeze: /sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.271625  9272 cgroups.cpp:1415] Successfully froze cgroup /sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542 after 1secs
I1020 12:07:20.271724  9272 hierarchical.cpp:1488] Performed allocation for 1 agents in 18541ns
I1020 12:07:20.271767  9272 cgroups.cpp:1573] TasksKiller::kill: /sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.273386  9272 cgroups.cpp:1596] TasksKiller::thaw: /sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.273486  9272 cgroups.cpp:3101] Thawing cgroup /sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.274129  9272 cgroups.cpp:1431] Freezer::thaw: /sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.276964  9272 cgroups.cpp:1448] Successfully thawed cgroup /sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542 after 0ns
I1020 12:07:20.277225  9277 cgroups.cpp:1602] TasksKiller::reap: /sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.277613  9279 hierarchical.cpp:1488] Performed allocation for 1 agents in 17680ns
I1020 12:07:20.277772  9279 containerizer.cpp:2671] Container 7f9cb5a6-26c9-4010-ace9-b9cb3e065542 has exited
{code}
{{TasksKiller::finished}} was not called, although {{TasksKiller::reap}} was.
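A minimal sketch for spotting the missing stage in a debug log like the one above. The stage list (freeze, kill, thaw, reap, finished) is taken from the log lines and from the observation that {{TasksKiller::finished}} never shows up; the helper name {{firstMissingStage}} is hypothetical, and it checks only whether each stage was logged, not the order of the lines:

```cpp
#include <cassert>
#include <string>
#include <vector>

// TasksKiller stages in the order they appear in the debug log; "finished"
// is the stage reported as missing.
const std::vector<std::string> kTasksKillerStages = {
    "freeze", "kill", "thaw", "reap", "finished"};

// Hypothetical helper: returns the first stage with no "TasksKiller::<stage>"
// line in `logLines`, or an empty string if every stage was logged.
// It checks only presence, not the order of the lines.
std::string firstMissingStage(const std::vector<std::string>& logLines) {
  for (const std::string& stage : kTasksKillerStages) {
    const std::string marker = "TasksKiller::" + stage;
    bool seen = false;
    for (const std::string& line : logLines) {
      if (line.find(marker) != std::string::npos) {
        seen = true;
        break;
      }
    }
    if (!seen) {
      return stage;
    }
  }
  return "";
}
```

Fed the four {{TasksKiller::}} lines from the log above, the helper reports "finished".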


was (Author: abudnik):
{{TasksKiller::finished}} was not called, although {{TasksKiller::reap}} was. So I assume there is a race condition in {{TasksKiller::kill}}: {{cgroups::processes()}}, called in {{TasksKiller::kill}}, probably returns a list L1 that differs from the list L2 returned by the same function inside {{cgroups::kill}}.
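A minimal model of the suspected race. It assumes that {{TasksKiller::kill}} starts reaping every pid in L1, and that {{cgroups::kill}} then re-reads the cgroup and sends SIGKILL only to the pids in L2. The function name is hypothetical, and plain {{int}} stands in for {{pid_t}}. Any pid it returns would be awaited without ever being signalled:

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <set>

// Hypothetical model of the suspected race. `awaited` stands for the list L1
// that TasksKiller::kill gets from cgroups::processes() and (by assumption)
// starts reaping; `signalled` stands for the list L2 that cgroups::kill gets
// from its own cgroups::processes() call and sends SIGKILL to. Plain int is
// used in place of pid_t to keep the sketch portable.
//
// Returns L1 \ L2: pids that are awaited but never signalled. Unless such a
// process exits on its own, its reap would stay pending, so TasksKiller::reap
// would not complete and TasksKiller::finished would not run.
std::set<int> awaitedButNotSignalled(const std::set<int>& awaited,
                                     const std::set<int>& signalled) {
  std::set<int> result;
  // Both std::set ranges are sorted, as std::set_difference requires.
  std::set_difference(awaited.begin(), awaited.end(),
                      signalled.begin(), signalled.end(),
                      std::inserter(result, result.end()));
  return result;
}
```

For example, with L1 = {10, 11, 12} and L2 = {10, 12} it returns {11}. Pids that appear only in L2 are signalled but not awaited, so in this model they would not block {{TasksKiller::reap}}.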

> Multiple tests leave orphan containers.
> ---------------------------------------
>
>                 Key: MESOS-7506
>                 URL: https://issues.apache.org/jira/browse/MESOS-7506
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization
>         Environment: Ubuntu 16.04
> Fedora 23
> other Linux distros
>            Reporter: Alexander Rukletsov
>            Assignee: Andrei Budnik
>              Labels: containerizer, flaky-test, mesosphere
>
> I've observed a number of flaky tests that leave orphan containers upon 
> cleanup. A typical log looks like this:
> {noformat}
> ../../src/tests/cluster.cpp:580: Failure
> Value of: containers->empty()
>   Actual: false
> Expected: true
> Failed to destroy containers: { da3e8aa8-98e7-4e72-a8fd-5d0bae960014 }
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
