[ https://issues.apache.org/jira/browse/MESOS-7506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16212605#comment-16212605 ]
Andrei Budnik edited comment on MESOS-7506 at 10/20/17 6:28 PM:
----------------------------------------------------------------

The bug has been reproduced with extra debug logs in {{SlaveTest.ShutdownUnregisteredExecutor}}:
{code}
I1020 12:07:20.266032 9274 containerizer.cpp:2220] Destroying container 7f9cb5a6-26c9-4010-ace9-b9cb3e065542 in RUNNING state
I1020 12:07:20.266042 9274 containerizer.cpp:2784] Transitioning the state of container 7f9cb5a6-26c9-4010-ace9-b9cb3e065542 from RUNNING to DESTROYING
I1020 12:07:20.266175 9274 linux_launcher.cpp:514] Asked to destroy container 7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.266717 9274 linux_launcher.cpp:560] Using freezer to destroy cgroup mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.268649 9274 cgroups.cpp:1562] TasksKiller::freeze: /sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.268756 9274 cgroups.cpp:3083] Freezing cgroup /sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.269533 9276 cgroups.cpp:1397] Freezer::freeze: /sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.270486 9276 cgroups.cpp:1422] Freezer::freeze 2: /sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542: FREEZING
I1020 12:07:20.270725 9272 cgroups.cpp:1397] Freezer::freeze: /sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.271625 9272 cgroups.cpp:1415] Successfully froze cgroup /sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542 after 1secs
I1020 12:07:20.271724 9272 hierarchical.cpp:1488] Performed allocation for 1 agents in 18541ns
I1020 12:07:20.271767 9272 cgroups.cpp:1573] TasksKiller::kill: /sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.273386 9272 cgroups.cpp:1596] TasksKiller::thaw: /sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.273486 9272 cgroups.cpp:3101] Thawing cgroup /sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.274129 9272 cgroups.cpp:1431] Freezer::thaw: /sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.276964 9272 cgroups.cpp:1448] Successfully thawed cgroup /sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542 after 0ns
I1020 12:07:20.277225 9277 cgroups.cpp:1602] TasksKiller::reap: /sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.277613 9279 hierarchical.cpp:1488] Performed allocation for 1 agents in 17680ns
I1020 12:07:20.277772 9279 containerizer.cpp:2671] Container 7f9cb5a6-26c9-4010-ace9-b9cb3e065542 has exited
{code}
{{TasksKiller::reap}} was called, but {{TasksKiller::finished}} was never called.
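To illustrate why a missing exit status stalls the destroy path at this point, here is a minimal standalone sketch. It is only an illustration of the control flow, not the actual {{TasksKiller}} code: it uses plain {{std::future}} instead of libprocess, and the printed labels merely mimic the names above. If any one of the awaited exit-status futures never completes, the step corresponding to {{TasksKiller::finished}} is never reached, matching the log.
{code}
// A minimal standalone sketch (plain std::future, not libprocess, and not the
// Mesos implementation): it mimics the freeze -> kill -> thaw -> reap ->
// finished chain to show how the final step is never reached if even one of
// the awaited exit statuses never arrives.
#include <chrono>
#include <future>
#include <iostream>
#include <vector>

int main()
{
  // Stand-ins for the exit-status futures obtained by reaping each pid.
  std::vector<std::promise<int>> exits(3);
  std::vector<std::future<int>> statuses;
  for (std::promise<int>& p : exits) {
    statuses.push_back(p.get_future());
  }

  std::cout << "reap: awaiting " << statuses.size() << " exit statuses"
            << std::endl;

  // Only two of the three "processes" ever report an exit status; the third
  // stands in for a pid whose termination is never observed.
  exits[0].set_value(0);
  exits[1].set_value(0);

  // The equivalent of chaining the finished step after collecting all
  // statuses: with one status missing, that step is never reached.
  bool allReady = true;
  for (std::future<int>& f : statuses) {
    if (f.wait_for(std::chrono::seconds(2)) != std::future_status::ready) {
      allReady = false;
    }
  }

  if (allReady) {
    std::cout << "finished" << std::endl;
  } else {
    std::cout << "stalled: the finished step is never reached" << std::endl;
  }

  return 0;
}
{code}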
was (Author: abudnik):
Bug has been reproduced with extra debug logs (SlaveTest.ShutdownUnregisteredExecutor):
{code}
I1020 12:07:20.266032 9274 containerizer.cpp:2220] Destroying container 7f9cb5a6-26c9-4010-ace9-b9cb3e065542 in RUNNING state
I1020 12:07:20.266042 9274 containerizer.cpp:2784] Transitioning the state of container 7f9cb5a6-26c9-4010-ace9-b9cb3e065542 from RUNNING to DESTROYING
I1020 12:07:20.266175 9274 linux_launcher.cpp:514] Asked to destroy container 7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.266717 9274 linux_launcher.cpp:560] Using freezer to destroy cgroup mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.268649 9274 cgroups.cpp:1562] TasksKiller::freeze: /sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.268756 9274 cgroups.cpp:3083] Freezing cgroup /sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.269533 9276 cgroups.cpp:1397] Freezer::freeze: /sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.270486 9276 cgroups.cpp:1422] Freezer::freeze 2: /sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542: FREEZING
I1020 12:07:20.270725 9272 cgroups.cpp:1397] Freezer::freeze: /sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.271625 9272 cgroups.cpp:1415] Successfully froze cgroup /sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542 after 1secs
I1020 12:07:20.271724 9272 hierarchical.cpp:1488] Performed allocation for 1 agents in 18541ns
I1020 12:07:20.271767 9272 cgroups.cpp:1573] TasksKiller::kill: /sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.273386 9272 cgroups.cpp:1596] TasksKiller::thaw: /sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.273486 9272 cgroups.cpp:3101] Thawing cgroup /sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.274129 9272 cgroups.cpp:1431] Freezer::thaw: /sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.276964 9272 cgroups.cpp:1448] Successfully thawed cgroup /sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542 after 0ns
I1020 12:07:20.277225 9277 cgroups.cpp:1602] TasksKiller::reap: /sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.277613 9279 hierarchical.cpp:1488] Performed allocation for 1 agents in 17680ns
I1020 12:07:20.277772 9279 containerizer.cpp:2671] Container 7f9cb5a6-26c9-4010-ace9-b9cb3e065542 has exited
{code}
{{TasksKiller::finished}} wasn't called, while {{TasksKiller::reap}} was called. So, I assume there is a race condition in {{TasksKiller::kill}}. Probably, {{cgroups::processes()}} called in {{TasksKiller::kill}} returns a list L1 which differs from a list L2 returned by the same function in {{cgroups::kill}}.

> Multiple tests leave orphan containers.
> ---------------------------------------
>
>         Key: MESOS-7506
>         URL: https://issues.apache.org/jira/browse/MESOS-7506
>     Project: Mesos
>  Issue Type: Bug
>  Components: containerization
> Environment: Ubuntu 16.04
>              Fedora 23
>              other Linux distros
>    Reporter: Alexander Rukletsov
>    Assignee: Andrei Budnik
>      Labels: containerizer, flaky-test, mesosphere
>
> I've observed a number of flaky tests that leave orphan containers upon cleanup.
> A typical log looks like this:
> {noformat}
> ../../src/tests/cluster.cpp:580: Failure
> Value of: containers->empty()
> Actual: false
> Expected: true
> Failed to destroy containers: { da3e8aa8-98e7-4e72-a8fd-5d0bae960014 }
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
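Regarding the race hypothesized in the earlier version of the comment above (the list L1 read by {{cgroups::processes()}} in {{TasksKiller::kill}} differing from the list L2 re-read in {{cgroups::kill}}), that kind of divergence can be observed on a live agent with a standalone probe like the sketch below. It is an illustration only, not Mesos code: the {{readProcs}} helper exists only in this sketch, and it assumes a cgroups-v1 freezer hierarchy where the pid list is exposed via {{cgroup.procs}}.
{code}
// An assumption-laden standalone probe, not Mesos code: the readProcs()
// helper below is only part of this sketch. It reads the pid list of a
// cgroups-v1 freezer cgroup twice in a row and reports whether the two reads
// differ, i.e. the kind of divergence hypothesized between the list used in
// TasksKiller::kill and the one re-read inside cgroups::kill.
#include <sys/types.h>

#include <fstream>
#include <iostream>
#include <set>
#include <string>

// Read every pid listed in <cgroup>/cgroup.procs into a set.
static std::set<pid_t> readProcs(const std::string& cgroup)
{
  std::set<pid_t> pids;
  std::ifstream procs(cgroup + "/cgroup.procs");
  pid_t pid;
  while (procs >> pid) {
    pids.insert(pid);
  }
  return pids;
}

int main(int argc, char** argv)
{
  if (argc != 2) {
    std::cerr << "Usage: " << argv[0]
              << " /sys/fs/cgroup/freezer/mesos/<container id>" << std::endl;
    return 1;
  }

  // Two back-to-back reads: a process forking (or exiting) in between is
  // enough to make the second set differ from the first.
  const std::set<pid_t> first = readProcs(argv[1]);
  const std::set<pid_t> second = readProcs(argv[1]);

  std::cout << "first read: " << first.size() << " pids, "
            << "second read: " << second.size() << " pids" << std::endl;

  if (first != second) {
    std::cout << "pid lists differ between the two reads" << std::endl;
  }

  return 0;
}
{code}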