Andrei Budnik created MESOS-8489:
------------------------------------

             Summary: LinuxCapabilitiesIsolatorFlagsTest.ROOT_IsolatorFlags is flaky
                 Key: MESOS-8489
                 URL: https://issues.apache.org/jira/browse/MESOS-8489
             Project: Mesos
          Issue Type: Bug
          Components: containerization
            Reporter: Andrei Budnik
         Attachments: ROOT_IsolatorFlags-badrun3.txt

Observed this on internal Mesosphere CI.

{code:java}
../../src/tests/cluster.cpp:662: Failure
Value of: containers->empty()
  Actual: false
Expected: true
Failed to destroy containers: { test }
{code}

h2. Steps to reproduce

# Add {{::sleep(1);}} before [removing|https://github.com/apache/mesos/blob/e91ce42ed56c5ab65220fbba740a8a50c7f835ae/src/linux/cgroups.cpp#L483] the "test" cgroup
# Recompile
# Run `GLOG_v=2 sudo GLOG_v=2 ./src/mesos-tests --gtest_filter=LinuxCapabilitiesIsolatorFlagsTest.ROOT_IsolatorFlags --gtest_break_on_failure --gtest_repeat=10 --verbose`

h2. Race description

While recovery is in progress for [the first agent|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/tests/containerizer/linux_capabilities_isolator_tests.cpp#L733], calling [`StartSlave()`|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/tests/containerizer/linux_capabilities_isolator_tests.cpp#L738] leads to a call to [`slave::Containerizer::create()`|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/tests/cluster.cpp#L431] to create a containerizer. Creating a Mesos containerizer in turn calls [`cgroups::prepare`|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/slave/containerizer/mesos/linux_launcher.cpp#L124], which eventually tries to create a ["test" cgroup|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/linux/cgroups.cpp#L476]. If the recovery process for the first agent scans the cgroup hierarchy while this temporary cgroup still exists, it [might detect|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/slave/containerizer/mesos/linux_launcher.cpp#L268-L301] "test" as an orphaned container.

Thus, there is a race between the recovery process for the first agent and the attempt to create a containerizer for the second agent.