[ https://issues.apache.org/jira/browse/MESOS-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16425431#comment-16425431 ]
Andrei Budnik commented on MESOS-8489:
--------------------------------------

We have multiple race conditions between simultaneously running agents in tests. By default, we launch slaves using the same cgroup hierarchy. The Linux launcher and some isolators call `cgroups::prepare()`, which creates and then immediately removes the `mesos/test` cgroup to check whether the kernel supports nested cgroups.

The first race condition is between `LinuxLauncher::create()` and `LinuxLauncher::recover()`. The former calls `cgroups::prepare()` while the latter iterates over the cgroup hierarchy to detect orphan containers. We also call `destroy()` for detected orphan containers, which leads to a race condition as well.

The second race condition happens when `cgroups::prepare()` is called in parallel.

https://reviews.apache.org/r/66449/ fixes all of the above cases for the `LinuxCapabilitiesIsolatorFlagsTest.ROOT_IsolatorFlags` test.

> LinuxCapabilitiesIsolatorFlagsTest.ROOT_IsolatorFlags is flaky
> --------------------------------------------------------------
>
>                 Key: MESOS-8489
>                 URL: https://issues.apache.org/jira/browse/MESOS-8489
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization
>            Reporter: Andrei Budnik
>            Assignee: Andrei Budnik
>            Priority: Major
>              Labels: containerizer, flaky-test, mesosphere
>         Attachments: ROOT_IsolatorFlags-badrun3.txt
>
>
> Observed this on internal Mesosphere CI.
> {code:java}
> ../../src/tests/cluster.cpp:662: Failure
> Value of: containers->empty()
>   Actual: false
> Expected: true
> Failed to destroy containers: { test }
> {code}
> h2. Steps to reproduce
>  # Add {{::sleep(1);}} before
> [removing|https://github.com/apache/mesos/blob/e91ce42ed56c5ab65220fbba740a8a50c7f835ae/src/linux/cgroups.cpp#L483]
> "test" cgroup
>  # recompile
>  # run `GLOG_v=2 sudo GLOG_v=2 ./src/mesos-tests
> --gtest_filter=LinuxCapabilitiesIsolatorFlagsTest.ROOT_IsolatorFlags
> --gtest_break_on_failure --gtest_repeat=10 --verbose`
> h2. Race description
> While recovery is in progress for [the first
> slave|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/tests/containerizer/linux_capabilities_isolator_tests.cpp#L733],
> calling
> [`StartSlave()`|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/tests/containerizer/linux_capabilities_isolator_tests.cpp#L738]
> leads to calling
> [`slave::Containerizer::create()`|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/tests/cluster.cpp#L431]
> to create a containerizer. An attempt to create a Mesos containerizer leads to
> calling
> [`cgroups::prepare`|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/slave/containerizer/mesos/linux_launcher.cpp#L124].
> Finally, we get to the point where we try to create a ["test"
> cgroup|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/linux/cgroups.cpp#L476].
> So, the recovery process for the first slave [might
> detect|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/slave/containerizer/mesos/linux_launcher.cpp#L268-L301]
> this "test" cgroup as an orphaned container.
> Thus, there is a race between the recovery process for the first slave and an
> attempt to create a containerizer for the second agent.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
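
For illustration only, the interleaving from the race description can be replayed deterministically with an in-memory stand-in for the shared cgroup hierarchy. This is a minimal sketch, not Mesos code: `Hierarchy`, `probeCreate`, `probeRemove`, `findOrphans` and `replayBadInterleaving` are hypothetical names, and the orphan check is deliberately simplified to "any cgroup under `mesos/` that the recovering agent does not know about".

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical in-memory model of the cgroup hierarchy that all agents in a
// test share by default. It only tracks cgroup paths.
struct Hierarchy {
  std::set<std::string> cgroups;
};

// The two halves of the nested-cgroup probe: create `mesos/test`, then remove
// it. They are split here so another agent can be scheduled in between.
void probeCreate(Hierarchy& h) { h.cgroups.insert("mesos/test"); }
void probeRemove(Hierarchy& h) { h.cgroups.erase("mesos/test"); }

// Simplified recovery scan (an assumption, not the real launcher logic):
// every cgroup under `mesos/` that is not a known container is an orphan.
std::vector<std::string> findOrphans(
    const Hierarchy& h, const std::set<std::string>& known) {
  const std::string prefix = "mesos/";
  std::vector<std::string> orphans;
  for (const std::string& cgroup : h.cgroups) {
    const bool underRoot = cgroup.compare(0, prefix.size(), prefix) == 0;
    if (underRoot && known.count(cgroup) == 0) {
      orphans.push_back(cgroup.substr(prefix.size()));
    }
  }
  return orphans;
}

// Replays the bad ordering: the second agent is halfway through its probe
// when the first agent scans the hierarchy during recovery.
std::vector<std::string> replayBadInterleaving() {
  Hierarchy shared;
  probeCreate(shared);  // second agent: probe creates `mesos/test`
  std::vector<std::string> orphans = findOrphans(shared, {});  // first agent
  probeRemove(shared);  // second agent: probe removes `mesos/test`
  assert(shared.cgroups.empty());
  return orphans;
}
```

Calling `replayBadInterleaving()` returns a single orphan named `test`, which matches the `Failed to destroy containers: { test }` failure quoted above. If the scan runs only after `probeRemove()`, it finds nothing.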