[ 
https://issues.apache.org/jira/browse/MESOS-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16425431#comment-16425431
 ] 

Andrei Budnik commented on MESOS-8489:
--------------------------------------

We have multiple race conditions between agents that run simultaneously in 
tests. By default, we launch these agents using the same cgroup hierarchy. The 
Linux launcher and some isolators call `cgroups::prepare()`, which creates and 
then immediately removes the `mesos/test` cgroup to check whether the kernel 
supports nested cgroups.

The first race condition is between `LinuxLauncher::create()` and 
`LinuxLauncher::recover()`. The former calls `cgroups::prepare()` while the 
latter iterates over the cgroups hierarchy to detect orphan containers. In 
addition, we call `destroy()` for each detected orphan container, which also 
leads to a race condition.
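
The interleaving behind this first race can be replayed deterministically 
with a small model. This is only a sketch: `Hierarchy`, `probeCreate()`, 
`probeRemove()`, `findOrphans()` and `replayRace()` are hypothetical names 
made up for illustration, and an in-memory set stands in for the real cgroup 
filesystem. None of this is the Mesos code.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical in-memory stand-in for the cgroup hierarchy shared by
// both agents. The real code works on the kernel's cgroup filesystem.
struct Hierarchy {
  std::set<std::string> cgroups;
};

// First half of the nested-cgroup probe: create the test cgroup.
void probeCreate(Hierarchy& h) { h.cgroups.insert("mesos/test"); }

// Second half of the probe: remove the test cgroup again.
void probeRemove(Hierarchy& h) { h.cgroups.erase("mesos/test"); }

// Models the orphan scan done during recovery: every cgroup under
// "mesos/" whose id is not a known container is reported as an orphan.
std::vector<std::string> findOrphans(
    const Hierarchy& h, const std::set<std::string>& known)
{
  const std::string prefix = "mesos/";
  std::vector<std::string> orphans;
  for (const std::string& cgroup : h.cgroups) {
    if (cgroup.compare(0, prefix.size(), prefix) != 0) {
      continue;
    }
    const std::string id = cgroup.substr(prefix.size());
    if (known.count(id) == 0) {
      orphans.push_back(id);
    }
  }
  return orphans;
}

// Replays the problematic ordering: the scan of one agent runs between
// the create and remove steps of the other agent's probe.
std::vector<std::string> replayRace()
{
  Hierarchy shared;
  probeCreate(shared);  // agent 2: probe creates "mesos/test"
  std::vector<std::string> orphans = findOrphans(shared, {});  // agent 1
  probeRemove(shared);  // agent 2: probe removes "mesos/test"
  assert(shared.cgroups.empty());  // the probe itself leaves no trace
  return orphans;
}
```

In this model `replayRace()` reports `test` as an orphan even though the 
hierarchy is empty again once the probe finishes, so a later `destroy()` of 
that orphan would target a cgroup that may already be gone.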

The second race condition occurs when `cgroups::prepare()` is called in parallel.
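
A similar sketch shows why two overlapping probes interfere. It assumes that 
creating and removing a cgroup behave like mkdir(2) and rmdir(2), i.e. 
creating an existing cgroup fails and removing a missing one fails. 
`createCgroup()`, `removeCgroup()`, `Outcome` and `interleaveProbes()` are 
hypothetical names, and the distinct-name variant is shown only for contrast. 
It is not necessarily what the review linked in this comment does.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical in-memory stand-in for the shared cgroup hierarchy.
struct Hierarchy {
  std::set<std::string> cgroups;
};

// Assumed to behave like mkdir(2): fails if the cgroup already exists.
bool createCgroup(Hierarchy& h, const std::string& name)
{
  return h.cgroups.insert(name).second;
}

// Assumed to behave like rmdir(2): fails if the cgroup does not exist.
bool removeCgroup(Hierarchy& h, const std::string& name)
{
  return h.cgroups.erase(name) == 1;
}

// Success or failure of each step of two overlapping probes A and B.
struct Outcome {
  bool aCreated;
  bool bCreated;
  bool aRemoved;
  bool bRemoved;
};

// Interleaves the probes as: A creates, B creates, A removes, B removes.
Outcome interleaveProbes(const std::string& nameA, const std::string& nameB)
{
  Hierarchy shared;
  Outcome outcome;
  outcome.aCreated = createCgroup(shared, nameA);
  outcome.bCreated = createCgroup(shared, nameB);
  outcome.aRemoved = removeCgroup(shared, nameA);
  outcome.bRemoved = removeCgroup(shared, nameB);
  assert(shared.cgroups.empty());  // nothing is left behind either way
  return outcome;
}
```

With the same name for both probes, the second create and the second remove 
fail in this model. With distinct names, all four steps succeed.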

https://reviews.apache.org/r/66449/ fixes all of the above cases for the 
`LinuxCapabilitiesIsolatorFlagsTest.ROOT_IsolatorFlags` test.

> LinuxCapabilitiesIsolatorFlagsTest.ROOT_IsolatorFlags is flaky
> --------------------------------------------------------------
>
>                 Key: MESOS-8489
>                 URL: https://issues.apache.org/jira/browse/MESOS-8489
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization
>            Reporter: Andrei Budnik
>            Assignee: Andrei Budnik
>            Priority: Major
>              Labels: containerizer, flaky-test, mesosphere
>         Attachments: ROOT_IsolatorFlags-badrun3.txt
>
>
> Observed this on internal Mesosphere CI.
> {code:java}
> ../../src/tests/cluster.cpp:662: Failure
> Value of: containers->empty()
>   Actual: false
> Expected: true
> Failed to destroy containers: { test }
> {code}
> h2. Steps to reproduce
>  # Add {{::sleep(1);}} before 
> [removing|https://github.com/apache/mesos/blob/e91ce42ed56c5ab65220fbba740a8a50c7f835ae/src/linux/cgroups.cpp#L483]
>  "test" cgroup
>  # recompile
>  # run `sudo GLOG_v=2 ./src/mesos-tests 
> --gtest_filter=LinuxCapabilitiesIsolatorFlagsTest.ROOT_IsolatorFlags 
> --gtest_break_on_failure --gtest_repeat=10 --verbose`
> h2. Race description
> While recovery is in progress for [the first 
> slave|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/tests/containerizer/linux_capabilities_isolator_tests.cpp#L733],
>  calling 
> [`StartSlave()`|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/tests/containerizer/linux_capabilities_isolator_tests.cpp#L738]
>  leads to calling 
> [`slave::Containerizer::create()`|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/tests/cluster.cpp#L431]
>  to create a containerizer. An attempt to create a Mesos containerizer leads to 
> calling 
> [`cgroups::prepare`|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/slave/containerizer/mesos/linux_launcher.cpp#L124].
>  Finally, we reach the point where we try to create a ["test" 
> container|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/linux/cgroups.cpp#L476].
>  So, the recovery process for the second slave [might 
> detect|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/slave/containerizer/mesos/linux_launcher.cpp#L268-L301]
>  this "test" container as an orphaned container.
> Thus, there is a race between the recovery process for the first slave and an 
> attempt to create a containerizer for the second agent.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
