> On Dec. 22, 2017, 3:38 a.m., Greg Mann wrote:
> > This looks like a reasonable solution to me. However, it would be great if
> > we could reproduce the bug and then verify the fix. Looking at the log of a
> > failed test run in the JIRA, it seems to me that the problem occurs when
> > cleanup of an orphaned container left over from a previous test is
> > attempted by the agent destructor called during
> > `LinuxCapabilitiesIsolatorFlagsTest.ROOT_IsolatorFlags`. To attempt a
> > repro, I would suggest the following:
> >
> > 1) Peg the CPU on the machine so that libprocess takes a long time to
> >    process messages in its queue
> > 2) Run `LinuxCapabilitiesIsolatorFlagsTest.ROOT_IsolatorFlags` and one
> >    other (fast-running) test which creates a container, setting
> >    '--gtest_repeat=-1'
> >
> > Hopefully, this may recreate the circumstances which led to the failure in
> > CI?
>
> Andrei Budnik wrote:
>     That's a good idea! I'm going to try to reproduce the bug by running
>     multiple tests simultaneously.

I was able to reproduce the bug by creating an orphaned container and adding a few `sleep()` calls:
https://github.com/abudnik/mesos/commit/db85fe7bd21e8d820e9b360b85d7129cffe8d3b6

Command to launch the test:
```
GLOG_v=2 sudo GLOG_v=2 ./src/mesos-tests --gtest_filter=LinuxCapabilitiesIsolatorFlagsTest.ROOT_IsolatorFlags --verbose
```

With the orphaned container and the sleeps in place, the test always fails when recovery completion is ignored.


- Andrei


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64770/#review194394
-----------------------------------------------------------


On Dec. 21, 2017, 3:58 p.m., Andrei Budnik wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64770/
> -----------------------------------------------------------
>
> (Updated Dec. 21, 2017, 3:58 p.m.)
>
>
> Review request for mesos, Alexander Rukletsov, Greg Mann, and Joseph Wu.
>
>
> Bugs: MESOS-7506
>     https://issues.apache.org/jira/browse/MESOS-7506
>
>
> Repository: mesos
>
>
> Description
> -------
>
> There was a race condition leading to a flaky
> `LinuxCapabilitiesIsolatorFlagsTest.ROOT_IsolatorFlags` test.
> This test launches multiple agents in succession, reusing the same
> variable. When the variable is reassigned, the previous agent's d'tor is
> called. If agent recovery has not yet completed, an orphaned container
> might briefly show up in the agent's d'tor, as described in the code
> comment.
>
>
> Diffs
> -----
>
>   src/tests/cluster.cpp f964bf0cd0cf22374877e5748ba142dcb5fee133
>
>
> Diff: https://reviews.apache.org/r/64770/diff/4/
>
>
> Testing
> -------
>
> sudo make check (fedora 25)
> internal CI
>
>
> Thanks,
>
> Andrei Budnik
>
>
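
For readers unfamiliar with the ordering mentioned in the description, here is a minimal standalone sketch of what happens when one owning variable is reused for successive agents. It is not Mesos code: `FakeAgent` and `reassignAgents` are hypothetical names, and `std::unique_ptr` is only assumed here as a stand-in for whatever owning handle the test actually uses.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a test agent; NOT the real Mesos agent class.
// It only records when it is started and when it is destroyed.
struct FakeAgent {
  FakeAgent(std::string name, std::vector<std::string>* log)
    : name(std::move(name)), log(log) {
    assert(log != nullptr);
    log->push_back("start " + this->name);
  }

  ~FakeAgent() { log->push_back("destroy " + name); }

  std::string name;
  std::vector<std::string>* log;
};

// Reuses a single variable for two successive agents and returns the
// sequence of recorded events.
std::vector<std::string> reassignAgents() {
  std::vector<std::string> log;
  {
    std::unique_ptr<FakeAgent> agent =
      std::make_unique<FakeAgent>("first", &log);

    // The right-hand side is fully constructed before the assignment
    // releases the old object, so the first agent's destructor runs
    // only after the second agent has already started.
    agent = std::make_unique<FakeAgent>("second", &log);
  }
  return log;
}
```

Calling `reassignAgents()` records the start of the second agent before the destruction of the first. In this toy model nothing is asynchronous; in the real test the first agent's recovery may still be in flight when its d'tor runs, which is the window the description refers to.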