[ https://issues.apache.org/jira/browse/MESOS-7506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16225515#comment-16225515 ]
Andrei Budnik commented on MESOS-7506:
--------------------------------------

Some tests (from {{SlaveTest}} and {{SlaveRecoveryTest}}) follow a pattern [like this|https://github.com/apache/mesos/blob/ff01d0c44251e2ffaa2f4f47b33c790594d194d9/src/tests/slave_tests.cpp#L393-L406]: the clock is advanced by {{executor_registration_timeout}}, and the test then waits in a loop until a task status update is sent. This loop runs while the container is being destroyed. Container destruction itself consists of multiple steps, one of which waits for [cgroups destruction|https://github.com/apache/mesos/blob/ff01d0c44251e2ffaa2f4f47b33c790594d194d9/src/slave/containerizer/mesos/linux_launcher.cpp#L567].

This means we have a race between the container destruction process and the loop that advances the clock, with the following possible outcomes:
# The container is completely destroyed before the clock advances far enough to reach a timeout (e.g. {{cgroups::DESTROY_TIMEOUT}}).
# A timeout is triggered by advancing the clock before container destruction completes. That results in [leaving orphaned|https://github.com/apache/mesos/blob/ff01d0c44251e2ffaa2f4f47b33c790594d194d9/src/slave/containerizer/mesos/containerizer.cpp#L2367-L2380] containers, which are detected by the [Slave destructor|https://github.com/apache/mesos/blob/ff01d0c44251e2ffaa2f4f47b33c790594d194d9/src/tests/cluster.cpp#L559-L584] in {{tests/cluster.cpp}}, so the test fails.

The issue is easily reproduced by advancing the clock by 60 seconds or more in the loop that waits for a status update.

> Multiple tests leave orphan containers.
> ---------------------------------------
>
>                 Key: MESOS-7506
>                 URL: https://issues.apache.org/jira/browse/MESOS-7506
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization
>         Environment: Ubuntu 16.04
>                      Fedora 23
>                      other Linux distros
>            Reporter: Alexander Rukletsov
>            Assignee: Andrei Budnik
>              Labels: containerizer, flaky-test, mesosphere
>
> I've observed a number of flaky tests that leave orphan containers upon
> cleanup. A typical log looks like this:
> {noformat}
> ../../src/tests/cluster.cpp:580: Failure
> Value of: containers->empty()
>   Actual: false
> Expected: true
> Failed to destroy containers: { da3e8aa8-98e7-4e72-a8fd-5d0bae960014 }
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
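
The race described in the comment can be illustrated with a small, deterministic model. This is only a sketch under stated assumptions: it does not use any Mesos or libprocess API. The function {{simulate}}, its parameters, and the {{Outcome}} enum are hypothetical names introduced for illustration, and the 60-second timeout in the usage note is an assumed stand-in for {{cgroups::DESTROY_TIMEOUT}}. Destruction is modelled as work that needs a fixed number of loop iterations of real time, while each iteration also advances a virtual clock.

```cpp
#include <cassert>

// Hypothetical model of the race; none of these names are Mesos APIs.
enum class Outcome { Destroyed, Orphaned };

// destroyWorkIterations:   loop iterations of real time that the container
//                          destruction needs in order to finish.
// advancePerIterationSecs: how far the test advances the virtual clock in
//                          each iteration of its wait loop.
// destroyTimeoutSecs:      virtual-time budget for destruction (an assumed
//                          stand-in for cgroups::DESTROY_TIMEOUT).
Outcome simulate(
    int destroyWorkIterations,
    int advancePerIterationSecs,
    int destroyTimeoutSecs)
{
  int virtualElapsedSecs = 0;

  for (int iteration = 1;; ++iteration) {
    // Destruction progresses in real time, independently of the virtual
    // clock, so it completes after a fixed number of iterations.
    if (iteration >= destroyWorkIterations) {
      return Outcome::Destroyed;
    }

    // The wait loop advances the virtual clock on every iteration.
    virtualElapsedSecs += advancePerIterationSecs;

    // If the virtual clock reaches the timeout first, destruction is
    // abandoned and the container is left behind.
    if (virtualElapsedSecs >= destroyTimeoutSecs) {
      return Outcome::Orphaned;
    }
  }
}
```

In this model, {{simulate(3, 1, 60)}} yields {{Outcome::Destroyed}} because destruction finishes long before 60 virtual seconds have passed, whereas {{simulate(3, 60, 60)}} yields {{Outcome::Orphaned}} because a single 60-second advance reaches the timeout first. That mirrors the observation that advancing the clock by 60 seconds or more in the wait loop reproduces the failure.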