[ https://issues.apache.org/jira/browse/MESOS-7506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16243729#comment-16243729 ]
Andrei Budnik commented on MESOS-7506:
--------------------------------------

*Second cause*

{{[ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.TaskWithFileURI/0|https://github.com/apache/mesos/blob/e5f1115739758163d95110960dd829e65cbbebff/src/tests/default_executor_tests.cpp#L1912]}} launches a task group, so each task is launched via {{ComposingContainerizer}}. When the test completes (after receiving the TASK_FINISHED status update), the Slave destructor is called, where [it waits|https://github.com/apache/mesos/blob/e5f1115739758163d95110960dd829e65cbbebff/src/tests/cluster.cpp#L574] for each [container's termination future|https://github.com/apache/mesos/blob/e5f1115739758163d95110960dd829e65cbbebff/src/slave/containerizer/mesos/containerizer.cpp#L2528] to be triggered.

Because this test uses {{ComposingContainerizer}}, [calling destroy|https://github.com/apache/mesos/blob/e5f1115739758163d95110960dd829e65cbbebff/src/tests/cluster.cpp#L572] for a container makes {{ComposingContainerizer}} subscribe to the same [container's termination future|https://github.com/apache/mesos/blob/e5f1115739758163d95110960dd829e65cbbebff/src/slave/containerizer/composing.cpp#L638-L647] via the {{onAny}} method. Once this future is triggered, a lambda function is dispatched; it removes the {{containerId}} from the hash set.

When a container's termination future [is set|https://github.com/apache/mesos/blob/e5f1115739758163d95110960dd829e65cbbebff/3rdparty/libprocess/include/process/future.hpp#L1524], {{[AWAIT(wait)|https://github.com/apache/mesos/blob/e5f1115739758163d95110960dd829e65cbbebff/src/tests/cluster.cpp#L574]}} might already [be satisfied|https://github.com/apache/mesos/blob/e5f1115739758163d95110960dd829e65cbbebff/3rdparty/libprocess/include/process/gtest.hpp#L83], so the container hash set can be [requested (dispatched)|https://github.com/apache/mesos/blob/e5f1115739758163d95110960dd829e65cbbebff/src/tests/cluster.cpp#L577] before that lambda has removed the {{containerId}}.
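The ordering described above can be illustrated with a small single-threaded model. This is only a sketch under stated assumptions, not Mesos or libprocess code: {{EventQueue}}, {{TerminationFuture}} and {{containersSeenByTest}} are hypothetical names, the queue stands in for asynchronous dispatch, and the container ID is a shortened, illustrative value. It shows that a future can be observed as ready while the dispatched cleanup lambda is still pending.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical model: none of these types are Mesos or libprocess APIs.

// Stand-in for asynchronous dispatch: queued work only runs when drained.
struct EventQueue {
  std::deque<std::function<void()>> pending;

  void dispatch(std::function<void()> f) { pending.push_back(std::move(f)); }

  void drain() {
    while (!pending.empty()) {
      std::function<void()> f = std::move(pending.front());
      pending.pop_front();
      f();
    }
  }
};

// Stand-in for a termination future with onAny-style callbacks.
struct TerminationFuture {
  bool ready = false;
  std::vector<std::function<void()>> onAnyCallbacks;

  void onAny(std::function<void()> f) { onAnyCallbacks.push_back(std::move(f)); }

  void set() {
    assert(!ready);
    ready = true;  // From here on, a waiter may observe readiness.
    for (auto& callback : onAnyCallbacks) {
      callback();  // Each callback only *dispatches* the cleanup.
    }
  }
};

// Returns how many containers the "test" sees after the wait is satisfied.
// drainBeforeCheck == false models the bad interleaving: the check happens
// before the dispatched cleanup lambda has run.
std::size_t containersSeenByTest(bool drainBeforeCheck) {
  EventQueue queue;
  TerminationFuture termination;
  std::set<std::string> containers = {"da3e8aa8"};  // illustrative ID

  // Cleanup is dispatched from the callback, not executed inline.
  termination.onAny([&] {
    queue.dispatch([&] { containers.erase("da3e8aa8"); });
  });

  termination.set();
  assert(termination.ready);  // The wait on the future is now satisfied.

  if (drainBeforeCheck) {
    queue.drain();  // Cleanup ran before the check.
  }
  std::size_t seen = containers.size();

  queue.drain();  // Cleanup eventually runs in either case.
  return seen;
}
```

In this model, {{containersSeenByTest(false)}} returns 1 (the container is reported as an orphan even though its cleanup is merely pending), while {{containersSeenByTest(true)}} returns 0.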
There is a race between the thread that sets the container's termination future (calling {{onReadyCallbacks}} and then {{onAnyCallbacks}}, where calling {{onAnyCallbacks}} dispatches the aforementioned lambda) and the test thread, which waits for the container's termination future and then calls {{containerizer->containers()}}.

To reproduce this case, add a sleep of ~10ms before [internal::run(copy->onAnyCallbacks, *this)|https://github.com/apache/mesos/blob/e5f1115739758163d95110960dd829e65cbbebff/3rdparty/libprocess/include/process/future.hpp#L1537] and remove the existing sleep from [process::internal::await|https://github.com/apache/mesos/blob/e5f1115739758163d95110960dd829e65cbbebff/3rdparty/libprocess/include/process/gtest.hpp#L92].

> Multiple tests leave orphan containers.
> ---------------------------------------
>
> Key: MESOS-7506
> URL: https://issues.apache.org/jira/browse/MESOS-7506
> Project: Mesos
> Issue Type: Bug
> Components: containerization
> Environment: Ubuntu 16.04
> Fedora 23
> other Linux distros
> Reporter: Alexander Rukletsov
> Assignee: Andrei Budnik
> Labels: containerizer, flaky-test, mesosphere
> Attachments: KillMultipleTasks-badrun.txt, ResourceLimitation-badrun.txt, TaskWithFileURI-badrun.txt
>
>
> I've observed a number of flaky tests that leave orphan containers upon cleanup.
> A typical log looks like this:
> {noformat}
> ../../src/tests/cluster.cpp:580: Failure
> Value of: containers->empty()
> Actual: false
> Expected: true
> Failed to destroy containers: { da3e8aa8-98e7-4e72-a8fd-5d0bae960014 }
> {noformat}
> All currently affected tests:
> {noformat}
> ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.KillTask/0
> ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.TaskWithFileURI/0
> ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.ResourceLimitation/0
> ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.KillMultipleTasks/0
> SlaveRecoveryTest/0.RecoverUnregisteredExecutor
> SlaveRecoveryTest/0.CleanupExecutor
> SlaveRecoveryTest/0.RecoverTerminatedExecutor
> SlaveTest.ShutdownUnregisteredExecutor
> ShutdownUnregisteredExecutor
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)