[ 
https://issues.apache.org/jira/browse/MESOS-7506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16243729#comment-16243729
 ] 

Andrei Budnik commented on MESOS-7506:
--------------------------------------

*Second cause*

{{[ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.TaskWithFileURI/0|https://github.com/apache/mesos/blob/e5f1115739758163d95110960dd829e65cbbebff/src/tests/default_executor_tests.cpp#L1912]}}
 launches a task group, so each task is launched using {{ComposingContainerizer}}.
When this test completes (after receiving the TASK_FINISHED status update), the Slave 
destructor is called, where [it 
waits|https://github.com/apache/mesos/blob/e5f1115739758163d95110960dd829e65cbbebff/src/tests/cluster.cpp#L574]
 for each [container's termination 
future|https://github.com/apache/mesos/blob/e5f1115739758163d95110960dd829e65cbbebff/src/slave/containerizer/mesos/containerizer.cpp#L2528].
As this test uses {{ComposingContainerizer}}, [calling 
destroy|https://github.com/apache/mesos/blob/e5f1115739758163d95110960dd829e65cbbebff/src/tests/cluster.cpp#L572]
 for a container means that {{ComposingContainerizer}} subscribes to the same 
[container's termination 
future|https://github.com/apache/mesos/blob/e5f1115739758163d95110960dd829e65cbbebff/src/slave/containerizer/composing.cpp#L638-L647]
 via the {{onAny}} method. Once this future is set, the registered lambda is 
dispatched. This lambda removes {{containerId}} from the hash set.

When a container's termination future [is 
set|https://github.com/apache/mesos/blob/e5f1115739758163d95110960dd829e65cbbebff/3rdparty/libprocess/include/process/future.hpp#L1524],
 then 
{{[AWAIT(wait)|https://github.com/apache/mesos/blob/e5f1115739758163d95110960dd829e65cbbebff/src/tests/cluster.cpp#L574]}}
 might [be 
satisfied|https://github.com/apache/mesos/blob/e5f1115739758163d95110960dd829e65cbbebff/3rdparty/libprocess/include/process/gtest.hpp#L83],
 hence the hash set of containers will be [requested 
(dispatched)|https://github.com/apache/mesos/blob/e5f1115739758163d95110960dd829e65cbbebff/src/tests/cluster.cpp#L577].
 There is a race between the thread that sets the container's termination 
future, which calls {{onReadyCallbacks}} and then {{onAnyCallbacks}} (the latter 
dispatches the aforementioned lambda), and the test 
thread, which waits for the container's termination future and then calls 
{{containerizer->containers()}}. If the test thread gets there first, the 
container is still in the hash set and the test reports it as not destroyed.
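
The interleaving described above can be sketched in isolation. This is only a toy model, not Mesos or libprocess code: {{ToyFuture}}, {{containersSeenAfterWait}} and the container id "c1" are hypothetical names made up for illustration, and the {{beforeCallbacks}} hook stands in for the window between marking the future ready and running the {{onAny}} callbacks. With {{forceRace}} set, the setter thread is held in that window until the waiting thread has already read the set.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <set>
#include <string>
#include <thread>
#include <vector>

// Toy stand-in for a termination future; NOT the libprocess Future API.
class ToyFuture {
public:
  void onAny(std::function<void()> callback) {
    std::lock_guard<std::mutex> lock(mutex_);
    callbacks_.push_back(std::move(callback));
  }

  // Marks the future ready, then runs `beforeCallbacks` (the window in
  // which a waiter can already observe readiness), then the callbacks.
  void set(const std::function<void()>& beforeCallbacks) {
    std::vector<std::function<void()>> copy;
    {
      std::lock_guard<std::mutex> lock(mutex_);
      ready_ = true;
      copy = callbacks_;
    }
    readyCv_.notify_all();
    beforeCallbacks();
    for (auto& callback : copy) {
      callback();
    }
  }

  // Blocks until the future has been marked ready.
  void await() {
    std::unique_lock<std::mutex> lock(mutex_);
    readyCv_.wait(lock, [this] { return ready_; });
  }

private:
  std::mutex mutex_;
  std::condition_variable readyCv_;
  bool ready_ = false;
  std::vector<std::function<void()>> callbacks_;
};

// Returns how many containers the waiting thread sees after the wait.
// With `forceRace`, the setter is paused between marking the future
// ready and running the callbacks until the waiter has read the set.
std::size_t containersSeenAfterWait(bool forceRace) {
  std::mutex mutex;  // Guards `containers` and `observed`.
  std::condition_variable cv;
  std::set<std::string> containers = {"c1"};
  bool observed = false;
  std::size_t seen = 0;

  ToyFuture termination;
  // Plays the role of the lambda that removes the container id.
  termination.onAny([&] {
    std::lock_guard<std::mutex> lock(mutex);
    containers.erase("c1");
  });

  std::thread setter([&] {
    termination.set([&] {
      if (forceRace) {
        std::unique_lock<std::mutex> lock(mutex);
        cv.wait(lock, [&] { return observed; });
      }
    });
  });

  if (!forceRace) {
    setter.join();  // Callbacks have finished before we look at the set.
  }

  termination.await();
  {
    std::lock_guard<std::mutex> lock(mutex);
    seen = containers.size();
    observed = true;
  }
  cv.notify_all();

  if (forceRace) {
    setter.join();
  }
  return seen;
}
```

Calling {{containersSeenAfterWait(true)}} models the failing run, where the waiter still sees the container; {{containersSeenAfterWait(false)}} models the run where the callbacks complete first and the set is already empty.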

To reproduce this case, add a sleep of ~10ms before 
[internal::run(copy->onAnyCallbacks, 
*this)|https://github.com/apache/mesos/blob/e5f1115739758163d95110960dd829e65cbbebff/3rdparty/libprocess/include/process/future.hpp#L1537]
 and remove the sleep from [process::internal::await 
|https://github.com/apache/mesos/blob/e5f1115739758163d95110960dd829e65cbbebff/3rdparty/libprocess/include/process/gtest.hpp#L92].

> Multiple tests leave orphan containers.
> ---------------------------------------
>
>                 Key: MESOS-7506
>                 URL: https://issues.apache.org/jira/browse/MESOS-7506
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization
>         Environment: Ubuntu 16.04
> Fedora 23
> other Linux distros
>            Reporter: Alexander Rukletsov
>            Assignee: Andrei Budnik
>              Labels: containerizer, flaky-test, mesosphere
>         Attachments: KillMultipleTasks-badrun.txt, 
> ResourceLimitation-badrun.txt, TaskWithFileURI-badrun.txt
>
>
> I've observed a number of flaky tests that leave orphan containers upon 
> cleanup. A typical log looks like this:
> {noformat}
> ../../src/tests/cluster.cpp:580: Failure
> Value of: containers->empty()
>   Actual: false
> Expected: true
> Failed to destroy containers: { da3e8aa8-98e7-4e72-a8fd-5d0bae960014 }
> {noformat}
> All currently affected tests:
> {noformat}
> ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.KillTask/0
> ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.TaskWithFileURI/0
> ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.ResourceLimitation/0
> ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.KillMultipleTasks/0
> SlaveRecoveryTest/0.RecoverUnregisteredExecutor
> SlaveRecoveryTest/0.CleanupExecutor
> SlaveRecoveryTest/0.RecoverTerminatedExecutor
> SlaveTest.ShutdownUnregisteredExecutor
> ShutdownUnregisteredExecutor
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
