[jira] [Commented] (MESOS-6180) Several tests are flaky, with futures timing out early

Greg Mann (JIRA) Fri, 16 Sep 2016 09:04:46 -0700

    [ https://issues.apache.org/jira/browse/MESOS-6180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15496662#comment-15496662 ]


Greg Mann commented on MESOS-6180:
----------------------------------

Thanks for the patches, [~haosd...@gmail.com]!! I'll review and do some testing 
this morning.

Regarding the interleaving: for example, in the log posted in MESOS-6164 we 
find the line:
{code}
Checkpointing framework pid 
'scheduler-26d5bb2d-7233-4725-9755-169f84aee769@172.30.2.23:32968' to 
'/mnt/teamcity/temp/buildTmp/SlaveRecoveryTest_0_RecoverStatusUpdateManager_w0ToCt/meta/slaves/d22b6309-24c3-422f-a501-a672e7c3e046-S0/frameworks/d22b6309-24c3-422f-a501-a672e7c3e046-0000/framework.pid'
{code}
The temporary directory name in that path indicates that this output belongs 
to {{SlaveRecoveryTest.RecoverStatusUpdateManager}}. I think the output of 
{{SlaveRecoveryTest.ReconnectHTTPExecutor}} begins much later, at the line: 
{{I0915 02:57:42.981866 24202 cluster.cpp:157] Creating default 'local' authorizer}}.
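
As a rough illustration of this kind of attribution, here is a minimal Python 
sketch that maps a log line to a test by the temporary directory it mentions. 
The directory layout ({{<TestCase>_<Index>_<TestName>_<6-character suffix>}}) 
is only inferred from the single path quoted above and is an assumption, not a 
documented convention; {{SANDBOX_RE}} and {{attribute_line}} are hypothetical 
names used for illustration.

```python
import re

# Assumption: test temp directories look like
#   <TestCase>_<Index>_<TestName>_<6 random alphanumerics>
# as in the path quoted above. This is inferred from one example only.
SANDBOX_RE = re.compile(
    r"/(?P<case>[A-Za-z0-9]+)_(?P<index>\d+)_(?P<name>\w+)_(?P<suffix>[A-Za-z0-9]{6})/"
)


def attribute_line(line):
    """Return 'TestCase/Index.TestName' if the line mentions a test
    temp directory matching the assumed layout, otherwise None."""
    m = SANDBOX_RE.search(line)
    if m is None:
        return None
    return f"{m['case']}/{m['index']}.{m['name']}"


if __name__ == "__main__":
    line = (
        "Checkpointing framework pid "
        "'scheduler-26d5bb2d-7233-4725-9755-169f84aee769@172.30.2.23:32968' to "
        "'/mnt/teamcity/temp/buildTmp/"
        "SlaveRecoveryTest_0_RecoverStatusUpdateManager_w0ToCt/meta/slaves/"
        "d22b6309-24c3-422f-a501-a672e7c3e046-S0/frameworks/"
        "d22b6309-24c3-422f-a501-a672e7c3e046-0000/framework.pid'"
    )
    print(attribute_line(line))
```

Lines that do not mention such a path (for example the {{cluster.cpp:157}} 
line) return {{None}}, so they still have to be attributed by their position 
in the log.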

> Several tests are flaky, with futures timing out early
> ------------------------------------------------------
>
>                 Key: MESOS-6180
>                 URL: https://issues.apache.org/jira/browse/MESOS-6180
>             Project: Mesos
>          Issue Type: Bug
>          Components: tests
>            Reporter: Greg Mann
>            Assignee: haosdent
>              Labels: mesosphere, tests
>         Attachments: CGROUPS_ROOT_PidNamespaceBackward.log, 
> CGROUPS_ROOT_PidNamespaceForward.log, FetchAndStoreAndStoreAndFetch.log
>
>
> Following the merging of a large patch chain, it was noticed on our internal 
> CI that several tests had become flaky, with a similar pattern in the 
> failures: the tests fail early when a future times out. Often, this occurs 
> when a test cluster is being spun up and one of the offer futures times out. 
> This has been observed in the following tests:
> * MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PidNamespaceForward
> * MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PidNamespaceBackward
> * ZooKeeperStateTest.FetchAndStoreAndStoreAndFetch
> * RoleTest.ImplicitRoleRegister
> * SlaveRecoveryTest/0.MultipleFrameworks
> * SlaveRecoveryTest/0.ReconcileShutdownFramework
> * SlaveTest.ContainerizerUsageFailure
> * MesosSchedulerDriverTest.ExplicitAcknowledgements
> * SlaveRecoveryTest/0.ReconnectHTTPExecutor (MESOS-6164)
> * ResourceOffersTest.ResourcesGetReofferedAfterTaskInfoError (MESOS-6165)
> * SlaveTest.CommandTaskWithKillPolicy (MESOS-6166)
> The JIRA tickets noted in parentheses above track a few of these failures 
> individually.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
