[ https://issues.apache.org/jira/browse/MESOS-6180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15496662#comment-15496662 ]
Greg Mann commented on MESOS-6180:
----------------------------------

Thanks for the patches, [~haosd...@gmail.com]!! I'll review and do some testing this morning.

Regarding the interleaving: for example, in the log posted in MESOS-6164 we find the line:
{code}
Checkpointing framework pid 'scheduler-26d5bb2d-7233-4725-9755-169f84aee769@172.30.2.23:32968' to '/mnt/teamcity/temp/buildTmp/SlaveRecoveryTest_0_RecoverStatusUpdateManager_w0ToCt/meta/slaves/d22b6309-24c3-422f-a501-a672e7c3e046-S0/frameworks/d22b6309-24c3-422f-a501-a672e7c3e046-0000/framework.pid'
{code}
which indicates that this output can be attributed to {{SlaveRecoveryTest.RecoverStatusUpdateManager}}. I think {{SlaveRecoveryTest.ReconnectHTTPExecutor}} begins much later with the line:
{{I0915 02:57:42.981866 24202 cluster.cpp:157] Creating default 'local' authorizer}}.

> Several tests are flaky, with futures timing out early
> ------------------------------------------------------
>
>                 Key: MESOS-6180
>                 URL: https://issues.apache.org/jira/browse/MESOS-6180
>             Project: Mesos
>          Issue Type: Bug
>          Components: tests
>            Reporter: Greg Mann
>            Assignee: haosdent
>              Labels: mesosphere, tests
>         Attachments: CGROUPS_ROOT_PidNamespaceBackward.log,
> CGROUPS_ROOT_PidNamespaceForward.log, FetchAndStoreAndStoreAndFetch.log
>
>
> Following the merging of a large patch chain, it was noticed on our internal
> CI that several tests had become flaky, with a similar pattern in the
> failures: the tests fail early when a future times out. Often, this occurs
> when a test cluster is being spun up and one of the offer futures times out.
> This has been observed in the following tests:
> * MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PidNamespaceForward
> * MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PidNamespaceBackward
> * ZooKeeperStateTest.FetchAndStoreAndStoreAndFetch
> * RoleTest.ImplicitRoleRegister
> * SlaveRecoveryTest/0.MultipleFrameworks
> * SlaveRecoveryTest/0.ReconcileShutdownFramework
> * SlaveTest.ContainerizerUsageFailure
> * MesosSchedulerDriverTest.ExplicitAcknowledgements
> * SlaveRecoveryTest/0.ReconnectHTTPExecutor (MESOS-6164)
> * ResourceOffersTest.ResourcesGetReofferedAfterTaskInfoError (MESOS-6165)
> * SlaveTest.CommandTaskWithKillPolicy (MESOS-6166)
> See the linked JIRAs noted above for individual tickets addressing a couple
> of these.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)