[jira] [Commented] (MESOS-8336) MasterTest.RegistryUpdateAfterReconfiguration is flaky

2018-02-22 Thread Benno Evers (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16373066#comment-16373066
 ] 

Benno Evers commented on MESOS-8336:


https://reviews.apache.org/r/65758/

> MasterTest.RegistryUpdateAfterReconfiguration is flaky
> --
>
> Key: MESOS-8336
> URL: https://issues.apache.org/jira/browse/MESOS-8336
> Project: Mesos
> Issue Type: Bug
> Reporter: Benno Evers
> Priority: Major
> Labels: flaky-test
> Attachments: RegistryUpdateAfterReconfiguration-badrun.txt
>
>
> Observed here: 
> https://jenkins.mesosphere.com/service/jenkins/job/mesos/job/Mesos_CI-build/2399/FLAG=CMake,label=mesos-ec2-debian-8/testReport/junit/mesos-ec2-debian-8-CMake.Mesos/MasterTest/RegistryUpdateAfterReconfiguration/
> The test here failed because the registry contained 2 slaves when it should 
> have contained only one.
> Looking through the log, everything seems normal (in particular, only 1 slave 
> ID appears throughout this test). The only thing out of the ordinary seems to 
> be the agent sending two `RegisterSlaveMessage`s and two 
> `ReregisterSlaveMessage`s, but judging by the code that generates the random 
> backoff in the slave, that seems to be more or less normal and 
> shouldn't break the test.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8336) MasterTest.RegistryUpdateAfterReconfiguration is flaky

2018-02-21 Thread Benno Evers (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16371439#comment-16371439
 ] 

Benno Evers commented on MESOS-8336:


The root cause here is a very familiar one that has already made countless 
other tests flaky. In particular, I'm talking about this line in `slave.cpp`:
{noformat}
    // Wait for a random amount of time before authentication or
    // registration.
    Duration duration =
      flags.registration_backoff_factor * ((double) os::random() / RAND_MAX);
{noformat}
Here, the agent sends the retried `RegisterSlaveMessage` after only 9ms, 
*just* before shutting down, and the master notices that the network link is 
down before it gets to processing the message.

As a result, the master assigns a second slave ID, then almost immediately 
removes that slave again because the network link is broken, and the test 
finally sees the remnants of this second slave in the registry.
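
To make the timing concrete, here is a minimal standalone sketch of the same computation. It is not the Mesos code: `backoffDelay` and its parameters are hypothetical names, `std::chrono` stands in for `Duration`, and the random draw is passed in explicitly (in place of `os::random()`) so that the result is reproducible.

```cpp
#include <cassert>
#include <chrono>
#include <cstdlib>

// Hypothetical stand-in for the expression quoted above: scales a backoff
// factor by a random fraction in [0, 1]. `randomValue` plays the role of
// os::random() and is assumed to lie in [0, RAND_MAX].
std::chrono::milliseconds backoffDelay(
    std::chrono::milliseconds backoffFactor,
    long randomValue)
{
  assert(randomValue >= 0 && randomValue <= RAND_MAX);

  // Same shape as `(double) os::random() / RAND_MAX` in the snippet.
  const double fraction = static_cast<double>(randomValue) / RAND_MAX;

  // Multiplying by a double yields a floating-point duration; the cast
  // truncates it back to whole milliseconds.
  return std::chrono::duration_cast<std::chrono::milliseconds>(
      backoffFactor * fraction);
}
```

The delay can fall anywhere between zero and the full backoff factor, so a small draw yields a delay of only a few milliseconds. That is how a retry can fire right before the agent shuts down in one run and never fire at all in another.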

 



[jira] [Commented] (MESOS-8336) MasterTest.RegistryUpdateAfterReconfiguration is flaky

2018-01-26 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16340939#comment-16340939
 ] 

Alexander Rukletsov commented on MESOS-8336:


Disabled the test for now.
