Repository: mesos
Updated Branches:
  refs/heads/master 6ea47691d -> 4dae1f8d9

Fixed flakiness in OneWayPartitionTest.MasterToSlave.

The test did not pause the clock. This allowed the following sequence
of events to occur, with low probability:

(1) Agent sends register message M1 to master.
(2) Agent register timer expires, sends register message M2 to master.
(3) Master sees M1 and adds agent with ID A1.
(4) Agent gets SlaveRegisteredMessage with ID A1.
(5) Test case injects `exited` event for agent; master marks agent as
    disconnected.
(6) Master sees M2; since the agent is currently disconnected, the
    master removes A1 and adds the agent with ID A2.
(7) Agent gets SlaveRegisteredMessage with ID A2. Since this is
    unexpected, it exits ("Registered but got wrong id").

This commit fixes the test case to pause the clock; this prevents the
second registration attempt in step (2) above.

The scenario described above might occur in an actual Mesos deployment,
albeit with very low probability. This would result in a Mesos agent
shutting down immediately after initial registration. MESOS-7596 has
been created to track this issue.

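
The seven steps above can be replayed with a small toy model. This is
only a sketch and not Mesos code: ToyMaster, ToyAgent, replay() and the
agent address are hypothetical names, and the rule that a duplicate
register message from a still-connected agent is answered with the
existing ID is an assumption made for illustration. The other rules
follow steps (3) to (7) as described.

```cpp
#include <map>
#include <string>

// Hypothetical toy model of the registration handshake; not Mesos code.
struct ToyMaster {
  struct Entry {
    std::string id;
    bool connected;
  };

  std::map<std::string, Entry> agents;  // Keyed by agent address.
  int nextId = 1;

  // Handles a register message and returns the ID sent back to the agent.
  std::string handleRegister(const std::string& address) {
    auto it = agents.find(address);
    if (it != agents.end()) {
      if (it->second.connected) {
        // Assumption: a duplicate from a connected agent keeps its ID.
        return it->second.id;
      }
      // Step (6): a disconnected agent is removed and then added again.
      agents.erase(it);
    }
    const std::string id = "A" + std::to_string(nextId++);
    agents[address] = Entry{id, true};
    return id;
  }

  // Step (5): an `exited` event marks the agent as disconnected.
  void handleExited(const std::string& address) {
    auto it = agents.find(address);
    if (it != agents.end()) {
      it->second.connected = false;
    }
  }
};

struct ToyAgent {
  std::string id;       // Empty until the first registration succeeds.
  bool running = true;

  // Step (7): a second, different ID is unexpected, so the agent exits.
  void handleRegistered(const std::string& assigned) {
    if (!id.empty() && id != assigned) {
      running = false;  // "Registered but got wrong id"
      return;
    }
    id = assigned;
  }
};

// Replays the scenario. Returns true if the agent is still running.
bool replay(bool retryTimerFires) {
  ToyMaster master;
  ToyAgent agent;
  const std::string address = "agent@127.0.0.1:5051";  // Made-up address.

  // Steps (1) and (2): M1 is always in flight; M2 only if the timer fired.
  const int messagesInFlight = retryTimerFires ? 2 : 1;

  // Steps (3) and (4): the master handles M1 and the agent gets its ID.
  agent.handleRegistered(master.handleRegister(address));

  // Step (5): the test injects the `exited` event.
  master.handleExited(address);

  // Steps (6) and (7): the delayed M2, if any, arrives after the event.
  if (messagesInFlight == 2) {
    agent.handleRegistered(master.handleRegister(address));
  }
  return agent.running;
}
```

In this model, replay(true) ends with the agent stopped and
replay(false) leaves it running, which matches the intent of the fix:
with the clock paused, the retry timer in step (2) cannot fire on its
own.
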
Review: https://reviews.apache.org/r/59685

Project: http://git-wip-us.apache.org/repos/asf/mesos/repo
Commit: http://git-wip-us.apache.org/repos/asf/mesos/commit/4dae1f8d
Tree: http://git-wip-us.apache.org/repos/asf/mesos/tree/4dae1f8d
Diff: http://git-wip-us.apache.org/repos/asf/mesos/diff/4dae1f8d

Branch: refs/heads/master
Commit: 4dae1f8d95b776850c84fbcd8427fe9072b7bc14
Parents: 6ea4769
Author: Neil Conway <neil.con...@gmail.com>
Authored: Tue May 30 15:52:57 2017 -0700
Committer: Neil Conway <neil.con...@gmail.com>
Committed: Wed May 31 11:04:55 2017 -0700

----------------------------------------------------------------------
 src/tests/partition_tests.cpp | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/mesos/blob/4dae1f8d/src/tests/partition_tests.cpp
----------------------------------------------------------------------
diff --git a/src/tests/partition_tests.cpp b/src/tests/partition_tests.cpp
index 4ff4285..62a84f7 100644
--- a/src/tests/partition_tests.cpp
+++ b/src/tests/partition_tests.cpp
@@ -3416,6 +3416,10 @@ class OneWayPartitionTest : public MesosTest {};
 // will re-register with the master.
 TEST_F_TEMP_DISABLED_ON_WINDOWS(OneWayPartitionTest, MasterToSlave)
 {
+  // Pausing the clock ensures that the agent does not attempt to
+  // register multiple times (see MESOS-7596 for context).
+  Clock::pause();
+
   // Start a master.
   master::Flags masterFlags = CreateMasterFlags();
   Try<Owned<cluster::Master>> master = StartMaster(masterFlags);
@@ -3432,6 +3436,7 @@ TEST_F_TEMP_DISABLED_ON_WINDOWS(OneWayPartitionTest, MasterToSlave)
   Try<Owned<cluster::Slave>> slave = StartSlave(detector.get(), agentFlags);
   ASSERT_SOME(slave);
 
+  Clock::advance(agentFlags.registration_backoff_factor);
   AWAIT_READY(slaveRegisteredMessage);
 
   AWAIT_READY(ping);
@@ -3440,8 +3445,6 @@ TEST_F_TEMP_DISABLED_ON_WINDOWS(OneWayPartitionTest, MasterToSlave)
   Future<Nothing> deactivateSlave =
     FUTURE_DISPATCH(_, &MesosAllocatorProcess::deactivateSlave);
 
-  Clock::pause();
-
   // Inject a slave exited event at the master causing the master
   // to mark the slave as disconnected. The slave should not notice
   // until it receives the next ping message.
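
The Clock::pause() / Clock::advance() pairing in the diff can be
illustrated with a self-contained sketch. It does not use libprocess:
ManualClock and attemptsAfterAdvance() are hypothetical names, and the
retry behaviour (each attempt schedules the next one after a fixed,
positive interval) is a simplifying assumption. The point is only that
with a manually driven clock, timers fire when the test advances time
and not otherwise.

```cpp
#include <chrono>
#include <functional>
#include <map>
#include <utility>

// Hypothetical manually driven clock; time moves only in advance().
class ManualClock {
 public:
  using Duration = std::chrono::milliseconds;

  // Registers a callback to run once `delay` has elapsed on this clock.
  void schedule(Duration delay, std::function<void()> callback) {
    timers_.emplace(now_ + delay, std::move(callback));
  }

  // Moves time forward by `delta`, firing due timers in deadline order.
  void advance(Duration delta) {
    const Duration target = now_ + delta;
    while (!timers_.empty() && timers_.begin()->first <= target) {
      now_ = timers_.begin()->first;  // Step to the timer's deadline.
      std::function<void()> callback = std::move(timers_.begin()->second);
      timers_.erase(timers_.begin());
      callback();  // May schedule further timers.
    }
    now_ = target;
  }

 private:
  Duration now_{0};
  std::multimap<Duration, std::function<void()>> timers_;
};

// Counts registration attempts made after a single advance of the clock.
// Assumption: the first attempt waits `initialBackoff`, and every attempt
// schedules a retry after `retryInterval` (which must be positive).
int attemptsAfterAdvance(
    ManualClock::Duration initialBackoff,
    ManualClock::Duration retryInterval,
    ManualClock::Duration advanceBy) {
  ManualClock clock;
  int attempts = 0;

  std::function<void()> attempt = [&]() {
    ++attempts;
    clock.schedule(retryInterval, attempt);
  };

  clock.schedule(initialBackoff, attempt);
  clock.advance(advanceBy);
  return attempts;
}
```

Advancing by exactly the initial backoff yields one attempt; the retry
stays pending until the test advances the clock again. That is
presumably why the diff advances by registration_backoff_factor once
and then waits for the registration message.
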