> On Dec. 20, 2016, 4:41 p.m., Neil Conway wrote: > > Can we add an expectation for exactly when we expect to receive the offer? > > i.e., another `WillOnce(FutureSatisfy(...))` and then wait for the future > > at the appropriate time. > > > > The thing I don't like about the `WillRepeatedly(...)` pattern is that it > > obscures whether additional offers are actually expected or not. Lots of > > tests have copied this pattern and use it, even when another offer is not > > actually expected. > > Benjamin Bannier wrote: > This test only uses a single offer, and we do not know when or even if a > second offer would arrive since the test does not run with a paused clock, > https://github.com/apache/mesos/blob/1e72605e9892eb4e518442ab9c1fe2a1a1696748/src/tests/fault_tolerance_tests.cpp#L853. > If execution on the thread running in the test body is delayed sufficiently > after the clock is resumed, additional offers might be made concurrently with > that part of the test body. > > It looks to me that as a general rule of thumb, if tests do not pause the > clock, one should always ignore subsequent calls to mock interfaces in order > to avoid this gmock assertion. > > Neil Conway wrote: > Right, although there are plenty of other situations in which _arbitrary_ > delays in tests will cause failures (e.g., exceeding the registry read/write > timeouts, or even causing "sleep 60" dummy tasks to exit). So I'm not sure > that this problem is unique to making offers. > > It would be good to explore pausing the clock by default for all tests > (MESOS-4101). But in the short term, what about making the default allocation > interval much higher for tests? Say 60 seconds. Test code should generally > not be depending on batch allocations anyway (because it needlessly slows > down the test), so this would make such poorly written tests easier to find. > > Benjamin Bannier wrote: > Yes, that makes sense. > > Would you mind taking this over? You already started fixing this test (it > wasn't pretty before, but also not flaky then), but we learnt in the meantime > and it would be great if we would fix this test completely. Like I already > mentioned to you offline, it would be a good idea to also think a little > harder about your approach to comparing doubles > (https://github.com/apache/mesos/blob/d2117362349ab4c383045720f77d42b2d9fd6871/src/tests/fault_tolerance_tests.cpp#L881-L888), > see https://issues.apache.org/jira/browse/MESOS-4695 for the kind of > failures this pattern can cause.
I think the current fix is fine for now. I'll take a look at the floating point comparison stuff and perhaps changing the `allocation_interval` later. - Neil ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/54889/#review159727 ----------------------------------------------------------- On Dec. 20, 2016, 10:15 p.m., Benjamin Bannier wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/54889/ > ----------------------------------------------------------- > > (Updated Dec. 20, 2016, 10:15 p.m.) > > > Review request for mesos, Alexander Rukletsov and Neil Conway. > > > Bugs: MESOS-6820 > https://issues.apache.org/jira/browse/MESOS-6820 > > > Repository: mesos > > > Description > ------- > > The test FaultToleranceTest.FrameworkReregister observes a single > offer in order to monitor progress and as a trigger to initiate other > actions later in the test. The change installs expectations for > possible additional offers. This allows for the thread running the > test body to be delayed with respect to the master thread which could > in the meantime make additional offers. > > > Diffs > ----- > > src/tests/fault_tolerance_tests.cpp > 05937a917a2c175aa53b52488febb7cfd8400a13 > > Diff: https://reviews.apache.org/r/54889/diff/ > > > Testing > ------- > > `make check` (OS X, clang-trunk, w/ optimizations, SSL) > > Before this fix `FaultToleranceTest.FrameworkReregister` failed for me > multiple times in ~10k iterations; after this patch it didn't fail in a > single run I stopped after 170k iterations. > > > Thanks, > > Benjamin Bannier > >