[ https://issues.apache.org/jira/browse/MESOS-6907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15824092#comment-15824092 ]
Alexander Rojas commented on MESOS-6907:
----------------------------------------

I verified that my theory was correct: timers are executed in [{{void process::timedout()}}|https://github.com/apache/mesos/blob/77ddbb62dd2ab4faaa22de8355f4766e7bbe0f2d/3rdparty/libprocess/src/process.cpp#L739]. Moreover, {{process::timedout()}} is not executed in any libprocess worker thread, but in the libevent loop ([here|https://github.com/apache/mesos/blob/77ddbb62dd2ab4faaa22de8355f4766e7bbe0f2d/3rdparty/libprocess/src/process.cpp#L898], [here|https://github.com/apache/mesos/blob/77ddbb62dd2ab4faaa22de8355f4766e7bbe0f2d/3rdparty/libprocess/src/clock.cpp#L206] and [here|https://github.com/apache/mesos/blob/77ddbb62dd2ab4faaa22de8355f4766e7bbe0f2d/3rdparty/libprocess/src/clock.cpp#L133]).

As a consequence, timers are executed in batches, and the timers belonging to a batch are only destroyed once the whole batch has run; this is the cause of the flakiness. It can be solved by forcing a second batch to run (batches always run on the same thread): create a second timer and manipulate the {{Clock}} so that the second timer is scheduled in a later batch, then wait for that timer's thunk to execute. I proposed a patch which does just that: [r/55576/|https://reviews.apache.org/r/55576/]: Fixes FutureTest.After3 flakiness.

> FutureTest.After3 is flaky
> --------------------------
>
>                 Key: MESOS-6907
>                 URL: https://issues.apache.org/jira/browse/MESOS-6907
>             Project: Mesos
>          Issue Type: Bug
>          Components: libprocess
>            Reporter: Alexander Rojas
>
> There is apparently a race condition between the time an instance of
> {{Future<T>}} goes out of scope and when the enclosed data is actually
> deleted, if {{Future<T>::after(Duration, lambda::function<Future<T>(const
> Future<T>&)>)}} is called.
>
> The issue is more likely to occur if the machine is under load or if it is
> not a very powerful one.
> The easiest way to reproduce it is to run:
> {code}
> $ stress -c 4 -t 2600 -d 2 -i 2 &
> $ ./libprocess-tests --gtest_filter="FutureTest.After3" --gtest_repeat=-1 --gtest_break_on_failure
> {code}
>
> An exploratory fix for the issue is to change the test to:
> {code}
> TEST(FutureTest, After3)
> {
>   Future<Nothing> future;
>   process::WeakFuture<Nothing> weak_future(future);
>
>   EXPECT_SOME(weak_future.get());
>
>   {
>     Clock::pause();
>
>     // The original future disappears here. After this call the
>     // original future goes out of scope and should not be reachable
>     // anymore.
>     future = future
>       .after(Milliseconds(1), [](Future<Nothing> f) {
>         f.discard();
>         return Nothing();
>       });
>
>     Clock::advance(Seconds(2));
>     Clock::settle();
>
>     AWAIT_READY(future);
>   }
>
>   if (weak_future.get().isSome()) {
>     os::sleep(Seconds(1));
>   }
>
>   EXPECT_NONE(weak_future.get());
>   EXPECT_FALSE(future.hasDiscard());
> }
> {code}
>
> The interesting thing about the fix is that both extra snippets are needed
> (either one alone is not enough) to prevent the issue from happening.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)