[ https://issues.apache.org/jira/browse/MESOS-6907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15822216#comment-15822216 ]
Alexander Rojas commented on MESOS-6907:
----------------------------------------

From the behavior of the tests, and from the snippets that _fix_ the issue, someone must be keeping a reference to {{future.data}} for longer than expected. The known references are kept in copies of the future: [one in the caller|https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/tests/future_tests.cpp#L275-L279], which is destroyed by the call to {{after()}}. The [other copy of the future|https://github.com/apache/mesos/blob/master/3rdparty/libprocess/include/process/future.hpp#L1481] is kept in a {{Timer}} instance. However, copies of this timer are moved around. One copy of the timer is controlled by the {{future}} itself and is stored in the vector of [{{onAny()}} callbacks|https://github.com/apache/mesos/blob/master/3rdparty/libprocess/include/process/future.hpp#L1483]. This copy of the timer is destroyed when the [timer expires|https://github.com/apache/mesos/blob/master/3rdparty/libprocess/include/process/future.hpp#L1345] or when the original [future is satisfied|https://github.com/apache/mesos/blob/master/3rdparty/libprocess/include/process/future.hpp#L1371-L1372] (which doesn't happen in this test). There is at least one more known copy of the {{timer}}, which is kept by the [{{Clock}} itself|https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/clock.cpp#L281-L294]. In addition, libprocess itself gets involved in managing the lifetime of the timers through a [callback function|https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/clock.cpp#L70-L71] which is set when [libprocess starts|https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/process.cpp#L1069]. My theory is that libprocess keeps these callbacks around for longer than expected, but so far I haven't been able to prove it.
However, I think it is perfectly normal that this behavior occurs, and the test probably needs to be updated (this last paragraph is just a conjecture at this point).

> FutureTest.After3 is flaky
> --------------------------
>
>                 Key: MESOS-6907
>                 URL: https://issues.apache.org/jira/browse/MESOS-6907
>             Project: Mesos
>          Issue Type: Bug
>          Components: libprocess
>            Reporter: Alexander Rojas
>
> There is apparently a race condition between the time an instance of
> {{Future<T>}} goes out of scope and when the enclosed data is actually
> deleted, if {{Future<T>::after(Duration, lambda::function<Future<T>(const Future<T>&)>)}}
> is called.
>
> The issue is more likely to occur if the machine is under load or if it is
> not a very powerful one. The easiest way to reproduce it is to run:
> {code}
> $ stress -c 4 -t 2600 -d 2 -i 2 &
> $ ./libprocess-tests --gtest_filter="FutureTest.After3" --gtest_repeat=-1 --gtest_break_on_failure
> {code}
> An exploratory fix for the issue is to change the test to:
> {code}
> TEST(FutureTest, After3)
> {
>   Future<Nothing> future;
>   process::WeakFuture<Nothing> weak_future(future);
>   EXPECT_SOME(weak_future.get());
>
>   {
>     Clock::pause();
>
>     // The original future disappears here. After this call the
>     // original future goes out of scope and should not be reachable
>     // anymore.
>     future = future
>       .after(Milliseconds(1), [](Future<Nothing> f) {
>         f.discard();
>         return Nothing();
>       });
>
>     Clock::advance(Seconds(2));
>     Clock::settle();
>
>     AWAIT_READY(future);
>   }
>
>   if (weak_future.get().isSome()) {
>     os::sleep(Seconds(1));
>   }
>
>   EXPECT_NONE(weak_future.get());
>   EXPECT_FALSE(future.hasDiscard());
> }
> {code}
> The interesting thing about the fix is that both extra snippets are needed
> (either one alone is not enough) to prevent the issue from happening.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)