[ https://issues.apache.org/jira/browse/MESOS-6907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15822216#comment-15822216 ]

Alexander Rojas commented on MESOS-6907:
----------------------------------------

From the behavior of the tests, and the snippets that _fix_ the issue, there 
must be someone keeping a reference to {{future.data}} around for longer than 
expected. The known references are kept in copies of the future: [one in the 
caller|https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/tests/future_tests.cpp#L275-L279], 
which is destroyed with the call to {{after()}}. The [other copy of the 
future|https://github.com/apache/mesos/blob/master/3rdparty/libprocess/include/process/future.hpp#L1481] 
is kept in a {{Timer}} instance. However, copies of this timer are moved 
around. One copy of the timer is controlled by the {{future}} itself and is 
stored in the vector of [{{onAny()}} 
callbacks|https://github.com/apache/mesos/blob/master/3rdparty/libprocess/include/process/future.hpp#L1483]. 
This copy of the timer is destroyed when the [timer 
expires|https://github.com/apache/mesos/blob/master/3rdparty/libprocess/include/process/future.hpp#L1345] 
or when the original [future is 
satisfied|https://github.com/apache/mesos/blob/master/3rdparty/libprocess/include/process/future.hpp#L1371-L1372] 
(which doesn't happen in this test).
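
To make that shared-ownership chain concrete, here is a minimal standalone sketch. It uses plain {{std::shared_ptr}}/{{std::weak_ptr}} instead of the real libprocess types, so it is only an analogy for how a callback that captured a copy of the future keeps {{future.data}} alive after the caller's copy is gone (the {{weak_ptr}} plays the role of {{WeakFuture}} in the test):

{code}
#include <cassert>
#include <functional>
#include <memory>
#include <vector>

// Standalone analogy (not libprocess code): a copy of the future captured by
// a timer callback keeps the shared state alive, much like a std::shared_ptr
// captured in a std::function does.
int main()
{
  std::vector<std::function<void()>> pendingTimers;

  std::weak_ptr<int> observer;  // plays the role of WeakFuture in the test

  {
    auto data = std::make_shared<int>(42);  // plays the role of future.data
    observer = data;

    // after() effectively does this: store a callback that owns a copy.
    pendingTimers.push_back([data]() { /* expiration logic would run here */ });
  }

  // The caller's handle is gone, but the stored callback keeps data alive.
  assert(!observer.expired());

  // Only once every stored copy is destroyed does the data go away.
  pendingTimers.clear();
  assert(observer.expired());

  return 0;
}
{code}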

There is at least one more known copy of the {{timer}}, which is kept by the 
[{{Clock}} 
itself|https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/clock.cpp#L281-L294]. 
However, libprocess itself gets involved in managing the lifetime of the 
timers through a [callback 
function|https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/clock.cpp#L70-L71] 
which is set when [libprocess is 
starting|https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/process.cpp#L1069].
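
Assuming the {{Clock}} really does hold that extra reference, the question becomes which owner is the last one to let go. Here is a second standalone sketch of the two-owner situation, again with standard library types rather than the real {{Timer}}/{{Clock}}; the point is simply that clearing one owner's list is not enough:

{code}
#include <cassert>
#include <functional>
#include <list>
#include <memory>

// Standalone analogy (not libprocess code): the same timer is referenced both
// from the future's onAny() list and from the clock's registry, so the shared
// data is only released after *both* owners drop their copies.
struct FakeTimer
{
  std::function<void()> thunk;  // owns a copy of the "future"
};

int main()
{
  std::weak_ptr<int> observer;

  std::list<std::shared_ptr<FakeTimer>> onAnyCallbacks;  // held by the future
  std::list<std::shared_ptr<FakeTimer>> clockTimers;     // held by the "Clock"

  {
    auto data = std::make_shared<int>(42);  // plays the role of future.data
    observer = data;

    auto timer = std::make_shared<FakeTimer>();
    timer->thunk = [data]() { /* expiration logic would run here */ };

    onAnyCallbacks.push_back(timer);
    clockTimers.push_back(timer);
  }

  onAnyCallbacks.clear();       // the future drops its copy of the timer...
  assert(!observer.expired());  // ...but the clock's copy still pins the data.

  clockTimers.clear();          // only now is the last owner gone.
  assert(observer.expired());

  return 0;
}
{code}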
 

My theory is that libprocess is keeping these callbacks for longer than 
expected, but so far I haven't been able to prove it. However, I think it is 
perfectly normal that this behavior occurs, and the test probably needs to be 
updated (this last paragraph is just a conjecture at this point).

> FutureTest.After3 is flaky
> --------------------------
>
>                 Key: MESOS-6907
>                 URL: https://issues.apache.org/jira/browse/MESOS-6907
>             Project: Mesos
>          Issue Type: Bug
>          Components: libprocess
>            Reporter: Alexander Rojas
>
> There is apparently a race condition between the time an instance of 
> {{Future<T>}} goes out of scope and when the enclosed data is actually 
> deleted, if {{Future<T>::after(Duration, lambda::function<Future<T>(const 
> Future<T>&)>)}} is called.
> The issue is more likely to occur if the machine is under load or if it is 
> not a very powerful one. The easiest way to reproduce it is to run:
> {code}
> $ stress -c 4 -t 2600 -d 2 -i 2 &
> $ ./libprocess-tests --gtest_filter="FutureTest.After3" --gtest_repeat=-1 --gtest_break_on_failure
> {code}
> An exploratory fix for the issue is to change the test to:
> {code}
> TEST(FutureTest, After3)
> {
>   Future<Nothing> future;
>   process::WeakFuture<Nothing> weak_future(future);
>   EXPECT_SOME(weak_future.get());
>   {
>     Clock::pause();
>     // The original future disappears here. After this call the
>     // original future goes out of scope and should not be reachable
>     // anymore.
>     future = future
>       .after(Milliseconds(1), [](Future<Nothing> f) {
>         f.discard();
>         return Nothing();
>       });
>     Clock::advance(Seconds(2));
>     Clock::settle();
>     AWAIT_READY(future);
>   }
>   if (weak_future.get().isSome()) {
>     os::sleep(Seconds(1));
>   }
>   EXPECT_NONE(weak_future.get());
>   EXPECT_FALSE(future.hasDiscard());
> }
> {code}
> The interesting thing about the fix is that both extra snippets are needed 
> (either one alone is not enough) to prevent the issue from happening.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
