> On Sept. 15, 2014, 2:17 p.m., Jie Yu wrote:
> > This more looks like a bug somewhere. Let's try to find out the true root 
> > cause first.
> 
> Niklas Nielsen wrote:
>     True - but throwing the result away from await() seems silly/unsafe too 
> (rather on relying on the await to have completed or wait forever). Our 
> investigation ended in 
> https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/latch.cpp#L58
>  (as explained in the RR description).
>     We haven't been able to realiably reproduce the fault to debug it - 
> Connor, can you fill in on when it happened in Marathon?
> 
> Connor Doyle wrote:
>     Sure, will try to reproduce with more verbose logging on a fresh cluster 
> tomorrow.
> 
> Ben Mahler wrote:
>     The semantics of Future::await() are such that if no Duration is 
> provided, then it will never return false:
>     
>     ```
>       f.await(); // Always returns true.
>       // Now the future should not be pending!
>     ```
>     
>     It should only return false as the result of a timeout.
>     
>     That's why Jie is saying we need to understand the bug, it's not even 
> clear whether await returned false (which should not occur) vs. returned true 
> but the CHECK_READY still saw the future as pending.

Connor, do you want to close this issue for now and continue the investigation 
in the JIRA ticket? We can reopen if this is indeed the issue.

Thanks


- Niklas


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25614/#review53409
-----------------------------------------------------------


On Sept. 15, 2014, 1:54 p.m., Connor Doyle wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/25614/
> -----------------------------------------------------------
> 
> (Updated Sept. 15, 2014, 1:54 p.m.)
> 
> 
> Review request for mesos and Niklas Nielsen.
> 
> 
> Bugs: MESOS-1795
>     https://issues.apache.org/jira/browse/MESOS-1795
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> ## OVERVIEW
> 
> This change adds a measure of safety for calls to `Future<T>::await()`
> without a timeout from within the JNI interface to the state
> abstraction.
> 
> ## MOTIVATION
> 
> Observed the following log output prior to a crash of the Marathon scheduler:
> 
> ```
> Sep 12 23:46:01 highly-available-457-540 marathon[11494]: F0912 
> 23:46:01.771927 11532 org_apache_mesos_state_AbstractState.cpp:145] 
> CHECK_READY(*future): is PENDING 
> Sep 12 23:46:01 highly-available-457-540 marathon[11494]: *** Check failure 
> stack trace: ***
> Sep 12 23:46:01 highly-available-457-540 marathon[11494]:     @     
> 0x7febc2663a2d  google::LogMessage::Fail()
> Sep 12 23:46:01 highly-available-457-540 marathon[11494]:     @     
> 0x7febc26657e3  google::LogMessage::SendToLog()
> Sep 12 23:46:01 highly-available-457-540 marathon[11494]:     @     
> 0x7febc2663648  google::LogMessage::Flush()
> Sep 12 23:46:01 highly-available-457-540 marathon[11494]:     @     
> 0x7febc266603e  google::LogMessageFatal::~LogMessageFatal()
> Sep 12 23:46:01 highly-available-457-540 marathon[11494]:     @     
> 0x7febc26588a3  Java_org_apache_mesos_state_AbstractState__1_1fetch_1get
> Sep 12 23:46:01 highly-available-457-540 marathon[11494]:     @     
> 0x7febcd107d98  (unknown)
> ```
> _Listing 1: Crash log output._
> 
> ## CHANGES IN DETAIL
> 
> The in-line comments in `Latch::await(const Duration& duration)`, mention
> that it's possible for the call to return before the underlying process
> has completed for two reasons:
> 
> ```
> ...
> process::wait(pid, duration); // Explict to disambiguate.
> // It's possible that we failed to wait because:
> //   (1) Our process has already terminated.
> //   (2) We timed out (i.e., duration was not "infinite").
> ...
> ```
> _Listing 2: Excerpt from 3rdparty/src/libprocess/latch.cpp._
> 
> Call sites within `org_apache_mesos_state_AbstractState.cpp`
> that formerly discarded the return value now throw a
> `java.concurrent.ExecutionException` if the await fails.  I chose to throw
> this instead of a `java.concurrent.TimeoutException` since the expectation
> from the caller's perspective is to block forever pending a ready state.
> 
> 
> Diffs
> -----
> 
>   src/java/jni/org_apache_mesos_state_AbstractState.cpp 1accc8a 
> 
> Diff: https://reviews.apache.org/r/25614/diff/
> 
> 
> Testing
> -------
> 
> - make
> - make check
> 
> 
> Thanks,
> 
> Connor Doyle
> 
>

Reply via email to