> On Sept. 15, 2014, 9:17 p.m., Jie Yu wrote: > > This more looks like a bug somewhere. Let's try to find out the true root > > cause first. > > Niklas Nielsen wrote: > True - but throwing the result away from await() seems silly/unsafe too > (rather on relying on the await to have completed or wait forever). Our > investigation ended in > https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/latch.cpp#L58 > (as explained in the RR description). > We haven't been able to realiably reproduce the fault to debug it - > Connor, can you fill in on when it happened in Marathon? > > Connor Doyle wrote: > Sure, will try to reproduce with more verbose logging on a fresh cluster > tomorrow. > > Ben Mahler wrote: > The semantics of Future::await() are such that if no Duration is > provided, then it will never return false: > > ``` > f.await(); // Always returns true. > // Now the future should not be pending! > ``` > > It should only return false as the result of a timeout. > > That's why Jie is saying we need to understand the bug, it's not even > clear whether await returned false (which should not occur) vs. returned true > but the CHECK_READY still saw the future as pending. > > Niklas Nielsen wrote: > Connor, do you want to close this issue for now and continue the > investigation in the JIRA ticket? We can reopen if this is indeed the issue. > > Thanks
Yes, that seems appropriate as this patch seems unlikely to be accepted in its current state. - Connor ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/25614/#review53409 ----------------------------------------------------------- On Sept. 15, 2014, 8:54 p.m., Connor Doyle wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/25614/ > ----------------------------------------------------------- > > (Updated Sept. 15, 2014, 8:54 p.m.) > > > Review request for mesos and Niklas Nielsen. > > > Bugs: MESOS-1795 > https://issues.apache.org/jira/browse/MESOS-1795 > > > Repository: mesos-git > > > Description > ------- > > ## OVERVIEW > > This change adds a measure of safety for calls to `Future<T>::await()` > without a timeout from within the JNI interface to the state > abstraction. > > ## MOTIVATION > > Observed the following log output prior to a crash of the Marathon scheduler: > > ``` > Sep 12 23:46:01 highly-available-457-540 marathon[11494]: F0912 > 23:46:01.771927 11532 org_apache_mesos_state_AbstractState.cpp:145] > CHECK_READY(*future): is PENDING > Sep 12 23:46:01 highly-available-457-540 marathon[11494]: *** Check failure > stack trace: *** > Sep 12 23:46:01 highly-available-457-540 marathon[11494]: @ > 0x7febc2663a2d google::LogMessage::Fail() > Sep 12 23:46:01 highly-available-457-540 marathon[11494]: @ > 0x7febc26657e3 google::LogMessage::SendToLog() > Sep 12 23:46:01 highly-available-457-540 marathon[11494]: @ > 0x7febc2663648 google::LogMessage::Flush() > Sep 12 23:46:01 highly-available-457-540 marathon[11494]: @ > 0x7febc266603e google::LogMessageFatal::~LogMessageFatal() > Sep 12 23:46:01 highly-available-457-540 marathon[11494]: @ > 0x7febc26588a3 Java_org_apache_mesos_state_AbstractState__1_1fetch_1get > Sep 12 23:46:01 highly-available-457-540 marathon[11494]: @ > 0x7febcd107d98 (unknown) > ``` > _Listing 1: Crash log output._ > > ## CHANGES IN DETAIL > > The in-line comments in `Latch::await(const Duration& duration)`, mention > that it's possible for the call to return before the underlying process > has completed for two reasons: > > ``` > ... > process::wait(pid, duration); // Explict to disambiguate. > // It's possible that we failed to wait because: > // (1) Our process has already terminated. > // (2) We timed out (i.e., duration was not "infinite"). > ... > ``` > _Listing 2: Excerpt from 3rdparty/src/libprocess/latch.cpp._ > > Call sites within `org_apache_mesos_state_AbstractState.cpp` > that formerly discarded the return value now throw a > `java.concurrent.ExecutionException` if the await fails. I chose to throw > this instead of a `java.concurrent.TimeoutException` since the expectation > from the caller's perspective is to block forever pending a ready state. > > > Diffs > ----- > > src/java/jni/org_apache_mesos_state_AbstractState.cpp 1accc8a > > Diff: https://reviews.apache.org/r/25614/diff/ > > > Testing > ------- > > - make > - make check > > > Thanks, > > Connor Doyle > >