-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25911/#review54479
-----------------------------------------------------------

Bad patch!

Reviews applied: [25967, 25911]

Failed command: ./support/mesos-style.py

Error:
Checking 509 files using filter --filter=-,+build/class,+build/deprecated,+build/endif_comment,+readability/todo,+readability/namespace,+runtime/vlog,+whitespace/blank_line,+whitespace/comma,+whitespace/end_of_line,+whitespace/ending_newline,+whitespace/forcolon,+whitespace/indent,+whitespace/line_length,+whitespace/tab,+whitespace/todo
src/slave/status_update_manager.cpp:189: Lines should be <= 80 characters long [whitespace/line_length] [2]
src/slave/status_update_manager.hpp:299: Redundant blank line at the start of a code block should be deleted. [whitespace/blank_line] [2]
Total errors found: 2

- Mesos ReviewBot


On Sept. 24, 2014, 10:04 p.m., Niklas Nielsen wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/25911/
> -----------------------------------------------------------
>
> (Updated Sept. 24, 2014, 10:04 p.m.)
>
>
> Review request for mesos and Ben Mahler.
>
>
> Bugs: MESOS-1817
>     https://issues.apache.org/jira/browse/MESOS-1817
>
>
> Repository: mesos-git
>
>
> Description
> -------
>
> We have run into a problem that causes tasks which complete while a
> framework is disconnected (and has a failover timeout) to remain in a
> running state even though the tasks have actually finished. This hogs
> the cluster and gives users an inconsistent view of the cluster state.
>
> The problem turns out to be an issue with the ack-cycle of status
> updates: if the framework disconnects (with a failover timeout set), the
> status update manager on the slave will keep trying to send the front of
> the status update stream to the master (which in turn forwards it to the
> framework). If the first status update after the disconnect is terminal,
> things work out fine; the master picks the terminal state up, removes
> the task and releases the resources. If, on the other hand, a
> non-terminal status update sits ahead of it in the stream, the master
> will not learn that the task finished (or failed) until the framework
> reconnects.
>
> As a first pass, this patch makes the status update manager inform the
> master if a terminal state was found in the pending stream of a task.
> If so, the master will recover the resources but will still wait for the
> updates to arrive before updating the task state and statuses.
>
>
> Diffs
> -----
>
>   src/master/master.hpp f5d74ae
>   src/master/master.cpp e5d30e9
>   src/messages/messages.proto 7cb3ce6
>   src/slave/status_update_manager.hpp 24e3882
>   src/slave/status_update_manager.cpp 5d5cf23
>   src/tests/fault_tolerance_tests.cpp 1543860
>
> Diff: https://reviews.apache.org/r/25911/diff/
>
>
> Testing
> -------
>
> Added a new test:
> FaultToleranceTest.RecoverResourcesDuringSchedulerDisconnect, which
> exercises the new code path.
>
> make check
>
>
> Thanks,
>
> Niklas Nielsen
>
>
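
For readers following the description above, here is a minimal sketch of the ack-cycle it refers to. It is not the code in the patch: StatusUpdateStream, hasPendingTerminal, the TaskState enum and isTerminal are hypothetical names chosen for illustration. The sketch assumes only what the description states, namely that the front of the stream is resent until acknowledged and that a terminal update can sit behind a non-terminal one.

```cpp
#include <cassert>
#include <deque>

// Hypothetical task states; only the terminal/non-terminal split matters.
enum class TaskState { STAGING, RUNNING, FINISHED, FAILED, KILLED, LOST };

bool isTerminal(TaskState state) {
  return state == TaskState::FINISHED || state == TaskState::FAILED ||
         state == TaskState::KILLED || state == TaskState::LOST;
}

// Hypothetical model of one task's pending status updates on the slave.
class StatusUpdateStream {
public:
  // A new update is appended behind any unacknowledged ones.
  void enqueue(TaskState state) { pending.push_back(state); }

  bool empty() const { return pending.empty(); }

  // The update that keeps being resent until it is acknowledged.
  TaskState front() const {
    assert(!pending.empty());
    return pending.front();
  }

  // An acknowledgement advances the stream by exactly one update.
  void acknowledge() {
    assert(!pending.empty());
    pending.pop_front();
  }

  // True if any unacknowledged update (not only the front) is terminal.
  // This is the extra fact the description says is reported to the master.
  bool hasPendingTerminal() const {
    for (TaskState state : pending) {
      if (isTerminal(state)) {
        return true;
      }
    }
    return false;
  }

private:
  std::deque<TaskState> pending;
};
```

With RUNNING queued ahead of FINISHED and no acknowledgements arriving, front() stays at RUNNING, so a master that only looks at the forwarded update keeps the task in a running state. In this sketch hasPendingTerminal() is what would let the master recover the resources early.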