-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25911/
-----------------------------------------------------------

(Updated Sept. 24, 2014, 3:04 p.m.)


Review request for mesos and Ben Mahler.


Changes
-------

Based on r25967 refactor


Bugs: MESOS-1817
    https://issues.apache.org/jira/browse/MESOS-1817


Repository: mesos-git


Description
-------

We have run into a problem where tasks that complete while their
framework is disconnected (and has a failover timeout set) remain in a
running state even though they have actually finished. This hogs the
cluster and gives users an inconsistent view of the cluster state.

The problem turns out to be an issue with the ack cycle of status
updates: if the framework disconnects (with a failover timeout set), the
status update manager on the slaves will keep trying to send the front of
the status update stream to the master (which in turn forwards it to the
framework). If the first status update after the disconnect is terminal,
things work out fine; the master picks up the terminal state, removes
the task and releases the resources. If, on the other hand, a
non-terminal status update is ahead of it in the stream, the master will
not learn that the task finished (or failed) until the framework
reconnects.

As a first pass, this patch makes the status update manager inform the
master if a terminal state is found in a task's pending stream.
If so, the master will recover the resources but will still wait for the
updates to arrive before updating the task state and statuses.
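
To illustrate the idea, here is a minimal, self-contained sketch of the
check described above. It is not the code in this patch: State,
isTerminal and PendingStream are hypothetical names invented for the
example, and the set of terminal states is an assumption loosely modeled
on Mesos' task states. Only the front of the stream is resent until it
is acknowledged, so a terminal update queued behind a non-terminal one
is invisible to the master unless the stream is scanned.

```cpp
#include <algorithm>
#include <cassert>
#include <deque>

// Hypothetical task states (assumption: loosely modeled on Mesos).
enum class State { STAGING, RUNNING, FINISHED, FAILED, KILLED, LOST };

// Assumption: these four states are the terminal ones.
bool isTerminal(State s) {
  return s == State::FINISHED || s == State::FAILED ||
         s == State::KILLED || s == State::LOST;
}

// Hypothetical per-task stream of unacknowledged updates, oldest first.
struct PendingStream {
  std::deque<State> pending;

  // The only update that is (re)sent until it gets acknowledged.
  State front() const {
    assert(!pending.empty());
    return pending.front();
  }

  // True if any queued update, not just the front, is terminal.
  bool hasTerminal() const {
    return std::any_of(pending.begin(), pending.end(), isTerminal);
  }
};
```

With a stream holding RUNNING followed by FINISHED, front() is
non-terminal while hasTerminal() is true; that is the case in which the
slave would tell the master that the resources can be recovered early.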


Diffs (updated)
-----

  src/master/master.hpp f5d74ae 
  src/master/master.cpp e5d30e9 
  src/messages/messages.proto 7cb3ce6 
  src/slave/status_update_manager.hpp 24e3882 
  src/slave/status_update_manager.cpp 5d5cf23 
  src/tests/fault_tolerance_tests.cpp 1543860 

Diff: https://reviews.apache.org/r/25911/diff/


Testing
-------

Added a new test, FaultToleranceTest.RecoverResourcesDuringSchedulerDisconnect,
which exercises the new code path.

make check


Thanks,

Niklas Nielsen
