----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/61473/#review186580 -----------------------------------------------------------
src/master/master.hpp Line 849 (original), 849 (patched) <https://reviews.apache.org/r/61473/#comment263220> Undo this change. src/master/master.hpp Line 2467 (original), 2464 (patched) <https://reviews.apache.org/r/61473/#comment263221> Undo this change. src/master/master.cpp Line 6552 (original) <https://reviews.apache.org/r/61473/#comment263296> `protobuf::isTerminalState` still defines `TASK_LOST` as a terminal state. However we are now refining it as non-terminal, so that seems inconsistent. I didn't see any discussion about why frameworks that are not partition-aware would be prepared to handle a transition from `TASK_LOST` to `TASK_RUNNING`. src/master/master.cpp Line 7226 (original), 7182 (patched) <https://reviews.apache.org/r/61473/#comment263294> You can move the `StatusUpdate` change into this block: ``` } else { StatusUpdate _update(update); // Comment explaining why we are doing this ... if (!framework->capabilities.partitionAware) { _update.mutable_status()->set_state(TASK_LOST); } forward(...); } ``` src/master/master.cpp Lines 9002 (patched) <https://reviews.apache.org/r/61473/#comment263293> This does nothing? src/master/master.cpp Line 9064 (original), 9021 (patched) <https://reviews.apache.org/r/61473/#comment263225> Undo this change. src/master/master.cpp Lines 9476 (patched) <https://reviews.apache.org/r/61473/#comment263295> This doesn't seem right to me. Even if the framework is not partition-aware, the master is still tracking these tasks as unreachable. I can't see a reason why we would want to fudge the metrics in this case. - James Peach On Sept. 24, 2017, 10:46 p.m., Megha Sharma wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/61473/ > ----------------------------------------------------------- > > (Updated Sept. 24, 2017, 10:46 p.m.) > > > Review request for mesos, James Peach, Vinod Kone, and Jiang Yan Xu. > > > Bugs: MESOS-7215 > https://issues.apache.org/jira/browse/MESOS-7215 > > > Repository: mesos > > > Description > ------- > > Master will not kill the tasks for non-Partition aware frameworks > when an unreachable agent re-registers with the master. > Master used to send a ShutdownFrameworkMessages to the agent > to kill the tasks from non partition aware frameworks including the > ones that are still registered which was problematic because the offer > from this agent could still go to the same framework which could then > launch new tasks. The agent would then receive tasks of the same > framework and ignore them because it thinks the framework is shutting > down. The framework is not shutting down of course, so from the master > and the scheduler’s perspective the task is pending in STAGING forever > until the next agent reregistration, which could happen much later. > This commit fixes the problem by not shutting down the non-partition > aware frameworks on such an agent. > > > Diffs > ----- > > src/master/http.cpp 28d0393fb5962df4d731521265efd81a54e1e655 > src/master/master.hpp 05f88111afb4fa0e2baf57106e1479914c16a113 > src/master/master.cpp 6d84a26bff970b842b58dfb69dbf232ba5c16a20 > src/tests/partition_tests.cpp 0886f4890ac3fec6f38146946892769a99c3e68f > > > Diff: https://reviews.apache.org/r/61473/diff/7/ > > > Testing > ------- > > make check > > > Thanks, > > Megha Sharma > >