> On July 15, 2017, 4:31 p.m., David McLaughlin wrote:
> > With the new code-path for mark unreachable after failover, this change
> > introduced a non-backwards compatible change - namely that TASK_LOST
> > messages for each task on the agent are no longer sent when the slaveLost
> > message is sent. This means that frameworks (like Aurora) no longer get the
> > signal to schedule replacements for those tasks until they reconcile. Given
> > that the tasks will be marked as LOST as soon as the agent reregisters
> > anyway, seems like it's easy to maintain backwards compatibility here.

There is a discussion about this on the mailing list. Would you mind
incorporating your feedback there?


- Vinod


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/50705/#review180639
-----------------------------------------------------------


On Sept. 12, 2016, 10:05 a.m., Neil Conway wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/50705/
> -----------------------------------------------------------
>
> (Updated Sept. 12, 2016, 10:05 a.m.)
>
>
> Review request for mesos and Vinod Kone.
>
>
> Bugs: MESOS-4049
>     https://issues.apache.org/jira/browse/MESOS-4049
>
>
> Repository: mesos
>
>
> Description
> -------
>
> The previous behavior was to shut down partitioned agents that attempt to
> reregister---unless the master has failed over, in which case the
> reregistration is allowed (when running in "non-strict" mode).
>
> The new behavior is always to allow partitioned agents to reregister.
> This is part of a longer-term project to allow frameworks to define
> their own policies for handling tasks running on partitioned agents.
>
> In particular, if a framework has the PARTITION_AWARE capability, any
> tasks running on the partitioned agent will continue to run after
> reregistration. If the framework is not PARTITION_AWARE, any tasks that
> were running on such an agent will be killed after the agent reregisters
> (unless the master has failed over). This is for backward compatibility
> with the previous ("non-strict") behavior. Note that regardless of the
> PARTITION_AWARE capability, the agent will not be shut down, which is a
> change from the previous Mesos behavior.
>
> This commit also changes the master so that if an agent is removed and
> then the master receives a message from that agent, the master will no
> longer attempt to shut down the agent. This is consistent with the goal
> of getting the master out of the business of shutting down agents that
> we suspect are unhealthy. Such an agent will eventually realize it is
> not registered with the master (e.g., because it won't receive any pings
> from the master), which will cause it to reregister.
>
>
> Diffs
> -----
>
>   src/master/master.hpp 4992ab0a0bb5babbf6a4fa3e6eff3577590fc879
>   src/master/master.cpp 1dcce6cd66804990af238176c61aca03bb5c9471
>   src/tests/master_tests.cpp 6cde15fcd6ca8ec40438c75aed980e83f8de9b86
>   src/tests/partition_tests.cpp f3142ad8d50daafcdb70ad9dbb2772f8ba30db00
>
>
> Diff: https://reviews.apache.org/r/50705/diff/10/
>
>
> Testing
> -------
>
> make check
>
>
> Thanks,
>
> Neil Conway
>
>
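
For readers following the thread, the reregistration policy in the review description can be summarized as a small decision function. This is only a sketch of one reading of that description, not the code in src/master/master.cpp: `TaskAction`, `ReregistrationContext`, `onPartitionedAgentReregistered`, and `shutdownAgentOnReregistration` are hypothetical names made up for illustration, and treating "unless the master has failed over" as "the tasks keep running" is an assumption.

```cpp
// Hypothetical sketch; none of these names come from the Mesos source tree.

// What happens to a task on a partitioned agent once the agent reregisters.
enum class TaskAction { KeepRunning, Kill };

struct ReregistrationContext {
  bool frameworkPartitionAware;  // framework has the PARTITION_AWARE capability
  bool masterFailedOver;         // master failed over after the partition
};

// Per-task decision, following the description's wording:
//   - PARTITION_AWARE frameworks keep their tasks;
//   - other frameworks have their tasks killed, unless the master failed
//     over (assumed here to mean the tasks keep running, matching the old
//     "non-strict" behavior).
TaskAction onPartitionedAgentReregistered(const ReregistrationContext& ctx) {
  if (ctx.frameworkPartitionAware) {
    return TaskAction::KeepRunning;
  }
  if (ctx.masterFailedOver) {
    return TaskAction::KeepRunning;
  }
  return TaskAction::Kill;
}

// Under the new behavior the agent itself is never shut down on
// reregistration, regardless of framework capabilities.
constexpr bool shutdownAgentOnReregistration() { return false; }
```

The decision is per task rather than per agent because tasks from PARTITION_AWARE and non-PARTITION_AWARE frameworks can share one agent. The sketch does not model the status updates David raises above (TASK_LOST at the time slaveLost is sent).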