Re: Review Request 50705: Changed master to allow partitioned slaves to reregister.

Vinod Kone Mon, 17 Jul 2017 10:53:07 -0700


> On July 15, 2017, 4:31 p.m., David McLaughlin wrote:
> > With the new code-path for mark unreachable after failover, this change 
> > introduced a non-backwards compatible change - namely that TASK_LOST 
> > messages for each task on the agent are no longer sent when the slaveLost 
> > message is sent. This means that frameworks (like Aurora) no longer get the 
> > signal to schedule replacements for those tasks until they reconcile. Given 
> > that the tasks will be marked as LOST as soon as the agent reregisters 
> > anyway, seems like it's easy to maintain backwards compatibility here.


There is a discussion about this on the mailing list. Would you mind 
incorporating your feedback there?


- Vinod


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/50705/#review180639
-----------------------------------------------------------


On Sept. 12, 2016, 10:05 a.m., Neil Conway wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/50705/
> -----------------------------------------------------------
> 
> (Updated Sept. 12, 2016, 10:05 a.m.)
> 
> 
> Review request for mesos and Vinod Kone.
> 
> 
> Bugs: MESOS-4049
>     https://issues.apache.org/jira/browse/MESOS-4049
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> The previous behavior was to shut down partitioned agents that attempt to
> reregister---unless the master has failed over, in which case the
> reregistration is allowed (when running in "non-strict" mode).
> 
> The new behavior is always to allow partitioned agents to reregister.
> This is part of a longer-term project to allow frameworks to define
> their own policies for handling tasks running on partitioned agents.
> 
> In particular, if a framework has the PARTITION_AWARE capability, any
> tasks running on the partitioned agent will continue to run after
> reregistration. If the framework is not PARTITION_AWARE, any tasks that
> were running on such an agent will be killed after the agent reregisters
> (unless the master has failed over). This is for backward compatibility
> with the previous ("non-strict") behavior. Note that regardless of the
> PARTITION_AWARE capability, the agent will not be shut down, which is a
> change from the previous Mesos behavior.
> 
> This commit also changes the master so that if an agent is removed and
> then the master receives a message from that agent, the master will no
> longer attempt to shut down the agent. This is consistent with the goal
> of getting the master out of the business of shutting down agents that
> we suspect are unhealthy. Such an agent will eventually realize it is
> not registered with the master (e.g., because it won't receive any pings
> from the master), which will cause it to reregister.
> 
> 
> Diffs
> -----
> 
>   src/master/master.hpp 4992ab0a0bb5babbf6a4fa3e6eff3577590fc879 
>   src/master/master.cpp 1dcce6cd66804990af238176c61aca03bb5c9471 
>   src/tests/master_tests.cpp 6cde15fcd6ca8ec40438c75aed980e83f8de9b86 
>   src/tests/partition_tests.cpp f3142ad8d50daafcdb70ad9dbb2772f8ba30db00 
> 
> 
> Diff: https://reviews.apache.org/r/50705/diff/10/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Neil Conway
> 
>
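
For readers skimming the description quoted above, the reregistration
policy it describes can be sketched roughly as follows. This is only an
illustration of the stated rules, not code from the patch: the types
Task, TaskAction and ReregisterDecision and the function
onPartitionedAgentReregister are hypothetical names invented here, and
the real logic lives in src/master/master.cpp.

```cpp
#include <string>
#include <vector>

// Hypothetical types for illustration only; these are NOT the data
// structures used by the Mesos master.
enum class TaskAction { KeepRunning, Kill };

struct Task {
  std::string id;
  // True if the owning framework registered with the PARTITION_AWARE
  // capability.
  bool frameworkPartitionAware;
};

struct ReregisterDecision {
  bool shutdownAgent;               // Always false under the new behavior.
  std::vector<TaskAction> actions;  // One entry per task, in input order.
};

// Sketch of the policy in the description: a partitioned agent is always
// allowed to reregister and is never shut down. Tasks of PARTITION_AWARE
// frameworks keep running; tasks of other frameworks are killed unless
// the master has failed over.
ReregisterDecision onPartitionedAgentReregister(
    const std::vector<Task>& tasks,
    bool masterFailedOver) {
  ReregisterDecision decision;
  decision.shutdownAgent = false;

  for (const Task& task : tasks) {
    const bool keep = task.frameworkPartitionAware || masterFailedOver;
    decision.actions.push_back(
        keep ? TaskAction::KeepRunning : TaskAction::Kill);
  }

  return decision;
}
```

The sketch deliberately leaves out the status updates sent to
frameworks, since the description does not spell them out.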
