[ 
https://issues.apache.org/jira/browse/MESOS-4048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15047385#comment-15047385
 ] 

Benjamin Mahler edited comment on MESOS-4048 at 12/8/15 8:06 PM:
-----------------------------------------------------------------

This ticket is independent of MESOS-4049 in that it discusses the current, 
inconsistent approaches to agent partition detection (cases 1 and 2 in the 
issue description).

When we were implementing master recovery, we wanted to use health checking to 
determine when an agent is unhealthy, but there were some implementation 
difficulties that led to the addition of {{\-\-slave_reregister_timeout}} 
instead. This approach is a bit scary because we may remove healthy agents that 
for some reason (e.g. ZK connectivity issues) could not re-register with the 
master after master failover. This was why we put in place some safety nets 
({{\-\-recovery_slave_removal_limit}}, and we were able to re-use the 
removal rate limiting).

The point of this ticket is to look into removing 
{{\-\-slave_reregister_timeout}} entirely and have the master perform the 
same health-check-based partition detection that it does in the steady state.

So, MESOS-4049 is about what we do *when* an agent is unhealthy. This ticket is 
about *how* we determine that an agent is unhealthy. Specifically, we want to 
determine it in a consistent way rather than having one approach in steady 
state and a different approach after master failover.

Make sense?


was (Author: bmahler):
This ticket is independent from MESOS-4049 in that it is discussing the current 
inconsistent approaches to agent partition handling (case 1 and 2 above).

When we were implementing master recovery, we wanted to use health checking to 
determine when an agent should be removed, but there were some implementation 
difficulties that led to the addition of {{--slave_reregister_timeout}} 
instead. This approach is a bit scary because we may remove healthy agents that 
for some reason (e.g. ZK connectivity issues) could not re-register with the 
master after master failover. This was why we put in place some safety nets 
({{--recovery_slave_removal_limit}}, and we were able to re-use the removal 
rate limiting).

The point of this ticket is to look into removing 
{{--slave_reregister_timeout}} entirely and have the master perform the same 
health-check-based partition detection that it does in the steady state.

So, MESOS-4049 is about what we do *when* an agent is unhealthy (e.g. 
partitioned). This ticket is about *how* we determine that an agent is 
unhealthy (e.g. partitioned). Specifically, we want to determine it in a 
consistent way rather than having one approach in steady state and a different 
approach after master failover.

Make sense?

> Consider unifying slave timeout behavior between steady state and master 
> failover
> ---------------------------------------------------------------------------------
>
>                 Key: MESOS-4048
>                 URL: https://issues.apache.org/jira/browse/MESOS-4048
>             Project: Mesos
>          Issue Type: Improvement
>          Components: master, slave
>            Reporter: Neil Conway
>            Assignee: Anindya Sinha
>            Priority: Minor
>              Labels: mesosphere
>
> Currently, there are two timeouts that control what happens when an agent is 
> partitioned from the master:
> 1. {{max_slave_ping_timeouts}} and {{slave_ping_timeout}} together control 
> how long the master waits before declaring a slave to be dead in the 
> "steady state".
> 2. {{slave_reregister_timeout}} controls how long the master waits for a 
> slave to reregister after master failover.
> It is unclear whether these two cases really merit being treated differently 
> -- it might be simpler for operators to configure a single timeout that 
> controls how long the master waits before declaring that a slave is dead.
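
To make the proposal concrete, here is a rough sketch of what a single, ping-based detection path could look like if it were used both in steady state and after master failover. This is not the Mesos implementation: {{PingHealthCheck}} and {{start_checks}} are hypothetical names, the two fields merely stand in for {{slave_ping_timeout}} and {{max_slave_ping_timeouts}}, and the values 15 seconds and 5 are illustrative. It also assumes the detection window is the ping timeout multiplied by the number of allowed consecutive misses.

```python
from dataclasses import dataclass


@dataclass
class PingHealthCheck:
    """Tracks consecutive missed pings for one agent (hypothetical helper)."""
    ping_timeout_secs: float   # stands in for slave_ping_timeout
    max_ping_timeouts: int     # stands in for max_slave_ping_timeouts
    missed: int = 0

    def detection_window_secs(self):
        # Assumed model: the agent is declared unhealthy after this many
        # seconds without a reply.
        return self.ping_timeout_secs * self.max_ping_timeouts

    def on_pong(self):
        # Any reply resets the count of consecutive misses.
        self.missed = 0

    def on_ping_timeout(self):
        # Returns True once the agent should be considered unhealthy.
        self.missed += 1
        return self.missed >= self.max_ping_timeouts


def start_checks(agent_ids, ping_timeout_secs, max_ping_timeouts):
    # The same check is created for newly registered agents and for agents
    # recovered after failover, so no separate re-registration timer exists.
    return {
        agent_id: PingHealthCheck(ping_timeout_secs, max_ping_timeouts)
        for agent_id in agent_ids
    }


# Illustrative values only.
checks = start_checks(["agent-1", "agent-2"], 15.0, 5)
window = checks["agent-1"].detection_window_secs()
```

With these illustrative values the window is 75 seconds for every agent, whether it was registered in steady state or recovered after failover, so an operator would have one number to reason about instead of two.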



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
