[ https://issues.apache.org/jira/browse/MESOS-4048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15047385#comment-15047385 ]
Benjamin Mahler edited comment on MESOS-4048 at 12/8/15 8:06 PM:
-----------------------------------------------------------------

This ticket is independent from MESOS-4049 in that it is discussing the current inconsistent approaches to agent partition detection (case 1 and 2 above).

When we were implementing master recovery, we wanted to use health checking to determine when an agent is unhealthy, but there were some implementation difficulties that led to the addition of {{--slave_reregistration_timer}} instead. This approach is a bit scary because we may remove healthy agents that for some reason (e.g. ZK connectivity issues) could not re-register with the master after master failover. This was why we put in place some safety nets ({{--recovery_slave_removal_limit}}, and we were able to re-use the removal rate limiting).

The point of this ticket is to look into removing {{--slave_reregistration_timer}} entirely and have the master perform the same health check based partition detection that it does in the steady state.

So, MESOS-4049 is about what we do *when* an agent is unhealthy. This ticket is about *how* we determine that an agent is unhealthy. Specifically, we want to determine it in a consistent way rather than having one approach in steady state and a different approach after master failover. Make sense?

was (Author: bmahler):
This ticket is independent from MESOS-4049 in that it is discussing the current inconsistent approaches to agent partition handling (case 1 and 2 above).

When we were implementing master recovery, we wanted to use health checking to determine when an agent should be removed, but there were some implementation difficulties that led to the addition of {{--slave_reregistration_timer}} instead. This approach is a bit scary because we may remove healthy agents that for some reason (e.g. ZK connectivity issues) could not re-register with the master after master failover.
This was why we put in place some safety nets ({{--recovery_slave_removal_limit}}, and we were able to re-use the removal rate limiting).

The point of this ticket is to look into removing {{--slave_reregistration_timer}} entirely and have the master perform the same health check based partition detection that it does in the steady state.

So, MESOS-4049 is about what we do *when* an agent is unhealthy (e.g. partitioned). This ticket is about *how* we determine that an agent is unhealthy (e.g. partitioned). Specifically, we want to determine it in a consistent way rather than having one approach in steady state and a different approach after master failover. Make sense?

> Consider unifying slave timeout behavior between steady state and master
> failover
> ---------------------------------------------------------------------------------
>
>                 Key: MESOS-4048
>                 URL: https://issues.apache.org/jira/browse/MESOS-4048
>             Project: Mesos
>          Issue Type: Improvement
>          Components: master, slave
>            Reporter: Neil Conway
>            Assignee: Anindya Sinha
>            Priority: Minor
>              Labels: mesosphere
>
> Currently, there are two timeouts that control what happens when an agent is
> partitioned from the master:
> 1. {{max_slave_ping_timeouts}} + {{slave_ping_timeout}} controls how long the
> master waits before declaring a slave to be dead in the "steady state"
> 2. {{slave_reregister_timeout}} controls how long the master waits for a
> slave to reregister after master failover.
> It is unclear whether these two cases really merit being treated differently
> -- it might be simpler for operators to configure a single timeout that
> controls how long the master waits before declaring that a slave is dead.
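
To make the gap between the two cases in the description concrete, here is a rough sketch of how the two windows compare. It assumes the steady-state window is the product of {{slave_ping_timeout}} and {{max_slave_ping_timeouts}} (one ping interval per tolerated miss), and the numeric values are illustrative assumptions modelled on what I believe the stock defaults to be; check your own master flags. The {{MasterTimeouts}} class and its method names are hypothetical, not part of Mesos.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class MasterTimeouts:
    """Hypothetical holder for the master flags named in the ticket."""

    # Assumed values for illustration only; verify against your deployment.
    slave_ping_timeout: timedelta = timedelta(seconds=15)
    max_slave_ping_timeouts: int = 5
    slave_reregister_timeout: timedelta = timedelta(minutes=10)

    def steady_state_window(self) -> timedelta:
        # Case 1: the agent is declared dead after this many consecutive
        # missed pings, each of which waits one ping timeout.
        return self.slave_ping_timeout * self.max_slave_ping_timeouts

    def failover_window(self) -> timedelta:
        # Case 2: after master failover, the agent has this long to
        # re-register before it is removed.
        return self.slave_reregister_timeout


timeouts = MasterTimeouts()
print("steady state:", timeouts.steady_state_window())
print("after failover:", timeouts.failover_window())
```

With these assumed values the two windows differ by a factor of eight, which is the kind of inconsistency a single unified timeout would remove.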