Ian Downes created MESOS-4092:
---------------------------------

             Summary: Try to re-establish connection on ping timeouts with agent before removing it
                 Key: MESOS-4092
                 URL: https://issues.apache.org/jira/browse/MESOS-4092
             Project: Mesos
          Issue Type: Improvement
          Components: master
    Affects Versions: 0.25.0
            Reporter: Ian Downes

The SlaveObserver triggers removal of an agent after {{flags.max_slave_ping_timeouts}} consecutive timeouts of {{flags.slave_ping_timeout}} each. This can happen because of a transient network failure, e.g., a gray failure in which a switch uplink exhibits heavy or total packet loss.

Some network architectures are designed to tolerate such gray failures by providing multiple paths between hosts. One way to implement this is equal-cost multi-path routing (ECMP), where flows are hashed by their 5-tuple onto one of several possible uplinks. In such networks, re-establishing a TCP connection will almost certainly use a new source port, so the new flow will likely be hashed to a different uplink. That avoids the failed uplink and restores connectivity with the agent.

After failing to receive pongs, the SlaveObserver should next try to re-establish a TCP connection (with exponential back-off) before declaring the agent lost. This can avoid significant disruption in cases where large numbers of agents reached through a single failed link would otherwise be removed unnecessarily, while still ensuring that agents that are truly lost are recognized as such.
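The proposed flow can be illustrated with a minimal Python sketch. It is not Mesos code: the function `observe_agent`, the callbacks `ping` and `reconnect`, and the parameters `max_reconnect_attempts`, `initial_backoff_secs`, and `backoff_factor` are all hypothetical names chosen for illustration, and the default values are arbitrary. Only the idea of counting ping timeouts before acting comes from the existing flags.

```python
import time
from typing import Callable, Iterator


def backoff_delays(initial: float, factor: float, attempts: int) -> Iterator[float]:
    """Yield `attempts` delays that grow geometrically from `initial`."""
    delay = initial
    for _ in range(attempts):
        yield delay
        delay *= factor


def observe_agent(
    ping: Callable[[], bool],
    reconnect: Callable[[], bool],
    max_ping_timeouts: int = 5,          # stands in for max_slave_ping_timeouts
    max_reconnect_attempts: int = 3,     # hypothetical, not an existing flag
    initial_backoff_secs: float = 1.0,   # hypothetical, not an existing flag
    backoff_factor: float = 2.0,         # hypothetical, not an existing flag
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Return "healthy", "reconnected", or "lost" for one observation round.

    `ping` returns True if a pong arrived within the ping timeout.
    `reconnect` opens a fresh TCP connection (and so, typically, a new
    source port) and returns True if the agent answered over it.
    """
    # Phase 1 (current behaviour): wait for a pong, up to the timeout limit.
    for _ in range(max_ping_timeouts):
        if ping():
            return "healthy"

    # Phase 2 (proposed): retry over new connections with exponential back-off.
    delays = backoff_delays(initial_backoff_secs, backoff_factor, max_reconnect_attempts)
    for delay in delays:
        sleep(delay)
        if reconnect():
            return "reconnected"

    # Phase 3: every reconnect attempt failed, so treat the agent as lost.
    return "lost"
```

With the defaults above, an agent that never answers is declared lost only after five missed pongs plus three reconnect attempts spaced 1, 2, and 4 seconds apart. An agent behind a failed uplink has a chance to come back as soon as one new connection is hashed onto a working path.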