[ https://issues.apache.org/jira/browse/MESOS-4092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15354210#comment-15354210 ]
Benjamin Mahler commented on MESOS-4092:
----------------------------------------

FYI [~idownes] as part of MESOS-5576, we added the ability to force a reconnection during link:
https://reviews.apache.org/r/49177/

> Try to re-establish connection on ping timeouts with agent before removing it
> -----------------------------------------------------------------------------
>
>                 Key: MESOS-4092
>                 URL: https://issues.apache.org/jira/browse/MESOS-4092
>             Project: Mesos
>          Issue Type: Improvement
>          Components: master
>    Affects Versions: 0.25.0
>            Reporter: Ian Downes
>
> The SlaveObserver will trigger an agent to be removed after
> {{flags.max_slave_ping_timeouts}} timeouts of {{flags.slave_ping_timeout}}.
> This can occur because of transient network failures, e.g., gray failures of
> a switch uplink exhibiting heavy or total packet loss. Some network
> architectures are designed to tolerate such gray failures and support
> multiple paths between hosts. This can be implemented with equal-cost
> multi-path routing (ECMP) where flows are hashed by their 5-tuple to multiple
> possible uplinks. In such networks re-establishing a TCP connection will
> almost certainly use a new source port and thus will likely be hashed to a
> different uplink, avoiding the failed uplink and re-establishing connectivity
> with the agent.
> After failing to receive pongs the SlaveObserver should next try to
> re-establish a TCP connection (with exponential back-off) before declaring
> the agent as lost. This can avoid significant disruption where large numbers
> of agents reached through a single failed link could be removed unnecessarily
> while still ensuring that agents that are truly lost are recognized as such.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
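
For illustration only, the behavior proposed in the issue description (retry with a fresh TCP connection, using exponential back-off, before declaring the agent lost) might be sketched roughly as follows. This is not Mesos or libprocess code. Every name here (pick_uplink, backoff_delays, observe, and their parameters) is hypothetical, and the hash is a toy stand-in for whatever a real ECMP switch computes.

```python
import hashlib


def pick_uplink(src_ip, src_port, dst_ip, dst_port, proto, num_uplinks):
    """Toy ECMP model: hash a flow's 5-tuple onto one of num_uplinks.

    Real switches use their own hash functions; SHA-256 is an assumption
    made only so the sketch is deterministic and self-contained.
    """
    key = f"{src_ip}:{src_port}-{dst_ip}:{dst_port}-{proto}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % num_uplinks


def backoff_delays(initial, factor, max_delay, attempts):
    """Return `attempts` exponentially growing delays, each capped at max_delay."""
    delays = []
    delay = initial
    for _ in range(attempts):
        delays.append(min(delay, max_delay))
        delay *= factor
    return delays


def observe(ping, reconnect, max_ping_timeouts, delays, sleep):
    """Hypothetical observer loop.

    ping() returns True if a pong arrived before the ping timeout.
    reconnect() opens a new TCP connection (new source port, so ECMP
    may hash the flow onto a different uplink).
    Returns True if the agent is reachable, False if it should be removed.
    """
    # Phase 1: the existing behavior, a fixed number of ping timeouts.
    for _ in range(max_ping_timeouts):
        if ping():
            return True
    # Phase 2: the proposed behavior, reconnect with back-off before giving up.
    for delay in delays:
        sleep(delay)
        reconnect()
        if ping():
            return True
    return False
```

In this sketch a truly lost agent is still removed, only later: after max_ping_timeouts failed pings plus the sum of the back-off delays. An agent behind a single failed uplink gets one chance per reconnect to be hashed onto a healthy path.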