[ https://issues.apache.org/jira/browse/MESOS-1668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dominic Hamon updated MESOS-1668:
---------------------------------
    Sprint: Mesos Q3 Sprint 5, Mesos Q3 Sprint 6  (was: Mesos Q3 Sprint 5)

> Handle a temporary one-way master --> slave socket closure.
> -----------------------------------------------------------
>
>                 Key: MESOS-1668
>                 URL: https://issues.apache.org/jira/browse/MESOS-1668
>             Project: Mesos
>          Issue Type: Bug
>          Components: master, slave
>            Reporter: Benjamin Mahler
>            Assignee: Vinod Kone
>            Priority: Minor
>              Labels: reliability
>
> In MESOS-1529, we realized that it's possible for a slave to remain
> disconnected in the master if the following occurs:
> → Master and slave are connected and operating normally.
> → Temporary one-way network failure: the master→slave link breaks.
> → Master marks the slave as disconnected.
> → Network is restored and health checking continues normally, so the slave
> is not removed. The slave does not attempt to re-register since it is
> receiving pings once again.
> → The slave remains disconnected according to the master, and the slave does
> not try to re-register. Bad!
> We were originally thinking of using a failover timeout in the master to
> remove these slaves that don't re-register. However, that can be dangerous
> when ZooKeeper issues are preventing the slave from re-registering with the
> master; we do not want to remove a ton of slaves in this situation.
> Rather, when the slave is health checking correctly but does not re-register
> within a timeout, we could send a registration request from the master to the
> slave, telling the slave that it must re-register. This message could also be
> used when receiving status updates (or other messages) from slaves that are
> disconnected in the master.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
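
The proposal in the description (the master asks a slave that is health
checking correctly, but still marked disconnected, to re-register) can be
sketched roughly as follows. This is a minimal illustration in Python, not
Mesos code: the class names, the REREGISTER_TIMEOUT value, and the
"REREGISTER_REQUEST" message name are all hypothetical and chosen only to make
the control flow concrete.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Hypothetical timeout; not a Mesos default.
REREGISTER_TIMEOUT = 10.0


@dataclass
class SlaveState:
    """The master's view of one slave (hypothetical structure)."""
    connected: bool = True
    healthy: bool = True  # True while the slave answers health-check pings
    disconnected_at: Optional[float] = None


class Master:
    def __init__(self) -> None:
        self.slaves: Dict[str, SlaveState] = {}
        # Messages the master "sent", recorded as (slave_id, message).
        self.sent: List[Tuple[str, str]] = []

    def add_slave(self, slave_id: str) -> None:
        self.slaves[slave_id] = SlaveState()

    def on_socket_closed(self, slave_id: str, now: float) -> None:
        # One-way link failure: the master marks the slave disconnected.
        state = self.slaves[slave_id]
        state.connected = False
        state.disconnected_at = now

    def on_reregister(self, slave_id: str) -> None:
        state = self.slaves[slave_id]
        state.connected = True
        state.disconnected_at = None

    def on_message_from(self, slave_id: str, now: float) -> None:
        # A status update (or other message) from a slave the master
        # considers disconnected also triggers a re-registration request.
        if not self.slaves[slave_id].connected:
            self._request_reregistration(slave_id, now)

    def tick(self, now: float) -> None:
        # Periodic check: healthy but disconnected for too long.
        for slave_id, state in self.slaves.items():
            if state.connected or not state.healthy:
                continue
            if now - state.disconnected_at >= REREGISTER_TIMEOUT:
                self._request_reregistration(slave_id, now)

    def _request_reregistration(self, slave_id: str, now: float) -> None:
        self.sent.append((slave_id, "REREGISTER_REQUEST"))
        # Restart the clock so the request is not repeated on every tick.
        self.slaves[slave_id].disconnected_at = now
```

Note that the slave is never removed here. Unhealthy slaves are skipped
entirely, which leaves their removal to the existing health checking, and a
slave that cannot re-register (for example because of ZooKeeper issues) is
simply asked again after the next timeout.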