Benjamin Mahler created MESOS-1668:
--------------------------------------

             Summary: Handle a temporary one-way master --> slave socket closure.
                 Key: MESOS-1668
                 URL: https://issues.apache.org/jira/browse/MESOS-1668
             Project: Mesos
          Issue Type: Bug
          Components: master, slave
            Reporter: Benjamin Mahler
            Priority: Minor

In MESOS-1529, we realized that it's possible for a slave to remain disconnected in the master if the following occurs:

→ Master and slave are connected and operating normally.
→ Temporary one-way network failure: the master→slave link breaks.
→ Master marks the slave as disconnected.
→ Network is restored and health checking continues normally, so the slave is not removed. The slave does not attempt to re-register, since it is receiving pings once again.
→ The slave remains disconnected according to the master, and the slave does not try to re-register. Bad!

We were originally thinking of using a failover timeout in the master to remove slaves that don't re-register. However, that can be dangerous when ZooKeeper issues are preventing slaves from re-registering with the master; we do not want to remove a ton of slaves in that situation.

Rather, when a slave is health checking correctly but does not re-register within a timeout, we could send a registration request from the master to the slave, telling the slave that it must re-register. This message could also be used when the master receives status updates (or other messages) from slaves that it considers disconnected.
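
To make the proposal concrete, here is a minimal Python sketch of the master-side bookkeeping. It is not Mesos code: the class names, the `reregister_timeout` and `health_window` parameters, and the `request_reregistration` method are all hypothetical, and the "message" is just recorded in a list rather than sent over the network. The sketch assumes the master tracks when it marked a slave disconnected and when that slave last answered a health check.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class SlaveState:
    connected: bool = True
    disconnected_at: Optional[float] = None  # when the master marked it disconnected
    last_pong: Optional[float] = None        # last successful health-check reply


class Master:
    """Hypothetical model of the master's view of its slaves."""

    def __init__(self, reregister_timeout: float, health_window: float):
        self.reregister_timeout = reregister_timeout  # assumed tunable
        self.health_window = health_window            # assumed tunable
        self.slaves: Dict[str, SlaveState] = {}
        self.requests: List[str] = []  # slave ids asked to re-register

    def add_slave(self, slave_id: str) -> None:
        self.slaves[slave_id] = SlaveState()

    def mark_disconnected(self, slave_id: str, now: float) -> None:
        state = self.slaves[slave_id]
        state.connected = False
        state.disconnected_at = now

    def pong(self, slave_id: str, now: float) -> None:
        self.slaves[slave_id].last_pong = now

    def reregistered(self, slave_id: str) -> None:
        state = self.slaves[slave_id]
        state.connected = True
        state.disconnected_at = None

    def request_reregistration(self, slave_id: str) -> None:
        # Stand-in for sending the proposed master -> slave message.
        self.requests.append(slave_id)

    def tick(self, now: float) -> None:
        """Ask healthy-but-disconnected slaves to re-register after a timeout."""
        for slave_id, state in self.slaves.items():
            if state.connected or state.disconnected_at is None:
                continue
            healthy = (state.last_pong is not None
                       and now - state.last_pong <= self.health_window)
            overdue = now - state.disconnected_at >= self.reregister_timeout
            if healthy and overdue:
                self.request_reregistration(slave_id)
                state.disconnected_at = now  # restart the timer, avoid spamming

    def on_slave_message(self, slave_id: str) -> None:
        """A status update (or other message) from a disconnected slave."""
        state = self.slaves.get(slave_id)
        if state is not None and not state.connected:
            self.request_reregistration(slave_id)
```

Note that a slave which is not answering health checks is never asked to re-register here, and nothing in the sketch removes a slave, so a ZooKeeper problem that blocks re-registration would only produce repeated requests, not mass removal.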