[jira] [Updated] (MESOS-1668) Handle a temporary one-way master -- slave socket closure.
[ https://issues.apache.org/jira/browse/MESOS-1668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dominic Hamon updated MESOS-1668:
---------------------------------
    Sprint: Mesos Q3 Sprint 5, Mesos Q3 Sprint 6 (was: Mesos Q3 Sprint 5)

Handle a temporary one-way master -- slave socket closure.
----------------------------------------------------------

    Key: MESOS-1668
    URL: https://issues.apache.org/jira/browse/MESOS-1668
    Project: Mesos
    Issue Type: Bug
    Components: master, slave
    Reporter: Benjamin Mahler
    Assignee: Vinod Kone
    Priority: Minor
    Labels: reliability

In MESOS-1529, we realized that it's possible for a slave to remain disconnected in the master if the following occurs:
→ Master and Slave connected operating normally.
→ Temporary one-way network failure, master→slave link breaks.
→ Master marks slave as disconnected.
→ Network restored and health checking continues normally, slave is not removed as a result. Slave does not attempt to re-register since it is receiving pings once again.
→ Slave remains disconnected according to the master, and the slave does not try to re-register. Bad!

We were originally thinking of using a failover timeout in the master to remove these slaves that don't re-register. However, it can be dangerous when ZooKeeper issues are preventing the slave from re-registering with the master; we do not want to remove a ton of slaves in this situation.

Rather, when the slave is health checking correctly but does not re-register within a timeout, we could send a registration request from the master to the slave, telling the slave that it must re-register. This message could also be used when receiving status updates (or other messages) from slaves that are disconnected in the master.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
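The scenario and the proposed master-initiated re-registration request can be illustrated with a small toy model. This is only a sketch in Python, not Mesos code: the `Master` and `Slave` classes, their method names, and the `reregister_timeout` value are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Slave:
    """Toy slave: it only re-registers when it is told to."""
    reregistrations: int = 0

    def on_ping(self) -> str:
        # Pings arrive again once the network heals, so the slave
        # sees no reason to re-register on its own.
        return "pong"

    def on_reregister_request(self, master: "Master") -> None:
        # Hypothetical message: the master tells the slave to re-register.
        self.reregistrations += 1
        master.on_reregister()


@dataclass
class Master:
    """Toy master tracking the connection state of a single slave."""
    reregister_timeout: float = 10.0  # seconds; assumed value
    connected: bool = True
    disconnected_at: Optional[float] = None

    def on_socket_closed(self, now: float) -> None:
        # One-way failure: only the master observes the closed socket.
        self.connected = False
        self.disconnected_at = now

    def on_reregister(self) -> None:
        self.connected = True
        self.disconnected_at = None

    def on_pong(self, slave: Slave, now: float) -> None:
        # The slave is healthy but still marked disconnected. Once the
        # timeout elapses, ask it to re-register instead of removing it.
        if (not self.connected
                and self.disconnected_at is not None
                and now - self.disconnected_at >= self.reregister_timeout):
            slave.on_reregister_request(self)


def simulate() -> Tuple[bool, bool, int]:
    master, slave = Master(), Slave()
    master.on_socket_closed(now=0.0)

    # Health checks succeed, yet nothing triggers a re-registration.
    slave.on_ping()
    master.on_pong(slave, now=5.0)
    stuck = not master.connected

    # After the timeout, the master nudges the slave.
    slave.on_ping()
    master.on_pong(slave, now=12.0)
    return stuck, master.connected, slave.reregistrations
```

In this model, `simulate()` first observes the stuck state (the slave answers pings but the master still considers it disconnected), and then the timeout-driven request brings the two views back in line without removing the slave.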
[jira] [Updated] (MESOS-1668) Handle a temporary one-way master -- slave socket closure.
[ https://issues.apache.org/jira/browse/MESOS-1668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kone updated MESOS-1668:
------------------------------
    Shepherd: Benjamin Mahler

https://reviews.apache.org/r/25867/
[jira] [Updated] (MESOS-1668) Handle a temporary one-way master -- slave socket closure.
[ https://issues.apache.org/jira/browse/MESOS-1668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kone updated MESOS-1668:
------------------------------
    Sprint: Mesos Q3 Sprint 5
    Assignee: Vinod Kone
    Story Points: 2

The plan is to handle this by piggybacking the current slave state (e.g., bool registered) on the ping/pong messages. When the slave receives a ping indicating that the master considers it disconnected, but the slave does not know this yet (the socket broke only on the master side), the slave will attempt to re-register.
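The ping-piggybacking plan described in this update could look roughly like the following. This is a hedged sketch in Python rather than the actual Mesos change: the `Ping` message with its `connected` field, the `Slave` class, and the `on_reregistered` callback are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Ping:
    # Hypothetical field: the master's view of whether the slave is connected.
    connected: bool


class Slave:
    def __init__(self) -> None:
        self.registered = True        # the slave's own view of its state
        self.reregister_attempts = 0

    def handle_ping(self, ping: Ping) -> str:
        # The master says "disconnected" while the slave believes it is
        # registered: the socket broke only on the master side, so start
        # a re-registration.
        if not ping.connected and self.registered:
            self.registered = False
            self.reregister_attempts += 1
        return "PONG"

    def on_reregistered(self) -> None:
        # Called once the master acknowledges the re-registration.
        self.registered = True
```

Because the slave clears `registered` as soon as it starts re-registering, repeated pings that still carry `connected=False` do not trigger additional attempts in this sketch; how retries and backoff would really be handled is left open here.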
[jira] [Updated] (MESOS-1668) Handle a temporary one-way master -- slave socket closure.
[ https://issues.apache.org/jira/browse/MESOS-1668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Mahler updated MESOS-1668:
-----------------------------------

Placing this under reconciliation because, although extremely rare, it can lead to inconsistent state between the master and slave for an arbitrary amount of time. For example, the launchTask message could be dropped as a result of the Master → Slave socket closure in the scenario above.