[jira] [Updated] (MESOS-1668) Handle a temporary one-way master -- slave socket closure.

2014-09-22 Thread Dominic Hamon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-1668:
-
Sprint: Mesos Q3 Sprint 5, Mesos Q3 Sprint 6  (was: Mesos Q3 Sprint 5)

 Handle a temporary one-way master -- slave socket closure.
 ---

 Key: MESOS-1668
 URL: https://issues.apache.org/jira/browse/MESOS-1668
 Project: Mesos
  Issue Type: Bug
  Components: master, slave
Reporter: Benjamin Mahler
Assignee: Vinod Kone
Priority: Minor
  Labels: reliability

 In MESOS-1529, we realized that a slave can remain marked as disconnected 
 in the master if the following occurs:
 → Master and slave are connected and operating normally.
 → A temporary one-way network failure breaks the master→slave link.
 → The master marks the slave as disconnected.
 → The network is restored and health checking continues normally, so the 
 slave is not removed. The slave does not attempt to re-register because it 
 is receiving pings once again.
 → The slave remains disconnected according to the master, and the slave does 
 not try to re-register. Bad!
 We originally considered using a failover timeout in the master to remove 
 slaves that don't re-register. However, that can be dangerous when ZooKeeper 
 issues are preventing slaves from re-registering with the master; we do not 
 want to remove a large number of slaves in that situation.
 Rather, when the slave is passing health checks but does not re-register 
 within a timeout, the master could send the slave a message telling it that 
 it must re-register. The master could also send this message when it 
 receives status updates (or other messages) from slaves that it considers 
 disconnected.
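
The master-side proposal above can be sketched roughly as follows. This is a minimal illustration in Python, not Mesos code: the class names, the `REREGISTER_REQUEST` message name, and the 30-second timeout are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Illustrative value only; the issue does not specify a timeout.
REREGISTER_TIMEOUT = 30.0  # seconds


@dataclass
class SlaveState:
    connected: bool = True
    disconnected_at: float = 0.0   # when the master saw the socket close
    last_pong_at: float = 0.0      # last successful health check


@dataclass
class Master:
    slaves: Dict[str, SlaveState] = field(default_factory=dict)
    # Outgoing (slave_id, message) pairs, recorded here instead of sent.
    outbox: List[Tuple[str, str]] = field(default_factory=list)

    def on_socket_closed(self, slave_id: str, now: float) -> None:
        state = self.slaves[slave_id]
        state.connected = False
        state.disconnected_at = now

    def on_pong(self, slave_id: str, now: float) -> None:
        self.slaves[slave_id].last_pong_at = now

    def on_reregister(self, slave_id: str) -> None:
        self.slaves[slave_id].connected = True

    def on_status_update(self, slave_id: str) -> None:
        # Any message from a slave the master considers disconnected
        # also prompts a re-registration request.
        if not self.slaves[slave_id].connected:
            self.outbox.append((slave_id, "REREGISTER_REQUEST"))

    def check(self, now: float) -> None:
        # Ask (rather than remove) slaves that pass health checks but
        # have not re-registered within the timeout.
        for slave_id, state in self.slaves.items():
            healthy = state.last_pong_at > state.disconnected_at
            overdue = now - state.disconnected_at >= REREGISTER_TIMEOUT
            if not state.connected and healthy and overdue:
                self.outbox.append((slave_id, "REREGISTER_REQUEST"))
```

Because the master only asks the slave to re-register and never removes it, a ZooKeeper outage that blocks re-registration does not cause slaves to be dropped in bulk.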



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-1668) Handle a temporary one-way master -- slave socket closure.

2014-09-19 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-1668:
--
Shepherd: Benjamin Mahler

https://reviews.apache.org/r/25867/



[jira] [Updated] (MESOS-1668) Handle a temporary one-way master -- slave socket closure.

2014-09-11 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-1668:
--
  Sprint: Mesos Q3 Sprint 5
Assignee: Vinod Kone
Story Points: 2

The plan is to handle this by piggybacking the slave's current state as seen by 
the master (e.g., bool registered) on the ping/pong messages.

When the slave receives a ping indicating that the master considers it 
disconnected, but the slave does not know this yet (the socket broke only on 
the master side), the slave will attempt to re-register.



[jira] [Updated] (MESOS-1668) Handle a temporary one-way master -- slave socket closure.

2014-08-06 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-1668:
---


Placing this under reconciliation because, although extremely rare, it can lead 
to inconsistent state between the master and slave for an arbitrary amount of 
time. For example, a launchTask message can be dropped as a result of the 
Master → Slave socket closure in the scenario above.

 Handle a temporary one-way master -- slave socket closure.
 ---

 Key: MESOS-1668
 URL: https://issues.apache.org/jira/browse/MESOS-1668
 Project: Mesos
  Issue Type: Bug
  Components: master, slave
Reporter: Benjamin Mahler
Priority: Minor
  Labels: reliability



