[ 
https://issues.apache.org/jira/browse/MESOS-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14044782#comment-14044782
 ] 

Tobias Weingartner commented on MESOS-1529:
-------------------------------------------

Reading point #3 above, I believe you mean "<=".  Otherwise you could wait 
forever for a ping that will arrive at some point in the future.  :)

I think in the end, the most robust solution will be for the master not to be 
responsible for initiating/opening any connections to frameworks and/or slaves. 
If we do this, then staying connected would be the slave's (framework's) 
responsibility.

For example, using the "HTTP CONNECT" method, a slave could request direct 
access to a master's particular pid endpoint, something like:
{noformat}
CONNECT pid1@master HTTP/1.0
Content-Type: application/x-mesos-protobuf-v1
Authorization: token="...", ...

{noformat}

With the server responding (only during connection setup) with:
{noformat}
HTTP/1.1 200 Connection established
X-Welcome-Message: Welcome to the cloud

{noformat}

At this point, the connection moves to a pure binary TCP connection, which the 
master can now use to send protobuf-over-TCP requests (including ping/pong, 
etc.).  If multiple pid endpoints are required, then their endpoints could 
possibly be multiplexed over this single link.  Instead of connecting directly 
to a particular pid, you could connect to a mux pid, and the messages would 
then be shunted to the correct pids.  Not sure if this makes any sense.
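The mux idea above could be sketched roughly as follows. This is purely illustrative and not Mesos code: the frame layout (length-prefixed pid, then length-prefixed payload), the pid strings, and the handler registry are all assumptions, not an actual wire format.

```python
import struct

def encode_frame(pid, payload):
    """Hypothetical frame: 2-byte pid length, pid, 4-byte payload length, payload."""
    pid_bytes = pid.encode("utf-8")
    return (struct.pack("!H", len(pid_bytes)) + pid_bytes +
            struct.pack("!I", len(payload)) + payload)

def decode_frame(buf):
    """Return (pid, payload, remaining bytes)."""
    (pid_len,) = struct.unpack_from("!H", buf, 0)
    pid = buf[2:2 + pid_len].decode("utf-8")
    (payload_len,) = struct.unpack_from("!I", buf, 2 + pid_len)
    start = 2 + pid_len + 4
    return pid, buf[start:start + payload_len], buf[start + payload_len:]

# The mux: shunt each decoded frame to the handler registered for its pid.
handlers = {}

def demux(buf):
    while buf:
        pid, payload, buf = decode_frame(buf)
        handlers.get(pid, lambda p: None)(payload)
```

With something like this, a single slave-initiated connection could carry traffic for several pids, and the master-side mux dispatches each message without ever opening a connection back to the slave.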

Anyways, I gather this would be a rather large re-write, and changing protocols 
in a live system is... well, "interesting".
Note: RFC 6455 (WebSockets) might be another option, albeit much more involved...
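For comparison, the RFC 6455 opening handshake is similar in spirit: an HTTP exchange that upgrades to a persistent, framed, bidirectional connection (the key/accept values below are the example pair from the RFC itself; the request path is illustrative):
{noformat}
GET /slave(1) HTTP/1.1
Host: master.example.com
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==
Sec-WebSocket-Version: 13

HTTP/1.1 101 Switching Protocols
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Accept: s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
{noformat}
It also gives ping/pong control frames for free, which is exactly the health-check traffic at issue here, at the cost of pulling in the whole WebSocket framing layer.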

> Handle a network partition between Master and Slave
> ---------------------------------------------------
>
>                 Key: MESOS-1529
>                 URL: https://issues.apache.org/jira/browse/MESOS-1529
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Dominic Hamon
>
> If a network partition occurs between a Master and Slave, the Master will 
> remove the Slave (as it fails health check) and mark the tasks being run 
> there as LOST. However, the Slave is not aware that it has been removed so 
> the tasks will continue to run.
> (To clarify a little bit: neither the master nor the slave receives an 
> 'exited' event, indicating that the connection between the master and slave 
> is not closed).
> There are at least two possible approaches to solving this issue:
> 1. Introduce a health check from Slave to Master so they have a consistent 
> view of a network partition. We may still see this issue should a one-way 
> connection error occur.
> 2. Be less aggressive about marking tasks and Slaves as lost. Wait until the 
> Slave reappears and reconcile then. We'd still need to mark Slaves and tasks 
> as potentially lost (zombie state) but maybe the Scheduler can make a more 
> intelligent decision.



--
This message was sent by Atlassian JIRA
(v6.2#6252)