[ https://issues.apache.org/jira/browse/MESOS-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14044782#comment-14044782 ]
Tobias Weingartner commented on MESOS-1529:
-------------------------------------------

Reading point #3 above, I believe you mean "<=". Otherwise you could wait forever for a ping that will arrive at some point in the future. :)

I think in the end, the most robust solution is for the master not to be responsible for initiating/opening any connections to frameworks and/or slaves. If we do this, then staying connected becomes the slave's (framework's) responsibility.

For example, using the HTTP CONNECT method, a slave could request direct access to a particular pid endpoint on the master, something like:

{noformat}
CONNECT pid1@master HTTP/1.0
Content-Transfer-Encoding: application/x-mesos-protobuf-v1
Authorization: token="...", ...
{noformat}

with the master responding (only while the connection is being established) with:

{noformat}
HTTP/1.1 200 Connection established
X-Welcome-Message: Welcome to the cloud
{noformat}

At this point the connection becomes a pure binary TCP connection, which the master can then use to send protobuf-over-TCP requests, including ping/pong, etc. (A minimal sketch of this handshake appears below.)

If multiple pid endpoints are required, they could be multiplexed over this single link: instead of connecting directly to a particular pid, you connect to a mux pid, and the messages are then shunted to the correct pids. (A sketch of one possible framing follows the handshake example.) Not sure if this makes any sense. Anyway, I gather this would be a rather large rewrite, and changing protocols in a live system is... well, "interesting".

Note: RFC 6455 (WebSockets) might be another option, albeit much more involved...

> Handle a network partition between Master and Slave
> ---------------------------------------------------
>
>                 Key: MESOS-1529
>                 URL: https://issues.apache.org/jira/browse/MESOS-1529
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Dominic Hamon
>
> If a network partition occurs between a Master and Slave, the Master will
> remove the Slave (as it fails the health check) and mark the tasks being run
> there as LOST. However, the Slave is not aware that it has been removed, so
> the tasks will continue to run.
> (To clarify a little bit: neither the master nor the slave receives an
> 'exited' event, indicating that the connection between the master and slave
> is not closed).
> There are at least two possible approaches to solving this issue:
> 1. Introduce a health check from Slave to Master so they have a consistent
> view of a network partition. We may still see this issue should a one-way
> connection error occur.
> 2. Be less aggressive about marking tasks and Slaves as lost. Wait until the
> Slave reappears and reconcile then. We'd still need to mark Slaves and tasks
> as potentially lost (zombie state), but maybe the Scheduler can make a more
> intelligent decision.
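To make the handshake above concrete, here is a minimal sketch (plain POSIX sockets, C++) of how a slave might open the connection, issue the CONNECT request shown in the comment, and then switch to a binary stream. The master address and port, the token, and the length-prefixed framing are assumptions for illustration only; none of this is existing Mesos or libprocess code.

{noformat}
// Sketch of the proposed CONNECT-style handshake, slave side.
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <iostream>
#include <string>

int main() {
  int fd = socket(AF_INET, SOCK_STREAM, 0);
  sockaddr_in master{};
  master.sin_family = AF_INET;
  master.sin_port = htons(5050);                        // assumed master port
  inet_pton(AF_INET, "192.0.2.1", &master.sin_addr);    // assumed master address

  if (connect(fd, reinterpret_cast<sockaddr*>(&master), sizeof(master)) != 0) {
    perror("connect");
    return 1;
  }

  // 1. Request direct access to a particular pid endpoint on the master.
  const std::string request =
      "CONNECT pid1@master HTTP/1.0\r\n"
      "Content-Transfer-Encoding: application/x-mesos-protobuf-v1\r\n"
      "Authorization: token=\"...\"\r\n"
      "\r\n";
  write(fd, request.data(), request.size());

  // 2. Read the "200 Connection established" response headers.
  char response[1024] = {0};
  read(fd, response, sizeof(response) - 1);
  if (std::strstr(response, "200") == nullptr) {
    std::cerr << "handshake refused: " << response << std::endl;
    return 1;
  }

  // 3. From here on the socket is a plain bidirectional TCP stream; send a
  //    length-prefixed binary frame (e.g. a serialized protobuf message).
  const std::string payload = "...serialized ping message...";
  uint32_t length = htonl(static_cast<uint32_t>(payload.size()));
  write(fd, &length, sizeof(length));
  write(fd, payload.data(), payload.size());

  close(fd);
  return 0;
}
{noformat}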
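And a sketch of the mux idea: one possible frame layout that tags each message with its destination pid, so that a mux endpoint on the far side could shunt it to the right actor. The wire format here (pid length, pid, payload length, payload) is an assumption for illustration; Mesos does not define such a format.

{noformat}
// Sketch of multiplexing several pid endpoints over a single link.
#include <arpa/inet.h>
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// Encode a message addressed to `pid` into a self-describing frame.
std::vector<char> encodeFrame(const std::string& pid, const std::string& payload) {
  std::vector<char> frame;

  uint32_t pidLength = htonl(static_cast<uint32_t>(pid.size()));
  uint32_t payloadLength = htonl(static_cast<uint32_t>(payload.size()));

  frame.insert(frame.end(),
               reinterpret_cast<char*>(&pidLength),
               reinterpret_cast<char*>(&pidLength) + sizeof(pidLength));
  frame.insert(frame.end(), pid.begin(), pid.end());
  frame.insert(frame.end(),
               reinterpret_cast<char*>(&payloadLength),
               reinterpret_cast<char*>(&payloadLength) + sizeof(payloadLength));
  frame.insert(frame.end(), payload.begin(), payload.end());

  return frame;
}

int main() {
  // Two messages destined for different pids, carried over the same connection.
  std::vector<char> a = encodeFrame("slave(1)@10.0.0.2:5051", "...protobuf...");
  std::vector<char> b = encodeFrame("master@10.0.0.1:5050", "...protobuf...");
  std::cout << "frame sizes: " << a.size() << ", " << b.size() << std::endl;
  return 0;
}
{noformat}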
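Separately, for approach #1 in the description (a health check from Slave to Master), a rough sketch of the slave-side timeout logic. The 75-second window, the names, and the loop structure are placeholders rather than how libprocess actually schedules timers; note the "<=" so the check fires once the deadline is reached instead of waiting on a strictly later ping.

{noformat}
// Sketch of a slave-side master health check with a timeout.
#include <chrono>
#include <iostream>
#include <thread>

using Clock = std::chrono::steady_clock;

int main() {
  const auto timeout = std::chrono::seconds(75);   // assumed health-check window
  Clock::time_point lastPong = Clock::now();       // updated whenever a pong arrives

  // One iteration of the periodic check; a real slave would drive this from
  // its actor's timer rather than sleeping.
  std::this_thread::sleep_for(std::chrono::seconds(1));

  if (lastPong + timeout <= Clock::now()) {
    // No pong within the window: assume the master is unreachable and try to
    // re-register with the (possibly newly elected) master.
    std::cout << "master health check failed; attempting re-registration" << std::endl;
  } else {
    std::cout << "master healthy" << std::endl;
  }
  return 0;
}
{noformat}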