[jira] [Commented] (MESOS-1529) Handle a network partition between Master and Slave

Vinod Kone (JIRA) Wed, 25 Jun 2014 13:04:24 -0700

    [ 
https://issues.apache.org/jira/browse/MESOS-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14043969#comment-14043969
 ]


Vinod Kone commented on MESOS-1529:
-----------------------------------

{quote}
 It's not clear to me why (2) is required.
{quote}

This is mainly to speed up re-registration instead of waiting for the 
timeout to elapse. It is useful when the slave --> master link is broken 
but the slave --> ZK link is fine.

{quote}
Will (3) also check the ping is from the leading master and trigger 
re-registration if a ping is received from a different master?
{quote}

It will definitely count a ping as successful only if it is from the leading 
master. If the slave receives a ping from a non-leading master, it means that 
the slave --> master link is broken while the master --> slave link is fine. 
In this case a re-registration should already be in progress. If the ping was 
a delayed ping from an old master, the slave should have already 
re-registered, or be re-registering, with the new master.
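
As a rough illustration of the ping rule described above, here is a minimal 
Python sketch. It is not the Mesos implementation: `SlaveState`, `on_ping`, 
and the use of plain strings as master identifiers are hypothetical names 
chosen for this example only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SlaveState:
    # Hypothetical: identifier of the master that ZK last reported as leader.
    leading_master: Optional[str]
    # Hypothetical: timestamp (seconds) of the last ping counted as successful.
    last_ping_time: float = 0.0


def on_ping(state: SlaveState, sender: str, now: float) -> bool:
    """Count a ping as successful only if it comes from the leading master.

    Pings from any other master (non-leading, or a delayed ping from an old
    master) are ignored: they do not refresh the timer and do not trigger
    re-registration by themselves.
    """
    if sender != state.leading_master:
        return False
    state.last_ping_time = now
    return True
```

With this shape, an ignored ping leaves `last_ping_time` untouched, so the 
slave's ping timeout still fires if the leading master stays silent.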

{quote}
2) What does an "exit" event signify? Why would we need to check that it was 
for a leading master?
{quote}

An "exited" event signifies that the link between slave --> master is broken. 
This could be due to a network partition or a master failover. We need to 
check whether it was for the leading master because, before the "exited" 
event is received by the slave, the slave might have received a "new master 
detected" event from ZK and re-registered with a new master. In that case, 
the slave can safely ignore the "exited" event.
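
The ordering concern above can be sketched as follows. This is only an 
illustrative Python model under stated assumptions, not Mesos code: the 
`Slave` class, its method names, and the returned strings are hypothetical, 
and re-registration with a newly detected master is assumed to succeed 
immediately for brevity.

```python
class Slave:
    def __init__(self, leading_master: str) -> None:
        # Hypothetical: the master this slave currently treats as leader.
        self.leading_master = leading_master
        self.registered = True

    def on_new_master_detected(self, master: str) -> None:
        # Assumption: re-registration with the new leader succeeds at once.
        self.leading_master = master
        self.registered = True

    def on_exited(self, master: str) -> str:
        # An "exited" event for a master that is no longer the leader is
        # stale: the slave has already moved on, so it can be ignored.
        if master != self.leading_master:
            return "ignored"
        # The link to the current leader is broken: re-register.
        self.registered = False
        return "re-register"
```

If "new master detected" arrives before the "exited" event for the old 
master, `on_exited` returns "ignored" and the registration with the new 
master is left alone.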

{quote}
3) How is the 75 seconds determined?
{quote}

The timeout should be greater than 75s, which is the timeout used by the 
master to remove a slave, so that slaves don't overwhelm the masters with 
re-registration attempts when the master likely hasn't even removed them. 
The greater the value, the longer it will take for the master and slave to 
reconcile. We can make it configurable and let operators choose.
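
To make the trade-off concrete, here is a small Python sketch of a 
configurable slave-side ping timeout. The 75s figure comes from the comment 
above; the constant name, `is_recommended_timeout`, and `should_reregister` 
are hypothetical and do not correspond to actual Mesos flags or functions.

```python
# From the discussion: the master removes a slave after 75 seconds.
MASTER_SLAVE_REMOVAL_TIMEOUT_SECS = 75.0


def is_recommended_timeout(slave_ping_timeout_secs: float) -> bool:
    """True if the slave waits longer than the master's removal timeout."""
    return slave_ping_timeout_secs > MASTER_SLAVE_REMOVAL_TIMEOUT_SECS


def should_reregister(last_ping_time: float, now: float,
                      slave_ping_timeout_secs: float) -> bool:
    """True once no ping from the leading master was counted in the window."""
    return now - last_ping_time > slave_ping_timeout_secs
```

For example, with a hypothetical operator-chosen value of 90s, a slave 
whose last counted ping was at t=0 would not attempt re-registration at 
t=80 but would at t=91. A larger value reduces needless attempts but 
delays reconciliation by the same amount.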

{quote}
Does this lock us into a phased upgrade path if this timeout value needs to 
change?
{quote}

I don't see why it would lock us into an upgrade path.

{quote}
 If we get a ping from a non-leading master, we should likely ignore it and not 
immediately trigger re-registration. IE: let the timeout take effect.
{quote}

Yes, we will ignore it. See above.

> Handle a network partition between Master and Slave
> ---------------------------------------------------
>
>                 Key: MESOS-1529
>                 URL: https://issues.apache.org/jira/browse/MESOS-1529
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Dominic Hamon
>
> If a network partition occurs between a Master and Slave, the Master will 
> remove the Slave (as it fails health check) and mark the tasks being run 
> there as LOST. However, the Slave is not aware that it has been removed so 
> the tasks will continue to run.
> (To clarify a little bit: neither the master nor the slave receives 'exited' 
> event, indicating that the connection between the master and slave is not 
> closed).
> There are at least two possible approaches to solving this issue:
> 1. Introduce a health check from Slave to Master so they have a consistent 
> view of a network partition. We may still see this issue should a one-way 
> connection error occur.
> 2. Be less aggressive about marking tasks and Slaves as lost. Wait until the 
> Slave reappears and reconcile then. We'd still need to mark Slaves and tasks 
> as potentially lost (zombie state) but maybe the Scheduler can make a more 
> intelligent decision.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
