[ 
https://issues.apache.org/jira/browse/MESOS-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14043985#comment-14043985
 ] 

Tobias Weingartner commented on MESOS-1529:
-------------------------------------------

{quote}
An "exited" event signifies that a link between slave --> master is broken. 
This could be due to network partition or master failover. We need to check if 
it was from the leading master because, before "exited" event is received by 
the slave, the slave might have received a "new master detected" event from zk 
and re-registered with a new master. In that case, the slave can safely ignore 
the "exited" event.
{quote}
This sounds like a race: during a master fail-over, a slave may be connected 
to more than one master at the same time.
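
To make the ordering concrete, here is a minimal sketch of the check described 
in the quote.  The class and method names are hypothetical (this is not the 
Mesos API), and a master is identified only by an address string, which is an 
assumption.  It does not show that the check is race-free; it only spells out 
which "exited" events would be ignored.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Slave:
    """Toy model of a slave tracking which master it is registered with."""

    leader: Optional[str] = None  # address of the most recently detected master
    registered: bool = False

    def on_new_master_detected(self, master: str) -> None:
        # The detector (e.g. zk) announced a new leader; re-register with it.
        self.leader = master
        self.registered = True

    def on_exited(self, master: str) -> bool:
        """Handle a broken link to `master`; return True if it was acted on."""
        if master != self.leader:
            # Stale event from a master we already moved away from: ignore it.
            return False
        # The link to the current leader broke: consider ourselves disconnected.
        self.registered = False
        return True
```

If the slave has already re-registered with a new master, a late "exited" for 
the old one is dropped; only an "exited" from the current leader changes state.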

{quote}
 | Does this lock us into a phased upgrade path if this timeout value needs to 
change?
I don't see why it would lock us into an upgrade path.
{quote}
What I meant was: suppose the operator decides that a 75s delay is too long or 
too short and needs to change it in a running cluster.  As it stands, deploying 
that change looks more involved, possibly requiring coordination across 
thousands of machines.  If the option is not surfaced to the operator (no 
flags, etc.), then if/when this single static number changes (for example, 
becoming adaptive based on the number of slaves), the modification will likely 
require a lot of planning and prep.

I see this as having a constant in two places without one informing the other 
what the constant should be.  When it changes in one place, it can have a 
detrimental effect on the rest of the system.  Say a "new" master release goes 
with 150s pings due to load issues: if the masters roll before all the slaves 
have rolled to the new code, they'll end up flapping, etc.
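
One way to avoid keeping the constant in two places would be for the master to 
tell the slave its settings at registration time.  The sketch below is only an 
illustration of that idea: `PingConfig`, `handle_register`, and 
`register_with` are hypothetical names, and the split of 75s into a 15s 
interval times 5 missed pings is an assumption chosen to match the figure 
above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PingConfig:
    """Hypothetical health-check settings owned by the master."""

    ping_interval_secs: float = 15.0  # assumed split of the 75s figure
    max_missed_pings: int = 5

    @property
    def timeout_secs(self) -> float:
        return self.ping_interval_secs * self.max_missed_pings


class Master:
    def __init__(self, config: PingConfig) -> None:
        self.config = config

    def handle_register(self) -> PingConfig:
        # Hypothetical: the registration reply carries the master's settings.
        return self.config


class Slave:
    def __init__(self) -> None:
        # Built-in default, used only until a master says otherwise.
        self.master_timeout_secs = PingConfig().timeout_secs

    def register_with(self, master: Master) -> None:
        # Adopt whatever the master advertises instead of a local constant.
        self.master_timeout_secs = master.handle_register().timeout_secs
```

With something like this, a master rolled out with 150s pings would inform 
old and new slaves alike when they (re-)register, so the two sides could not 
drift apart just because they run different builds.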

> Handle a network partition between Master and Slave
> ---------------------------------------------------
>
>                 Key: MESOS-1529
>                 URL: https://issues.apache.org/jira/browse/MESOS-1529
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Dominic Hamon
>
> If a network partition occurs between a Master and Slave, the Master will 
> remove the Slave (as it fails health check) and mark the tasks being run 
> there as LOST. However, the Slave is not aware that it has been removed so 
> the tasks will continue to run.
> (To clarify a little bit: neither the master nor the slave receives 'exited' 
> event, indicating that the connection between the master and slave is not 
> closed).
> There are at least two possible approaches to solving this issue:
> 1. Introduce a health check from Slave to Master so they have a consistent 
> view of a network partition. We may still see this issue should a one-way 
> connection error occur.
> 2. Be less aggressive about marking tasks and Slaves as lost. Wait until the 
> Slave reappears and reconcile then. We'd still need to mark Slaves and tasks 
> as potentially lost (zombie state) but maybe the Scheduler can make a more 
> intelligent decision.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
