[ 
https://issues.apache.org/jira/browse/MESOS-3280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14900930#comment-14900930
 ] 

Neil Conway commented on MESOS-3280:
------------------------------------

I've been looking into this. Current status:

* There definitely seems to be a bug in the auto-initialization logic of the 
replicated log implementation. I'm working to get a deterministic test case and 
a proposed fix.
* The auto-init bug actually doesn't have anything to do with network 
partitions: you can reproduce it by just starting three Mesos masters at the 
same time, although it only shows up when the network messages happen to be 
delivered in a particular order. This is the cause of the "Master fails to 
fetch the replicated log even before the network partition" issues noted above 
and in the Chronos ticket.
* As far as I can tell, Mesos does not behave incorrectly during/after 
a network partition. Some of the "Recovery failed: Failed to recover registrar" 
errors that occur _after_ the partition has been healed seem to arise because 
the Jepsen scripts don't promptly restart a master that exits when it loses 
leadership.
* There is definitely some room for improvement in how we handle losing 
leadership: for example, because it is expected behavior, we shouldn't print a 
stack trace. It also seems like if the current leader is in the majority 
partition after a network partition, it still views this as a "leadership 
change" event and exits, even though that should be avoidable. TBD whether that 
behavior should be improved.
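
For anyone who wants to try the "start three masters at the same time" 
reproduction, here is a minimal sketch of a launcher. It is not from the ticket 
and is not a deterministic test case: the ZooKeeper address, the ports, the 
work directories, and the helper names master_command and launch_all are all 
assumptions made for illustration. Only the mesos-master flags themselves 
(--ip, --port, --zk, --quorum, --work_dir) are standard.

```python
import subprocess

# Assumptions (not from the ticket): a ZooKeeper instance on localhost,
# three masters on one host told apart by port, and scratch work dirs.
ZK_URL = "zk://127.0.0.1:2181/mesos"
PORTS = [5050, 5051, 5052]


def master_command(port, zk_url=ZK_URL, quorum=2):
    """Build the argv for one mesos-master instance (hypothetical helper)."""
    return [
        "mesos-master",
        "--ip=127.0.0.1",
        f"--port={port}",
        f"--zk={zk_url}",
        f"--quorum={quorum}",  # 2 of 3 masters form a majority
        f"--work_dir=/tmp/mesos-master-{port}",  # fresh dir, so the log auto-initializes
    ]


def launch_all(ports=PORTS):
    """Start all masters back to back so their log recovery overlaps."""
    return [subprocess.Popen(master_command(p)) for p in ports]
```

Calling launch_all() with empty work directories and then checking each 
master's log for the recovery failure may take several attempts, since the 
failure depends on message timing.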

> Master fails to access replicated log after network partition
> -------------------------------------------------------------
>
>                 Key: MESOS-3280
>                 URL: https://issues.apache.org/jira/browse/MESOS-3280
>             Project: Mesos
>          Issue Type: Bug
>          Components: master, replicated log
>    Affects Versions: 0.23.0
>         Environment: Zookeeper version 3.4.5--1
>            Reporter: Bernd Mathiske
>            Assignee: Neil Conway
>              Labels: mesosphere
>
> In a 5 node cluster with 3 masters and 2 slaves, and ZK on each node, when a 
> network partition is forced, all the masters apparently lose access to their 
> replicated log. The leading master halts. Unknown reasons, but presumably 
> related to replicated log access. The others fail to recover from the 
> replicated log. Unknown reasons. This could have to do with ZK setup, but it 
> might also be a Mesos bug. 
> This was observed in a Chronos test drive scenario described in detail here:
> https://github.com/mesos/chronos/issues/511
> With setup instructions here:
> https://github.com/mesos/chronos/issues/508



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
