[ https://issues.apache.org/jira/browse/MESOS-3280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14900930#comment-14900930 ]
Neil Conway commented on MESOS-3280:
------------------------------------

I've been looking into this. Current status:

* There definitely seems to be a bug in the auto-initialization logic of the replicated log implementation. I'm working on a deterministic test case and a proposed fix.
* The auto-init bug actually doesn't have anything to do with network partitions: it can be reproduced by simply starting three Mesos masters at the same time, although hitting it depends on the schedule of network messages (see the reproduction sketch at the bottom of this page). This is the cause of the "Master fails to fetch the replicated log even before the network partition" issues noted above and in the Chronos ticket.
* As far as I can tell, Mesos does not behave incorrectly during or after a network partition. Some of the "Recovery failed: Failed to recover registrar" errors that occur _after_ the partition has been healed seem to arise because the Jepsen scripts don't promptly restart a master that exits when it loses leadership.
* There is definitely some room for improvement in how we handle losing leadership: for example, because it is expected behavior, we shouldn't print a stack trace. It also seems that if the current leader ends up in the majority partition after a network partition, it still treats this as a "leadership change" event and exits, even though that should be avoidable. TBD whether that behavior should be improved.


> Master fails to access replicated log after network partition
> -------------------------------------------------------------
>
> Key: MESOS-3280
> URL: https://issues.apache.org/jira/browse/MESOS-3280
> Project: Mesos
> Issue Type: Bug
> Components: master, replicated log
> Affects Versions: 0.23.0
> Environment: Zookeeper version 3.4.5--1
> Reporter: Bernd Mathiske
> Assignee: Neil Conway
> Labels: mesosphere
>
> In a 5-node cluster with 3 masters and 2 slaves, and ZooKeeper on each node, when a network partition is forced, all the masters apparently lose access to their replicated log. The leading master halts, for reasons unknown but presumably related to replicated log access. The other masters fail to recover from the replicated log, also for unknown reasons. This could have to do with the ZooKeeper setup, but it might also be a Mesos bug.
> This was observed in a Chronos test drive scenario described in detail here:
> https://github.com/mesos/chronos/issues/511
> With setup instructions here:
> https://github.com/mesos/chronos/issues/508
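
The second bullet in the comment above describes reproducing the auto-initialization bug by starting three masters at the same time. The following is a minimal launcher sketch of that scenario, not taken from the ticket: it assumes `mesos-master` is on the PATH, a local ZooKeeper is reachable at the `--zk` URL shown, and the work directories are writable. The `--zk`, `--quorum`, `--work_dir`, and `--port` flags are standard mesos-master flags; everything else (ports, paths) is illustrative.

{code}
#!/usr/bin/env python3
"""Sketch: launch three Mesos masters simultaneously against one
ZooKeeper ensemble to exercise the replicated-log auto-initialization
race described in the comment above."""

import subprocess
import tempfile

ZK_URL = "zk://127.0.0.1:2181/mesos"   # assumed local ZooKeeper ensemble
QUORUM = 2                             # majority of 3 masters

procs = []
for i in range(3):
    work_dir = tempfile.mkdtemp(prefix="mesos-master-%d-" % i)
    cmd = [
        "mesos-master",
        "--zk=%s" % ZK_URL,
        "--quorum=%d" % QUORUM,
        "--work_dir=%s" % work_dir,
        "--port=%d" % (5050 + i),
    ]
    # Start all three masters as close to simultaneously as possible;
    # the bug depends on the schedule of network messages, so repeated
    # runs may be needed before the bad interleaving is hit.
    procs.append(subprocess.Popen(cmd))

# Wait for the masters; a non-zero exit status during startup is a hint
# that the replicated log failed to auto-initialize.
for p in procs:
    p.wait()
{code}

Because the race depends on message ordering, running this in a loop (tearing down the work directories and ZooKeeper znodes between iterations) is more likely to surface the failure than a single run.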