[jira] [Commented] (MESOS-3280) Master fails to access replicated log after network partition

2015-10-14 Thread Neil Conway (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14957482#comment-14957482
 ] 

Neil Conway commented on MESOS-3280:


Fix for the race condition is here: https://reviews.apache.org/r/39325/

Note that the testing mock needs to be rethought (I'm working out how to do this 
properly), and a few details need discussion (e.g., whether to use a backoff when 
retrying a failed coordinator election).

> Master fails to access replicated log after network partition
> -
>
> Key: MESOS-3280
> URL: https://issues.apache.org/jira/browse/MESOS-3280
> Project: Mesos
>  Issue Type: Bug
>  Components: master, replicated log
>Affects Versions: 0.23.0
> Environment: Zookeeper version 3.4.5--1
>Reporter: Bernd Mathiske
>Assignee: Neil Conway
>  Labels: mesosphere
> Attachments: rep-log-race-cond-logs.tar.gz, 
> rep-log-startup-race-test-1.patch
>
>
> In a 5 node cluster with 3 masters and 2 slaves, and ZK on each node, when a 
> network partition is forced, all the masters apparently lose access to their 
> replicated log. The leading master halts. Unknown reasons, but presumably 
> related to replicated log access. The others fail to recover from the 
> replicated log. Unknown reasons. This could have to do with ZK setup, but it 
> might also be a Mesos bug. 
> This was observed in a Chronos test drive scenario described in detail here:
> https://github.com/mesos/chronos/issues/511
> With setup instructions here:
> https://github.com/mesos/chronos/issues/508



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3280) Master fails to access replicated log after network partition

2015-09-26 Thread Neil Conway (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14909536#comment-14909536
 ] 

Neil Conway commented on MESOS-3280:


To follow up on the bug in the auto-initialization code (first bullet above), 
there's a race condition between the log recovery (auto-initialization) 
protocol and the coordinator election protocol:

* to elect the coordinator, we try to pass an implicit promise (note that 
there's no retry mechanism)
* to recover the log, we do a two-phase broadcast (see RecoverProtocolProcess), 
where each node goes from EMPTY => STARTING => VOTING
* if a node in EMPTY or STARTING state receives a promise request, it silently 
ignores it
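
The silent-ignore behavior above can be captured in a minimal sketch (hypothetical names; the real logic lives in Mesos's replica and coordinator code):

```python
from enum import Enum

# Hypothetical sketch of the recovery-protocol states a replica moves
# through during auto-initialization: EMPTY -> STARTING -> VOTING.
class Status(Enum):
    EMPTY = "EMPTY"
    STARTING = "STARTING"
    VOTING = "VOTING"

def handle_promise(status):
    """A replica acknowledges an (implicit) promise request only once it
    is VOTING; in EMPTY or STARTING the request is silently dropped."""
    return status is Status.VOTING

# Coordinator election passes a single implicit promise with no retry,
# so promises sent while replicas are still EMPTY/STARTING are lost.
print(handle_promise(Status.EMPTY))     # ignored
print(handle_promise(Status.STARTING))  # ignored
print(handle_promise(Status.VOTING))    # acknowledged
```

Since there is no retry, a single unlucky interleaving is enough to leave the election hanging forever.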

Moreover, AFAICS there is no synchronization between starting the log recovery 
protocol and doing coordinator election: coordinator election happens as soon 
as we detect we're the Zk leader (RegistrarProcess::recover(), which calls 
LogWriterProcess::start(), which tries to be elected as the coordinator), 
whereas log recovery/auto-init actually starts earlier (in main() in 
master/main.cpp). We wait on the `recovering` promise *locally* before starting 
coordinator election at the Zk leader, but that doesn't mean that log recovery 
has finished at a quorum of other nodes.
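
The ordering problem can be illustrated with a toy quorum check (an assumed structure for illustration; the actual entry points are RegistrarProcess::recover() and master/main.cpp as described above):

```python
# Toy illustration of the race: the leader finishes its *own* recovery
# (the local `recovering` promise) and immediately attempts coordinator
# election, but a quorum of other replicas may still be mid-recovery.
replicas = {
    "leader": "VOTING",    # local recovery finished
    "node-2": "STARTING",  # still mid-recovery; drops promise requests
    "node-3": "STARTING",  # still mid-recovery; drops promise requests
}
quorum = len(replicas) // 2 + 1  # 2 of 3

acks = sum(1 for s in replicas.values() if s == "VOTING")
elected = acks >= quorum
print(elected)  # False: the single promise round gets no quorum
```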

I'll attach a patch with a test case that makes the race condition more likely 
by having 2 of the 3 nodes sleep before transitioning from STARTING => VOTING. 
I'll also attach a log of an execution that shows the problem; note that you 
need to annotate the replicated log code with a bunch of extra LOG()s to see 
when messages are ignored (this could also be improved).

There are a few different ways we can fix the problem: e.g., by adding a retry to 
the coordinator election protocol, or by ensuring we have a quorum of VOTING 
nodes before trying to elect a coordinator (the latter approach seems like it 
would be quite racy, though). I'll propose a fix shortly.
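
Of the candidate fixes, the retry approach (with an optional backoff) might be sketched roughly like this; `try_elect` is a hypothetical stand-in for one election attempt, not a real Mesos API:

```python
import random
import time

def elect_with_retry(try_elect, max_attempts=10, base_delay=0.05):
    """Retry a failed coordinator election with jittered exponential
    backoff. `try_elect` is a hypothetical callable that returns True
    once a quorum of VOTING replicas acknowledges the implicit promise."""
    delay = base_delay
    for _ in range(max_attempts):
        if try_elect():
            return True
        time.sleep(delay * random.random())  # jitter to avoid lockstep retries
        delay = min(delay * 2, 1.0)          # cap the backoff
    return False

# Example: replicas finish recovery while the first two attempts fail.
attempts = iter([False, False, True])
print(elect_with_retry(lambda: next(attempts)))  # True on the third try
```

The backoff matters because a tight retry loop would flood replicas that are still recovering with promise requests they will only drop.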

Note that there's another possible problem here: depending on the order in 
which messages in the log recovery protocol are observed, a node might actually 
transition from EMPTY => STARTING => RECOVERING, at which point it will do the 
catchup protocol. Per talking with [~jieyu], this seems unexpected, and may be 
problematic. I haven't found a reproducible test case yet, but I'll follow up 
with Jie.



[jira] [Commented] (MESOS-3280) Master fails to access replicated log after network partition

2015-09-21 Thread Neil Conway (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14900930#comment-14900930
 ] 

Neil Conway commented on MESOS-3280:


I've been looking into this. Current status:

* There definitely seems to be a bug in the auto-initialization logic of the 
replicated log implementation. I'm working to get a deterministic test case and 
a proposed fix.
* The auto-init bug actually doesn't have anything to do with network 
partitions: you can reproduce it by just starting three Mesos masters at the 
same time, although it depends on hitting a schedule of network messages. This 
is the cause of "Master fails to fetch the replicated log even before the 
network partition" issues noted above and in the Chronos ticket.
* As far as I can tell, Mesos does not behave incorrectly during/after 
a network partition. Some of the "Recovery failed: Failed to recover registrar" 
errors that occur _after_ the partition has been healed seem to arise because 
the Jepsen scripts don't promptly restart a master that exits when it loses 
leadership.
* There is definitely some room for improvement in how we handle losing 
leadership: for example, because it is expected behavior, we shouldn't print a 
stack trace. It also seems like if the current leader is in the majority 
partition after a network partition, it still views this as a "leadership 
change" event and exits, even though that should be avoidable. TBD whether that 
behavior should be improved.



[jira] [Commented] (MESOS-3280) Master fails to access replicated log after network partition

2015-09-17 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14804028#comment-14804028
 ] 

Jie Yu commented on MESOS-3280:
---

I'd be happy to assist as well. It would be useful to attach the master's log 
(the portions related to the replicated log).



[jira] [Commented] (MESOS-3280) Master fails to access replicated log after network partition

2015-08-18 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14701679#comment-14701679
 ] 

haosdent commented on MESOS-3280:
-

The details of how aphyr tested this are in this article: [Call me Maybe: 
Chronos](https://aphyr.com/posts/326-call-me-maybe-chronos). It is on the HN 
front page now. Another user's Stack Overflow question also seems to be caused 
by this bug: [mesos-master crash with zookeeper 
cluster](http://stackoverflow.com/questions/32044884/mesos-master-crash-with-zookeeper-cluster)
 



[jira] [Commented] (MESOS-3280) Master fails to access replicated log after network partition

2015-08-18 Thread JIRA

[ 
https://issues.apache.org/jira/browse/MESOS-3280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14701016#comment-14701016
 ] 

Gastón Kleiman commented on MESOS-3280:
---

Another set of logs was added to the Chronos issue: 
https://github.com/mesos/chronos/issues/511#issuecomment-131993588

In this new case, the initial Mesos Master fails to fetch the replicated log 
even before the network partition.
