Re: Mesos 0.19 registrar upgrade

2014-07-23 Thread Tomas Barton
Ok, thanks Ben! It would be nice to update the documentation accordingly.

So, in 0.20 there might be a flag specifying the total number of masters?


On 23 July 2014 00:13, Benjamin Mahler benjamin.mah...@gmail.com wrote:

 At the current time, you need an odd number of masters as there is an
 assumption built into the replicated log that the number of masters =
 2*quorum - 1. This assumption is present when bootstrapping the log from no data.

 To recover from this, you need to run an odd number of masters, and set
 your quorum correctly. For example, 3 masters with quorum 2, or 5 masters
 with quorum 3. It is safe to wipe the replica logs before doing this.

 There are some outstanding tickets to clean this up:
 https://issues.apache.org/jira/browse/MESOS-1465
 https://issues.apache.org/jira/browse/MESOS-1546

 We'd like to have the configuration be explicit about the total number of
 masters, so that the assumption need not be made.


Mesos 0.19 registrar upgrade

2014-07-22 Thread Tomas Barton
Hi,

what is the best way to upgrade a Mesos cluster from 0.18 to 0.19? I've tried
to read all the documentation before doing the actual upgrade, but I still
don't understand a few things.

What should be the quorum size?

The --help says that "It is imperative to set this value to be a majority
of masters i.e., quorum > (number of masters)/2".

I have 4 Mesos masters, which would mean that quorum > 2, i.e. quorum=3, right?
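As a quick check of the majority rule quoted from --help (quorum > number of masters / 2), here is a minimal Python sketch. The helper name min_quorum is hypothetical and not part of Mesos; it only computes the smallest integer that is a strict majority.

```python
def min_quorum(num_masters: int) -> int:
    """Hypothetical helper: smallest integer quorum with quorum > num_masters / 2."""
    if num_masters < 1:
        raise ValueError("need at least one master")
    # n // 2 is never greater than n / 2, so adding 1 gives the smallest strict majority.
    return num_masters // 2 + 1


for n in range(1, 8):
    print(f"masters={n} -> minimum quorum={min_quorum(n)}")
```

For 4 masters this gives quorum=3, matching the reasoning above.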

The recover.cpp source says: "we allow a replica in EMPTY status to become
VOTING immediately if it finds ALL (i.e., 2 * quorum - 1) replicas are in
EMPTY status". So, with quorum = 3 I would need 5 Mesos masters (that's just
not clear from the mesos-master --help).
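A minimal Python model of the bootstrap condition quoted from recover.cpp may make the problem visible. It assumes each replica's status is a plain string and that the count of EMPTY replicas includes the replica doing the check; can_bootstrap is a hypothetical helper, not the actual Mesos code.

```python
from typing import List


def can_bootstrap(statuses: List[str], quorum: int) -> bool:
    """Hypothetical model of the quoted rule: an EMPTY replica may become VOTING
    immediately only if ALL replicas, assumed to number 2 * quorum - 1,
    report EMPTY status."""
    expected_replicas = 2 * quorum - 1
    empty_count = sum(1 for status in statuses if status == "EMPTY")
    return empty_count >= expected_replicas


# 3 masters with quorum 2: all three report EMPTY, so bootstrapping can proceed.
print(can_bootstrap(["EMPTY"] * 3, quorum=2))  # True
# 4 masters with quorum 3: the rule expects 5 EMPTY replicas, but only 4 exist.
print(can_bootstrap(["EMPTY"] * 4, quorum=3))  # False
```

Under this model, a 4-master cluster with quorum 3 can never satisfy the "all replicas EMPTY" condition, because the rule assumes a fifth replica that does not exist.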

quorum=1, mesos-masters=1
quorum=2, mesos-masters=3
quorum=3, mesos-masters=5
quorum=4, mesos-masters=7

Is it possible to have an even number of Mesos masters, or is it just a
bad idea?

With 4 masters I got into a situation where:

master 1:
I0722 11:35:40.708562 12689 replica.cpp:638] Replica in VOTING status received a broadcasted recover request

master 2:
I0722 11:36:37.593647  7754 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request

master 3:
I0722 11:35:14.102762 26701 recover.cpp:188] Received a recover response from a replica in STARTING status

master 4:
I0722 11:35:54.284169 32056 replica.cpp:638] Replica in STARTING status received a broadcasted recover request
I0722 11:35:54.284425 32050 recover.cpp:188] Received a recover response from a replica in STARTING status
I0722 11:35:54.284788 32057 recover.cpp:188] Received a recover response from a replica in VOTING status
I0722 11:35:54.285127 32050 recover.cpp:188] Received a recover response from a replica in EMPTY status

And the election algorithm ends up in an endless loop. How can I recover
from this? Delete all replica logs from master disk? Start with quorum=1
and increment number of masters?

Thanks,
Tomas


Re: Mesos 0.19 registrar upgrade

2014-07-22 Thread Dick Davies
On 22 July 2014 10:40, Tomas Barton barton.to...@gmail.com wrote:

 I have 4 Mesos masters, which would mean that quorum > 2, i.e. quorum=3, right?

Yes, that's right. 2 won't be enough.


 quorum=1, mesos-masters=1
 quorum=2, mesos-masters=3
 quorum=3, mesos-masters=5
 quorum=4, mesos-masters=7

 Is it possible to have an even number of Mesos masters, or is it just a bad
 idea?

Yes, it's a bad idea since this change: it has always been a bad idea to run
an even number of ZooKeeper nodes, and now that extends to the Mesos masters.

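4 masters give you no extra redundancy over 3: with a majority quorum, 3 masters
(quorum 2) and 4 masters (quorum 3) can each survive the loss of only one master,
while your likelihood of losing a node increases slightly (as you now have an
extra server that can potentially break).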
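To make the redundancy argument concrete: with a majority quorum, a cluster of n masters can lose n - quorum masters and still reach quorum. A minimal Python sketch (the helper names are hypothetical, not Mesos APIs):

```python
def min_quorum(num_masters: int) -> int:
    """Smallest strict majority of num_masters."""
    return num_masters // 2 + 1


def tolerated_failures(num_masters: int) -> int:
    """Masters that can fail while the survivors still form a majority quorum."""
    return num_masters - min_quorum(num_masters)


for n in range(1, 8):
    print(f"masters={n} quorum={min_quorum(n)} tolerated failures={tolerated_failures(n)}")
```

Going from 3 to 4 masters (or 5 to 6) leaves the number of tolerated failures unchanged, which is why an even count buys nothing.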


Re: Mesos 0.19 registrar upgrade

2014-07-22 Thread Benjamin Mahler
At the current time, you need an odd number of masters as there is an
assumption built into the replicated log that the number of masters =
2*quorum - 1. This assumption is present when bootstrapping the log from no data.

To recover from this, you need to run an odd number of masters, and set
your quorum correctly. For example, 3 masters with quorum 2, or 5 masters
with quorum 3. It is safe to wipe the replica logs before doing this.
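A minimal Python sketch for checking a planned (masters, quorum) pair against both constraints discussed in this thread: the quorum must be a strict majority, and, for bootstrapping the replicated log from no data in 0.19, the number of masters must equal 2*quorum - 1. check_master_config is hypothetical; Mesos does not provide such a function.

```python
from typing import List


def check_master_config(num_masters: int, quorum: int) -> List[str]:
    """Hypothetical validator: returns a list of problems (empty if the pair is OK)."""
    problems = []
    # Majority rule from mesos-master --help: quorum > (number of masters)/2.
    if quorum <= num_masters / 2:
        problems.append(f"quorum={quorum} is not a majority of {num_masters} masters")
    # Bootstrap assumption in the replicated log: masters == 2*quorum - 1.
    if num_masters != 2 * quorum - 1:
        problems.append(
            f"{num_masters} masters != 2*quorum - 1 = {2 * quorum - 1}; "
            "bootstrapping an empty log assumes equality"
        )
    return problems


for masters, quorum in [(3, 2), (5, 3), (4, 3), (4, 2)]:
    print(masters, quorum, check_master_config(masters, quorum))
```

Note that 4 masters with quorum 3 passes the majority check but still fails the bootstrap assumption, which matches the situation described in the original question.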

There are some outstanding tickets to clean this up:
https://issues.apache.org/jira/browse/MESOS-1465
https://issues.apache.org/jira/browse/MESOS-1546

We'd like to have the configuration be explicit about the total number of
masters, so that the assumption need not be made.