In addition to what Dick said, you need to make sure that you have a quorum of masters *online* in order for a master to recover correctly. This means you'll want to run the master under a tool (e.g. Monit) that restarts it promptly upon failure.
You'll want to do this for the slaves as well.

On Thu, Nov 6, 2014 at 11:36 PM, Dick Davies <d...@hellooperator.net> wrote:

> Golden Rule : Don't use even numbers of members with quorum systems.
>
> You need a quorum to function so with 2 masters and quorum=2, you can't
> ever take a member down. With 2 masters and quorum=1, you're asking
> for "split brain".
>
> (this is exactly the same with zookeeper by the way, it's also a quorum
> system)
>
> If you have 1 master, quorum=1
> if you have 3 masters, quorum=2
> if you have 5 masters, quorum=3
>
> and so on. Try that and see if it helps.
>
>
> On 7 November 2014 09:42, sujinzhao <sujinz...@gmail.com> wrote:
> > In fact, I also tried launching 2 masters on two separate machines. At
> > first, one of them was successfully elected as leader, and both of them
> > printed several lines of messages:
> >
> > Replica in EMPTY status received a broadcasted recover request
> > Received a recover response from a replica in EMPTY status
> >
> > then the leading master aborted after outputting errors:
> >
> > Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins
> > *** Check failure stack trace: ***
> > @ 0x7f3c1ea105cd google::LogMessage::Fail()
> > ..............................
> >
> > and next, the second master became the new leader. It also tried to
> > recover from the registrar, but also failed and printed errors before
> > aborting:
> >
> > Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins
> > *** Check failure stack trace: ***
> > @ 0x7f3c1ea105cd google::LogMessage::Fail()
> > ...............................
> >
> > So I guess that's not a zookeeper problem; it's that the elected leader
> > cannot recover from the registrar. Could somebody be kind enough to
> > illustrate some principles of the mesos registry, or give me some
> > suggestions?
> >
> > THANKS.
> >
> >
> > "david.j.palaitis" <david.j.palai...@gmail.com> wrote:
> >
> >
> > With a single master, you should not set quorum=2
> >
> >
> > -------- Original message --------
> > From: sujinzhao <sujinz...@gmail.com>
> > Date: 11/06/2014 4:01 PM (GMT-05:00)
> > To: user@mesos.apache.org
> > Cc:
> > Subject: Problems of running mesos-0.20.0 with zookeeper
> >
> > Hi, all,
> >
> > I set up a zookeeper service with three machines zoo1, zoo2, zoo3, and also
> > installed 1 mesos master and 2 slaves on another three nodes. I tried to
> > run the master and slaves with:
> >
> > ./mesos-master.sh --ip=master-ip --zk=zk://zoo1:2181,zoo2:2181,zoo3:2181/mesos --quorum=2
> >
> > ./mesos-slave.sh --ip=slave-ip --master=zk://zoo1:2181,zoo2:2181,zoo3:2181/mesos
> >
> > I also created the /mesos znode before running the above commands, but I
> > got the following error:
> >
> > Recovering from registrar
> > Recovering registrar
> > Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins
> > *** Check failure stack trace: ***
> > @ 0x7f3c1ea105cd google::LogMessage::Fail()
> > ...............................
> >
> > After reading the master log, I found that before the error occurred, the
> > master had already been elected successfully, but the leader failed to
> > recover from the registrar, so I guess this error has little to do with
> > zookeeper.
> >
> > After googling I found that other people have also encountered this
> > problem, but with no solution. I also excluded passwordless ssh between
> > the master/slave and zookeeper servers as a possible cause.
> >
> > So, could somebody kindly tell me how to solve this error? Any
> > suggestions will be appreciated.
> >
> > THANKS.
>
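
For reference, the quorum values Dick lists (1 master, quorum=1; 3 masters, quorum=2; 5 masters, quorum=3) are what you get from a strict-majority rule. Here is a minimal sketch of that arithmetic, assuming the quorum is always the smallest strict majority of the masters. The helper names `quorum_size` and `tolerated_failures` are made up for illustration and are not part of Mesos.

```python
def quorum_size(num_masters: int) -> int:
    """Smallest strict majority of num_masters (assumed rule, see above)."""
    if num_masters < 1:
        raise ValueError("need at least one master")
    return num_masters // 2 + 1


def tolerated_failures(num_masters: int) -> int:
    """How many masters can be offline while a quorum is still reachable."""
    return num_masters - quorum_size(num_masters)


if __name__ == "__main__":
    # Print the --quorum value each cluster size would use under this rule.
    for n in (1, 2, 3, 4, 5):
        print(f"masters={n} quorum={quorum_size(n)} "
              f"can_lose={tolerated_failures(n)}")
```

Under this rule, 2 masters tolerate no more failures than 1, and 4 tolerate no more than 3, which is the reasoning behind avoiding even-sized clusters. It also matches the original report: 1 master started with --quorum=2 can never reach a quorum, so registrar recovery times out.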