In the logs, two monitors are repeatedly reporting that they won the leader election:
60z0m02 (monitor 0):

2016-07-25 14:31:11.644335 7f8760af7700  0 log_channel(cluster) log [INF] : mon.60z0m02@0 won leader election with quorum 0,2,4
2016-07-25 14:31:44.521552 7f8760af7700  1 mon.60z0m02@0(leader).paxos(paxos recovering c 1318755..1319320) collect timeout, calling fresh election

60zxl02 (monitor 1):

2016-07-25 14:31:59.542346 7fefdeaed700  1 mon.60zxl02@1(electing).elector(11441) init, last seen epoch 11441
2016-07-25 14:32:04.583929 7fefdf4ee700  0 log_channel(cluster) log [INF] : mon.60zxl02@1 won leader election with quorum 1,2,4
2016-07-25 14:32:33.440103 7fefdf4ee700  1 mon.60zxl02@1(leader).paxos(paxos recovering c 1318755..1319319) collect timeout, calling fresh election

On Mon, Jul 25, 2016 at 3:27 PM, Sergio A. de Carvalho Jr. <scarvalh...@gmail.com> wrote:

> Hi,
>
> I have a cluster of 5 hosts running Ceph 0.94.6 on CentOS 6.5. On each
> host, there is 1 monitor and 13 OSDs. We had an issue with the network,
> and for reasons I still don't know, the servers were restarted. One host
> is still down, and the monitors on the 4 remaining servers are failing
> to enter a quorum.
>
> I managed to get a quorum of 3 monitors by stopping all Ceph monitors
> and OSDs across all machines and bringing up the top 3 ranked monitors
> in order of rank.
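A quorum of 3 is what I'd expect there: monitor quorum requires a strict majority of the monmap, not of the monitors that happen to be running. A quick sanity check in Python (monitor counts taken from this thread; the helper function is my own, not anything from Ceph):

```python
# Quorum needs a strict majority of the monitors in the monmap.
def majority(monmap_size: int) -> int:
    """Smallest number of monitors that can form a quorum."""
    return monmap_size // 2 + 1

# 5 monitors in the monmap: quorum needs 3, so the 4 surviving
# monitors should, in principle, be able to elect a leader.
print(majority(5))  # -> 3

# Even if the dead monitor (618yl02) were removed from the monmap,
# a 4-monitor map still needs 3 monitors for a majority.
print(majority(4))  # -> 3
```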
> After a few minutes, the 60z0m02 monitor (the top ranked one) became
> the leader:
>
> {
>     "name": "60z0m02",
>     "rank": 0,
>     "state": "leader",
>     "election_epoch": 11328,
>     "quorum": [
>         0,
>         1,
>         2
>     ],
>     "outside_quorum": [],
>     "extra_probe_peers": [],
>     "sync_provider": [],
>     "monmap": {
>         "epoch": 5,
>         "fsid": "2f51a247-3155-4bcf-9aee-c6f6b2c5e2af",
>         "modified": "2016-04-28 22:26:48.604393",
>         "created": "0.000000",
>         "mons": [
>             {
>                 "rank": 0,
>                 "name": "60z0m02",
>                 "addr": "10.98.2.166:6789\/0"
>             },
>             {
>                 "rank": 1,
>                 "name": "60zxl02",
>                 "addr": "10.98.2.167:6789\/0"
>             },
>             {
>                 "rank": 2,
>                 "name": "610wl02",
>                 "addr": "10.98.2.173:6789\/0"
>             },
>             {
>                 "rank": 3,
>                 "name": "618yl02",
>                 "addr": "10.98.2.214:6789\/0"
>             },
>             {
>                 "rank": 4,
>                 "name": "615yl02",
>                 "addr": "10.98.2.216:6789\/0"
>             }
>         ]
>     }
> }
>
> The other 2 monitors became peons:
>
>     "name": "60zxl02",
>     "rank": 1,
>     "state": "peon",
>     "election_epoch": 11328,
>     "quorum": [
>         0,
>         1,
>         2
>     ],
>
>     "name": "610wl02",
>     "rank": 2,
>     "state": "peon",
>     "election_epoch": 11328,
>     "quorum": [
>         0,
>         1,
>         2
>     ],
>
> I then proceeded to start the fourth monitor, 615yl02 (618yl02 is
> powered off), but after more than 2 hours and several election rounds,
> the monitors still haven't reached a quorum. The monitors mostly
> alternate between the "electing" and "probing" states, and they often
> seem to be in different election epochs.
>
> Is this normal?
>
> Is there anything I can do to help the monitors elect a leader? Should
> I manually remove the dead host's monitor from the monitor map?
>
> I left all OSD daemons stopped on purpose while the election is going
> on. Is this the best thing to do? Would bringing the OSDs up help, or
> would it complicate matters even more? Or does it make no difference?
>
> I don't see anything obviously wrong in the monitor logs.
> They're mostly filled with messages like the following:
>
> 2016-07-25 14:17:57.806148 7fc1b3f7e700  1 mon.610wl02@2(electing).elector(11411) init, last seen epoch 11411
> 2016-07-25 14:17:57.829198 7fc1b7caf700  0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd='mon_status' args=[]: dispatch
> 2016-07-25 14:17:57.829200 7fc1b7caf700  0 log_channel(audit) do_log log to syslog
> 2016-07-25 14:17:57.829254 7fc1b7caf700  0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd=mon_status args=[]: finished
>
> Any help would be hugely appreciated.
>
> Thanks,
>
> Sergio
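One more data point from the logs at the top of this message: 60z0m02 and 60zxl02 each claim a quorum (0,2,4 and 1,2,4) that is a legal 3-of-5 majority on its own, which is consistent with elections succeeding and then collapsing at the paxos collect phase rather than failing outright. A throwaway sketch of how I pulled that out of the logs (plain regex over the lines quoted above, nothing Ceph-specific):

```python
import re

# The two "won leader election" lines from my logs above, trimmed
# to the part that matters.
log_lines = [
    "mon.60z0m02@0 won leader election with quorum 0,2,4",
    "mon.60zxl02@1 won leader election with quorum 1,2,4",
]

quorums = []
for line in log_lines:
    m = re.search(r"won leader election with quorum ([\d,]+)", line)
    if m:
        quorums.append({int(rank) for rank in m.group(1).split(",")})

majority = 5 // 2 + 1  # 5 monitors in the monmap -> need 3
for q in quorums:
    # Each claimed quorum is a valid majority by itself...
    assert len(q) >= majority

# ...but the two leaders only agree on monitors 2 and 4, so each new
# election displaces the previous leader.
print(sorted(quorums[0] & quorums[1]))  # -> [2, 4]
```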
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com