In the logs, 2 monitors are constantly reporting that they won the leader
election:

60z0m02 (monitor 0):
2016-07-25 14:31:11.644335 7f8760af7700  0 log_channel(cluster) log [INF] :
mon.60z0m02@0 won leader election with quorum 0,2,4
2016-07-25 14:31:44.521552 7f8760af7700  1 mon.60z0m02@0(leader).paxos(paxos
recovering c 1318755..1319320) collect timeout, calling fresh election

60zxl02 (monitor 1):
2016-07-25 14:31:59.542346 7fefdeaed700  1
mon.60zxl02@1(electing).elector(11441) init, last seen epoch 11441
2016-07-25 14:32:04.583929 7fefdf4ee700  0 log_channel(cluster) log [INF] :
mon.60zxl02@1 won leader election with quorum 1,2,4
2016-07-25 14:32:33.440103 7fefdf4ee700  1 mon.60zxl02@1(leader).paxos(paxos
recovering c 1318755..1319319) collect timeout, calling fresh election
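
To make the pattern easier to see across many election rounds, the excerpts above can be tallied with a small script. This is only a rough sketch: the regular expressions are my own guesses based on the lines quoted here, not an official Ceph log format, and the names LOG, WON, TIMEOUT and summarize are made up for illustration.

```python
import re

# Log excerpts copied from this message, with the mail line wraps preserved.
LOG = """\
2016-07-25 14:31:11.644335 7f8760af7700  0 log_channel(cluster) log [INF] :
mon.60z0m02@0 won leader election with quorum 0,2,4
2016-07-25 14:31:44.521552 7f8760af7700  1 mon.60z0m02@0(leader).paxos(paxos
recovering c 1318755..1319320) collect timeout, calling fresh election
2016-07-25 14:32:04.583929 7fefdf4ee700  0 log_channel(cluster) log [INF] :
mon.60zxl02@1 won leader election with quorum 1,2,4
2016-07-25 14:32:33.440103 7fefdf4ee700  1 mon.60zxl02@1(leader).paxos(paxos
recovering c 1318755..1319319) collect timeout, calling fresh election
"""

# Assumed patterns, derived only from the lines quoted above.
WON = re.compile(
    r"mon\.([^\s@]+)@(\d+) won leader election with quorum ([\d,]+)"
)
TIMEOUT = re.compile(
    r"mon\.([^\s@]+)@(\d+)\(leader\)\.paxos\(paxos recovering c "
    r"(\d+)\.\.(\d+)\) collect timeout"
)


def summarize(log_text):
    """Return (wins, timeouts) found in the given monitor log text."""
    # Collapse all whitespace so entries wrapped across lines match again.
    flat = " ".join(log_text.split())
    wins = [
        (name, int(rank), [int(r) for r in quorum.split(",")])
        for name, rank, quorum in WON.findall(flat)
    ]
    timeouts = [
        (name, int(rank), int(first), int(last))
        for name, rank, first, last in TIMEOUT.findall(flat)
    ]
    return wins, timeouts


if __name__ == "__main__":
    wins, timeouts = summarize(LOG)
    for name, rank, quorum in wins:
        print(f"mon.{name} (rank {rank}) won with quorum {quorum}")
    for name, rank, first, last in timeouts:
        print(f"mon.{name} (rank {rank}) collect timeout, paxos c {first}..{last}")
```

On the excerpts above this finds two wins, each followed by a collect timeout on the same monitor, and in both cases the quorum contains ranks 2 and 4 but not the other would-be leader.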


On Mon, Jul 25, 2016 at 3:27 PM, Sergio A. de Carvalho Jr. <
scarvalh...@gmail.com> wrote:

> Hi,
>
> I have a cluster of 5 hosts running Ceph 0.94.6 on CentOS 6.5. On each
> host, there is 1 monitor and 13 OSDs. We had an issue with the network and
> for some reason (I still don't know why), the servers were restarted.
> One host is still down, but the monitors on the 4 remaining servers are
> failing to enter a quorum.
>
> I managed to get a quorum of 3 monitors by stopping all Ceph monitors and
> OSDs across all machines, and bringing up the top 3 ranked monitors in
> order of rank. After a few minutes, the 60z0m02 monitor (the top ranked
> one) became the leader:
>
> {
>     "name": "60z0m02",
>     "rank": 0,
>     "state": "leader",
>     "election_epoch": 11328,
>     "quorum": [
>         0,
>         1,
>         2
>     ],
>     "outside_quorum": [],
>     "extra_probe_peers": [],
>     "sync_provider": [],
>     "monmap": {
>         "epoch": 5,
>         "fsid": "2f51a247-3155-4bcf-9aee-c6f6b2c5e2af",
>         "modified": "2016-04-28 22:26:48.604393",
>         "created": "0.000000",
>         "mons": [
>             {
>                 "rank": 0,
>                 "name": "60z0m02",
>                 "addr": "10.98.2.166:6789\/0"
>             },
>             {
>                 "rank": 1,
>                 "name": "60zxl02",
>                 "addr": "10.98.2.167:6789\/0"
>             },
>             {
>                 "rank": 2,
>                 "name": "610wl02",
>                 "addr": "10.98.2.173:6789\/0"
>             },
>             {
>                 "rank": 3,
>                 "name": "618yl02",
>                 "addr": "10.98.2.214:6789\/0"
>             },
>             {
>                 "rank": 4,
>                 "name": "615yl02",
>                 "addr": "10.98.2.216:6789\/0"
>             }
>         ]
>     }
> }
>
> The other 2 monitors became peons:
>
> "name": "60zxl02",
>     "rank": 1,
>     "state": "peon",
>     "election_epoch": 11328,
>     "quorum": [
>         0,
>         1,
>         2
>     ],
>
> "name": "610wl02",
>     "rank": 2,
>     "state": "peon",
>     "election_epoch": 11328,
>     "quorum": [
>         0,
>         1,
>         2
>     ],
>
> I then proceeded to start the fourth monitor, 615yl02 (618yl02 is powered
> off), but after more than 2 hours and several election rounds, the monitors
> still haven't reached a quorum. The monitors mostly alternate between the
> "electing" and "probing" states, but they often seem to be in different
> election epochs.
>
> Is this normal?
>
> Is there anything I can do to help the monitors elect a leader? Should I
> manually remove the dead host's monitor from the monitor map?
>
> I purposely left all OSD daemons stopped while the election is going on. Is
> this the best thing to do? Would bringing the OSDs up help or complicate
> matters even more? Or doesn't it make any difference?
>
> I don't see anything obviously wrong in the monitor logs. They're mostly
> filled with messages like the following:
>
> 2016-07-25 14:17:57.806148 7fc1b3f7e700  1
> mon.610wl02@2(electing).elector(11411) init, last seen epoch 11411
> 2016-07-25 14:17:57.829198 7fc1b7caf700  0 log_channel(audit) log [DBG] :
> from='admin socket' entity='admin socket' cmd='mon_status' args=[]: dispatch
> 2016-07-25 14:17:57.829200 7fc1b7caf700  0 log_channel(audit) do_log log
> to syslog
> 2016-07-25 14:17:57.829254 7fc1b7caf700  0 log_channel(audit) log [DBG] :
> from='admin socket' entity='admin socket' cmd=mon_status args=[]: finished
>
> Any help would be hugely appreciated.
>
> Thanks,
>
> Sergio
>
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
