(Just realised I originally replied to Sean directly, so reposting here for posterity).
Bingo! I turned off syslog and the monitors quickly reached quorum, and everything seems back to normal. Thanks so much, Sean. Luckily this is a test cluster. I wonder how I could catch this in a production cluster before our support engineers spend a day trying to track the problem down. Any ideas?

On Tue, Jul 26, 2016 at 12:28 PM, Sean Crosby <richardnixonsh...@gmail.com> wrote:

> Hi Sergio,
>
> You don't happen to have rsyslog forwarding to a central log server by any chance? I've seen this behaviour before when my central log server is not keeping up with messages.
>
> Cheers,
> Sean
>
> On 26 July 2016 at 21:13, Sergio A. de Carvalho Jr. <scarvalh...@gmail.com> wrote:
>
>> I left the 4 nodes running overnight and they slowed to a crawl... to the point that nothing has been written to the logs in the last 11 hours. So I stopped all monitors this morning and started them one by one again, but they are still extremely slow. Here are their logs:
>>
>> https://gist.github.com/anonymous/85213467f701c5a69c7fdb4e54bc7406
>> https://gist.github.com/anonymous/f30a8903e701423825fd4d5aaa651e6a
>> https://gist.github.com/anonymous/42a1856cc819de5b110d9f887e9859d2
>> https://gist.github.com/anonymous/652bc41197e83a9d76cf5b2e6a211aa2
>>
>> I'm still puzzled to see logs being written with a timestamp that is several minutes behind the system clock. As time passes, the gap widens, and quickly the logs are over 10 minutes behind the actual time, which explains why the logs above don't seem to overlap.
>>
>> On Mon, Jul 25, 2016 at 9:37 PM, Sergio A. de Carvalho Jr. <scarvalh...@gmail.com> wrote:
>>
>>> Awesome, thanks so much, Joao.
>>>
>>> Here's the mon_status:
>>> https://gist.github.com/anonymous/2b80a9a75d134d9e539dfbc81615c055
>>>
>>> I'm still trying to collect the logs, but while doing that I noticed that the log records are severely delayed compared to the system clock. For example, watching the logs with tail -f, I see records with a timestamp that is up to 28 minutes behind the system clock!
>>>
>>> Also, while trying to set the debug level, the monitors sometimes hung for several minutes, so there's obviously something wrong with them.
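(A note for posterity, on catching this in production: one option is to alert when a monitor's newest log entry falls behind the wall clock, which is exactly the symptom above. Below is a rough sketch; the log path, threshold, and alerting mechanism are illustrative, so adapt them to your environment.)

    #!/bin/bash
    # Alert if the newest mon log entry lags the system clock by more
    # than LAG_MAX seconds (threshold is illustrative).
    LAG_MAX=120
    LOG="/var/log/ceph/ceph-mon.$(hostname -s).log"  # adjust to your naming
    # Ceph log lines start with "YYYY-MM-DD HH:MM:SS.micros ..."; strip the
    # fractional seconds so date(1) can parse the timestamp.
    last_ts=$(tail -n 1 "$LOG" | awk '{print $1, $2}' | cut -d. -f1)
    last_epoch=$(date -d "$last_ts" +%s 2>/dev/null) || exit 0
    lag=$(( $(date +%s) - last_epoch ))
    if [ "$lag" -gt "$LAG_MAX" ]; then
        echo "WARNING: mon log on $(hostname -s) is ${lag}s behind the clock"
    fi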
>>> On Mon, Jul 25, 2016 at 6:16 PM, Joao Eduardo Luis <j...@suse.de> wrote:
>>>
>>>> On 07/25/2016 05:55 PM, Sergio A. de Carvalho Jr. wrote:
>>>>
>>>>> I just forced an NTP update on all hosts to rule out clock skew. I also checked that all hosts can reach all other hosts on port 6789.
>>>>>
>>>>> I then stopped monitor 0 (60z0m02) and started monitor 1 (60zxl02), but the 3 remaining monitors (1 - 60zxl02, 2 - 610wl02, 4 - 615yl02) were still having problems reaching quorum. That led me to believe monitor 4 was the problem, because I had a quorum before with monitors 0, 1, 2.
>>>>>
>>>>> So I stopped monitor 4 and started monitor 0 again, but this time monitors 0, 1, 2 failed to reach a quorum, which is rather puzzling.
>>>>>
>>>>> All hosts are pretty much idle all the time, so I can't see why the monitors would be getting stuck.
>>>>
>>>> Grab 'ceph daemon mon.<ID> mon_status' from all monitors, set 'debug mon = 10', 'debug paxos = 10', 'debug ms = 1', and grab logs from the monitors for the periods before, during and after election (ideally for at least two election cycles). Put them up somewhere on the interwebs and send us a link.
>>>>
>>>> If you don't feel comfortable putting that link in the list, send me an email directly with the url.
>>>>
>>>> I'll be happy to take a look later tonight.
>>>>
>>>> -Joao
>>>>
>>>>> On Mon, Jul 25, 2016 at 5:18 PM, Joao Eduardo Luis <j...@suse.de> wrote:
>>>>>
>>>>> On 07/25/2016 04:34 PM, Sergio A. de Carvalho Jr. wrote:
>>>>>
>>>>> Thanks, Joao.
>>>>>
>>>>> All monitors have the exact same monmap.
>>>>>
>>>>> I suspect you're right that there might be some communication problem, though. I stopped monitor 1 (60zxl02), but the other 3 monitors still failed to reach a quorum. I could see monitor 0 was still declaring victory, but the others were always calling for new elections:
>>>>>
>>>>> 2016-07-25 15:18:59.775144 7f8760af7700  0 log_channel(cluster) log [INF] : mon.60z0m02@0 won leader election with quorum 0,2,4
>>>>>
>>>>> 2016-07-25 15:18:54.702176 7fc1b357d700  1 mon.610wl02@2(electing) e5 handle_timecheck drop unexpected msg
>>>>> 2016-07-25 15:18:54.704526 7fc1b357d700  1 mon.610wl02@2(electing).data_health(11626) service_dispatch not in quorum -- drop message
>>>>> 2016-07-25 15:19:09.792511 7fc1b3f7e700  1 mon.610wl02@2(peon).paxos(paxos recovering c 1318755..1319322) lease_timeout -- calling new election
>>>>> 2016-07-25 15:19:09.792825 7fc1b357d700  0 log_channel(cluster) log [INF] : mon.610wl02 calling new monitor election
>>>>>
>>>>> I'm curious about the "handle_timecheck drop unexpected msg" message.
>>>>>
>>>>> Timechecks (i.e., checking for clock skew), as well as the data_health service (which makes sure you have enough disk space in the mon data dir), are only run when you have a quorum. If a message is received by a monitor not in a quorum, regardless of state, it will be dropped.
>>>>>
>>>>> Assuming you now took one of the self-appointed leaders out (by shutting it down, for instance), you should check what's causing elections not to hold.
>>>>>
>>>>> In these cases, assuming your 3 monitors do form a quorum, the traditional issue tends to be lease timeouts. I.e., the leader fails to provide a lease extension on paxos for the peons, and the peons assume the leader failed in some form (unresponsive, down, whatever).
>>>>>
>>>>> Above, it does seem a lease timeout was triggered on a peon. This may have happened because:
>>>>>
>>>>> 1. the leader did not extend the lease;
>>>>> 2. the leader did extend the lease, but the lease was in the past - usually an indication of a clock skew on the leader, on the peons, or both;
>>>>> 3. the leader did extend the lease with the correct time, but the peon failed to dispatch the message on time.
>>>>>
>>>>> Both 1. and 2. may be due to several factors, but most commonly it's because the monitor was stuck doing something. This something, more often than not, is leveldb. If this is the case, check the size of your leveldb. If it is over 5 or 6GB in size, you may need to manually compact the store (mon compact on start = true, iirc).
>>>>>
>>>>> HTH
>>>>>
>>>>> -Joao
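(For future readers, checking the store size is straightforward. A sketch, assuming the default mon data path and cluster name - verify both against your ceph.conf before relying on it:)

    # Size of each monitor's leveldb store (default layout;
    # adjust the cluster name and mon ID to your setup):
    du -sh /var/lib/ceph/mon/ceph-<ID>/store.db

If it has grown to several GB, the store can be compacted at the next restart by adding the option Joao mentions to ceph.conf on the mon hosts and restarting the mons:

    [mon]
        mon compact on start = true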
>>>>> On Mon, Jul 25, 2016 at 4:10 PM, Joao Eduardo Luis <j...@suse.de> wrote:
>>>>>
>>>>> On 07/25/2016 03:41 PM, Sergio A. de Carvalho Jr. wrote:
>>>>>
>>>>> In the logs, 2 monitors are constantly reporting that they won the leader election:
>>>>>
>>>>> 60z0m02 (monitor 0):
>>>>> 2016-07-25 14:31:11.644335 7f8760af7700  0 log_channel(cluster) log [INF] : mon.60z0m02@0 won leader election with quorum 0,2,4
>>>>> 2016-07-25 14:31:44.521552 7f8760af7700  1 mon.60z0m02@0(leader).paxos(paxos recovering c 1318755..1319320) collect timeout, calling fresh election
>>>>>
>>>>> 60zxl02 (monitor 1):
>>>>> 2016-07-25 14:31:59.542346 7fefdeaed700  1 mon.60zxl02@1(electing).elector(11441) init, last seen epoch 11441
>>>>> 2016-07-25 14:32:04.583929 7fefdf4ee700  0 log_channel(cluster) log [INF] : mon.60zxl02@1 won leader election with quorum 1,2,4
>>>>> 2016-07-25 14:32:33.440103 7fefdf4ee700  1 mon.60zxl02@1(leader).paxos(paxos recovering c 1318755..1319319) collect timeout, calling fresh election
>>>>>
>>>>> There are two likely scenarios to explain this:
>>>>>
>>>>> 1. The monitors have different monitors in their monmaps - this could happen if you didn't add the new monitor via 'ceph mon add'. You can check this by running 'ceph daemon mon.<ID> mon_status' for each of the monitors in the cluster.
>>>>>
>>>>> 2. Some of the monitors are unable to communicate with each other, and thus will never acknowledge the same leader. This does not mean you have two leaders for the same cluster, but it does mean that you will end up having two monitors declaring victory and becoming the self-proclaimed leader in the cluster. The peons should still only belong to one quorum.
>>>>>
>>>>> If this does not help you, try setting 'debug mon = 10' and 'debug ms = 1' on the monitors and check the logs, making sure the monitors get the probes and follow the election process. If you need further assistance, put those logs online somewhere we can access them and we'll try to help you out.
>>>>>
>>>>> -Joao
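(Since the monitors are out of quorum, the local admin socket is the dependable way to follow this advice; cluster-wide commands may hang. A sketch - if 'ceph daemon' isn't available on your version, the equivalent is 'ceph --admin-daemon /var/run/ceph/ceph-mon.<ID>.asok ...':)

    # On each mon host, grab the monitor's view of the world:
    ceph daemon mon.<ID> mon_status > mon_status.<ID>.json

    # Raise debug levels on the running daemon without a restart:
    ceph daemon mon.<ID> config set debug_mon 10
    ceph daemon mon.<ID> config set debug_paxos 10
    ceph daemon mon.<ID> config set debug_ms 1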
>>>>> On Mon, Jul 25, 2016 at 3:27 PM, Sergio A. de Carvalho Jr. <scarvalh...@gmail.com> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I have a cluster of 5 hosts running Ceph 0.94.6 on CentOS 6.5. On each host, there is 1 monitor and 13 OSDs. We had an issue with the network, and for some reason (which I still don't understand) the servers were restarted. One host is still down, but the monitors on the 4 remaining servers are failing to enter a quorum.
>>>>>
>>>>> I managed to get a quorum of 3 monitors by stopping all Ceph monitors and OSDs across all machines, and bringing up the top 3 ranked monitors in order of rank. After a few minutes, the 60z0m02 monitor (the top ranked one) became the leader:
>>>>>
>>>>> {
>>>>>     "name": "60z0m02",
>>>>>     "rank": 0,
>>>>>     "state": "leader",
>>>>>     "election_epoch": 11328,
>>>>>     "quorum": [ 0, 1, 2 ],
>>>>>     "outside_quorum": [],
>>>>>     "extra_probe_peers": [],
>>>>>     "sync_provider": [],
>>>>>     "monmap": {
>>>>>         "epoch": 5,
>>>>>         "fsid": "2f51a247-3155-4bcf-9aee-c6f6b2c5e2af",
>>>>>         "modified": "2016-04-28 22:26:48.604393",
>>>>>         "created": "0.000000",
>>>>>         "mons": [
>>>>>             { "rank": 0, "name": "60z0m02", "addr": "10.98.2.166:6789/0" },
>>>>>             { "rank": 1, "name": "60zxl02", "addr": "10.98.2.167:6789/0" },
>>>>>             { "rank": 2, "name": "610wl02", "addr": "10.98.2.173:6789/0" },
>>>>>             { "rank": 3, "name": "618yl02", "addr": "10.98.2.214:6789/0" },
>>>>>             { "rank": 4, "name": "615yl02", "addr": "10.98.2.216:6789/0" }
>>>>>         ]
>>>>>     }
>>>>> }
>>>>>
>>>>> The other 2 monitors became peons:
>>>>>
>>>>> "name": "60zxl02",
>>>>> "rank": 1,
>>>>> "state": "peon",
>>>>> "election_epoch": 11328,
>>>>> "quorum": [ 0, 1, 2 ],
>>>>>
>>>>> "name": "610wl02",
>>>>> "rank": 2,
>>>>> "state": "peon",
>>>>> "election_epoch": 11328,
>>>>> "quorum": [ 0, 1, 2 ],
>>>>>
>>>>> I then proceeded to start the fourth monitor, 615yl02 (618yl02 is powered off), but after more than 2 hours and several election rounds, the monitors still haven't reached a quorum. The monitors mostly alternate between the "electing" and "probing" states, but they often seem to be in different election epochs.
>>>>>
>>>>> Is this normal?
>>>>>
>>>>> Is there anything I can do to help the monitors elect a leader? Should I manually remove the dead host's monitor from the monitor map?
>>>>>
>>>>> I left all OSD daemons stopped on purpose while the election is going on. Is this the best thing to do? Would bringing the OSDs up help or complicate matters even more? Or doesn't it make any difference?
>>>>> I don't see anything obviously wrong in the monitor logs. They're mostly filled with messages like the following:
>>>>>
>>>>> 2016-07-25 14:17:57.806148 7fc1b3f7e700  1 mon.610wl02@2(electing).elector(11411) init, last seen epoch 11411
>>>>> 2016-07-25 14:17:57.829198 7fc1b7caf700  0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd='mon_status' args=[]: dispatch
>>>>> 2016-07-25 14:17:57.829200 7fc1b7caf700  0 log_channel(audit) do_log log to syslog
>>>>> 2016-07-25 14:17:57.829254 7fc1b7caf700  0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd=mon_status args=[]: finished
>>>>>
>>>>> Any help would be hugely appreciated.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Sergio
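(To close the loop on the monmap question above, for future readers: without a quorum, 'ceph mon remove' won't go through, but the map can be edited offline with the extract/edit/inject procedure. A rough sketch - init commands and paths vary by setup, and backing up each mon data directory first is strongly advised:)

    # On a stopped monitor, extract its current monmap:
    ceph-mon -i <ID> --extract-monmap /tmp/monmap

    # Inspect it, then drop the dead host's entry:
    monmaptool --print /tmp/monmap
    monmaptool --rm 618yl02 /tmp/monmap

    # Inject the edited map and start the mon again; repeat the
    # extract/inject for each surviving monitor:
    ceph-mon -i <ID> --inject-monmap /tmp/monmap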
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com