[ceph-users] Monitor persistently out-of-quorum

2020-10-28 Thread Ki Wong
Hello,

I am at my wit's end.

So I made a mistake in the configuration of my router, and one
of the monitors (out of 3) dropped out of the quorum. Nothing
I’ve done allows it to rejoin, including reinstalling the
monitor with ceph-ansible.

The connectivity issue is fixed. I’ve tested it using “nc” and
the host can connect to both ports 3300 and 6789 of the other
monitors. But the wayward monitor continues to stay out of quorum.
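
(The checks were along these lines, run from the wayward monitor
(mgmt03) against its peers; a sketch from memory, the exact
invocations may have differed:)

  nc -zv 10.0.1.1 3300   # mgmt01, msgr v2
  nc -zv 10.0.1.1 6789   # mgmt01, msgr v1
  nc -zv 10.1.1.1 3300   # mgmt02, msgr v2
  nc -zv 10.1.1.1 6789   # mgmt02, msgr v1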

What is wrong? I see a bunch of “EBUSY” errors in the log, with
the message:

  e1 handle_auth_request haven't formed initial quorum, EBUSY

How do I fix this? Any help would be greatly appreciated.

Many thanks,

-kc
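
(Debug verbosity on the problem monitor can be raised via its admin
socket; a sketch of one way to do it, with mon.mgmt03 taken from the
log below and the default admin socket assumed:)

  ceph daemon mon.mgmt03 config set debug_mon 1/10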


With debug_mon at 1/10, I got these log snippets:

2020-10-28 15:40:05.961 7fb79253a700  4 mon.mgmt03@2(probing) e1 probe_timeout 
0x564050353ec0
2020-10-28 15:40:05.961 7fb79253a700 10 mon.mgmt03@2(probing) e1 bootstrap
2020-10-28 15:40:05.961 7fb79253a700 10 mon.mgmt03@2(probing) e1 
sync_reset_requester
2020-10-28 15:40:05.961 7fb79253a700 10 mon.mgmt03@2(probing) e1 
unregister_cluster_logger - not registered
2020-10-28 15:40:05.961 7fb79253a700 10 mon.mgmt03@2(probing) e1 
cancel_probe_timeout (none scheduled)
2020-10-28 15:40:05.961 7fb79253a700 10 mon.mgmt03@2(probing) e1 monmap e1: 3 
mons at 
{mgmt01=[v2:10.0.1.1:3300/0,v1:10.0.1.1:6789/0],mgmt02=[v2:10.1.1.1:3300/0,v1:10.1.1.1:6789/0],mgmt03=[v2:10.2.1.1:3300/0,v1:10.2.1.1:6789/0]}
2020-10-28 15:40:05.961 7fb79253a700 10 mon.mgmt03@2(probing) e1 _reset
2020-10-28 15:40:05.961 7fb79253a700 10 mon.mgmt03@2(probing).auth v0 
_set_mon_num_rank num 0 rank 0
2020-10-28 15:40:05.961 7fb79253a700 10 mon.mgmt03@2(probing) e1 
cancel_probe_timeout (none scheduled)
2020-10-28 15:40:05.961 7fb79253a700 10 mon.mgmt03@2(probing) e1 
timecheck_finish
2020-10-28 15:40:05.961 7fb79253a700 10 mon.mgmt03@2(probing) e1 
scrub_event_cancel
2020-10-28 15:40:05.961 7fb79253a700 10 mon.mgmt03@2(probing) e1 scrub_reset
2020-10-28 15:40:05.961 7fb79253a700 10 mon.mgmt03@2(probing) e1 
cancel_probe_timeout (none scheduled)
2020-10-28 15:40:05.961 7fb79253a700 10 mon.mgmt03@2(probing) e1 
reset_probe_timeout 0x564050347ce0 after 2 seconds
2020-10-28 15:40:05.961 7fb79253a700 10 mon.mgmt03@2(probing) e1 probing other 
monitors
2020-10-28 15:40:07.961 7fb79253a700  4 mon.mgmt03@2(probing) e1 probe_timeout 
0x564050347ce0
2020-10-28 15:40:07.961 7fb79253a700 10 mon.mgmt03@2(probing) e1 bootstrap
2020-10-28 15:40:07.961 7fb79253a700 10 mon.mgmt03@2(probing) e1 
sync_reset_requester
2020-10-28 15:40:07.961 7fb79253a700 10 mon.mgmt03@2(probing) e1 
unregister_cluster_logger - not registered
2020-10-28 15:40:07.961 7fb79253a700 10 mon.mgmt03@2(probing) e1 
cancel_probe_timeout (none scheduled)
2020-10-28 15:40:07.961 7fb79253a700 10 mon.mgmt03@2(probing) e1 monmap e1: 3 
mons at 
{mgmt01=[v2:10.0.1.1:3300/0,v1:10.0.1.1:6789/0],mgmt02=[v2:10.1.1.1:3300/0,v1:10.1.1.1:6789/0],mgmt03=[v2:10.2.1.1:3300/0,v1:10.2.1.1:6789/0]}
2020-10-28 15:40:07.961 7fb79253a700 10 mon.mgmt03@2(probing) e1 _reset
2020-10-28 15:40:07.961 7fb79253a700 10 mon.mgmt03@2(probing).auth v0 
_set_mon_num_rank num 0 rank 0
2020-10-28 15:40:07.961 7fb79253a700 10 mon.mgmt03@2(probing) e1 
cancel_probe_timeout (none scheduled)
2020-10-28 15:40:07.961 7fb79253a700 10 mon.mgmt03@2(probing) e1 
timecheck_finish
2020-10-28 15:40:07.961 7fb79253a700 10 mon.mgmt03@2(probing) e1 
scrub_event_cancel
2020-10-28 15:40:07.961 7fb79253a700 10 mon.mgmt03@2(probing) e1 scrub_reset
2020-10-28 15:40:07.961 7fb79253a700 10 mon.mgmt03@2(probing) e1 
cancel_probe_timeout (none scheduled)
2020-10-28 15:40:07.961 7fb79253a700 10 mon.mgmt03@2(probing) e1 
reset_probe_timeout 0x564050360660 after 2 seconds
2020-10-28 15:40:07.961 7fb79253a700 10 mon.mgmt03@2(probing) e1 probing other 
monitors
2020-10-28 15:40:09.107 7fb79253a700 -1 mon.mgmt03@2(probing) e1 
get_health_metrics reporting 7 slow ops, oldest is log(1 entries from seq 1 at 
2020-10-27 23:03:41.586915)
2020-10-28 15:40:09.961 7fb79253a700  4 mon.mgmt03@2(probing) e1 probe_timeout 
0x564050360660
2020-10-28 15:40:09.961 7fb79253a700 10 mon.mgmt03@2(probing) e1 bootstrap
2020-10-28 15:40:09.961 7fb79253a700 10 mon.mgmt03@2(probing) e1 
sync_reset_requester
2020-10-28 15:40:09.961 7fb79253a700 10 mon.mgmt03@2(probing) e1 
unregister_cluster_logger - not registered
2020-10-28 15:40:09.961 7fb79253a700 10 mon.mgmt03@2(probing) e1 
cancel_probe_timeout (none scheduled)
2020-10-28 15:40:09.961 7fb79253a700 10 mon.mgmt03@2(probing) e1 monmap e1: 3 
mons at 
{mgmt01=[v2:10.0.1.1:3300/0,v1:10.0.1.1:6789/0],mgmt02=[v2:10.1.1.1:3300/0,v1:10.1.1.1:6789/0],mgmt03=[v2:10.2.1.1:3300/0,v1:10.2.1.1:6789/0]}
2020-10-28 15:40:09.961 7fb79253a700 10 mon.mgmt03@2(probing) e1 _reset
2020-10-28 15:40:09.961 7fb79253a700 10 mon.mgmt03@2(probing).auth v0 
_set_mon_num_rank num 0 rank 0
2020-10-28 15:40:09.961 7fb79253a700 10 mon.mgmt03@2(probing) e1 
cancel_probe_timeout (none scheduled)
2020-10-28 15:40:09.961 7fb79253a700 10 mon.mgmt03@2(probing) e1 
timecheck_finish

[ceph-users] Re: Monitor persistently out-of-quorum

2020-10-29 Thread Ki Wong
Thanks, David.

I just double checked and they can all connect to one another,
on both v1 and v2 ports.
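
(For the full mesh, a loop like the following can be run on each
monitor host; a sketch, assuming the mgmt01/02/03 hostnames resolve
to the monitor addresses:)

  for h in mgmt01 mgmt02 mgmt03; do
    for p in 3300 6789; do
      nc -zv -w 3 "$h" "$p"
    done
  done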

-kc

> On Oct 29, 2020, at 12:41 AM, David Caro wrote:
> 
> On 10/28 17:26, Ki Wong wrote:
>> Hello,
>> 
>> I am at my wit's end.
>> 
>> So I made a mistake in the configuration of my router, and one
>> of the monitors (out of 3) dropped out of the quorum. Nothing
>> I’ve done allows it to rejoin, including reinstalling the
>> monitor with ceph-ansible.
>> 
>> The connectivity issue is fixed. I’ve tested it using “nc” and
>> the host can connect to both ports 3300 and 6789 of the other
>> monitors. But the wayward monitor continues to stay out of quorum.
> 
> Just to make sure, have you tried from mon1->mon3, mon2->mon3, mon3->mon1 and
> mon3->mon2?
> 
>> 
>> What is wrong? I see a bunch of “EBUSY” errors in the log, with
>> the message:
>> 
>>  e1 handle_auth_request haven't formed initial quorum, EBUSY
>> 
>> How do I fix this? Any help would be greatly appreciated.
>> 
>> Many thanks,
>> 
>> -kc
>> 
>> 
> 
> -- 
> David Caro



[ceph-users] Re: Monitor persistently out-of-quorum

2020-11-02 Thread Ki Wong
Folks,

We’ve finally found the issue: an MTU mismatch on the switch side.
My colleague noticed that “tracepath” from the other monitors
to the problematic one did not complete, and I tracked it down
to an MTU mismatch (jumbo vs. not) on the switch end. After
fixing the mismatch, all is back to normal.
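
(Both the symptom and the fix can be confirmed with tracepath and with
pings that forbid fragmentation; a sketch, where the sizes assume
9000-byte jumbo vs. 1500-byte standard MTUs and 10.2.1.1 is the
problematic monitor:)

  tracepath -n 10.2.1.1          # reports the path MTU hop by hop
  ping -M do -s 8972 10.2.1.1    # 9000-byte frames; fails across a 1500-MTU hop
  ping -M do -s 1472 10.2.1.1    # standard-size frames; should always get through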

It turned out to be quite the head scratcher.

Thanks to all who’ve offered assistance.

-kc

> On Oct 29, 2020, at 2:17 AM, Stefan Kooman wrote:
> 
> On 2020-10-29 01:26, Ki Wong wrote:
>> Hello,
>> 
>> I am at my wit's end.
>> 
>> So I made a mistake in the configuration of my router, and one
>> of the monitors (out of 3) dropped out of the quorum. Nothing
>> I’ve done allows it to rejoin, including reinstalling the
>> monitor with ceph-ansible.
> 
> What Ceph version?
> What kernel version (on the monitors)?
> 
> 
> Just to check some things:
> 
> make sure the mon keyring on _all_ monitors is identical, that the permissions
> are correct (ceph can read the file), and that ceph can read/write the mon store.
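> 
> (A quick way to check that on each monitor host; a sketch, assuming the
> default mon data path:)
> 
>   md5sum /var/lib/ceph/mon/ceph-*/keyring   # hashes should match across mons
>   ls -l /var/lib/ceph/mon/ceph-*/keyring    # ownership/permissions readable by ceph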
> 
> Have you enabled msgr v1 and v2?
> Do you use DNS to detect the monitors [1]?
> 
> ceph daemon mon.$id mon_status <- what does this give on the
> out-of-quorum monitor?
> 
> See the troubleshooting documentation [2] for more information.
> 
> Gr. Stefan
> 
> [1]: https://docs.ceph.com/en/latest/rados/configuration/mon-lookup-dns/
> [2]:
> https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-mon/
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io