We have had multiple clusters hit the following situation over the past few
months, on both 14.2.6 and 14.2.11. In a few instances it seemed random; in a
second case we had a temporary networking disruption; in a third we
accidentally made some OSD changes that pushed certain OSDs past the hard
pgs-per-OSD limit, leaving those PGs stuck inactive. Regardless of the
scenario, one monitor always falls out of quorum, seemingly can't rejoin, and
its logs contain the following:

2020-11-18 05:34:54.113 7f14286d9700  1 mon.a2plcephmon01@2(probing) e30
handle_auth_request failed to assign global_id
2020-11-18 05:34:54.295 7f14286d9700 -1 mon.a2plcephmon01@2(probing) e30
handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
2020-11-18 05:34:55.397 7f14286d9700  1 mon.a2plcephmon01@2(probing) e30
handle_auth_request failed to assign global_id


Ultimately, we take the quick route of rebuilding the mon: we wipe its DB,
re-run mkfs, and rejoin it to the quorum with:

sudo -u ceph ceph-mon -i a2plcephmon01 --public-addr 10.1.1.1:3300
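
For completeness, the steps we run before that start command look roughly
like this (assuming the default /var/lib/ceph/mon data dir; the hostname and
temp paths are just from our environment, so treat it as a sketch rather than
a recipe):

# stop the broken mon and set its old store aside
sudo systemctl stop ceph-mon@a2plcephmon01
sudo mv /var/lib/ceph/mon/ceph-a2plcephmon01 /var/lib/ceph/mon/ceph-a2plcephmon01.bak

# grab the current monmap and mon keyring from the surviving quorum
ceph mon getmap -o /tmp/monmap
ceph auth get mon. -o /tmp/mon.keyring

# rebuild the mon store
sudo -u ceph ceph-mon -i a2plcephmon01 --mkfs --monmap /tmp/monmap --keyring /tmp/mon.keyring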

Luckily we have never dropped below 50% of the quorum so far, but we are very
interested in preventing this from happening going forward. Inspecting
"ceph mon dump" on the affected clusters, I see that all of the rebuilt mons
use only msgr v2 on port 3300, while all of the mons that never required
rebuilding use both v1 and v2 addressing. So my questions are:

- Does this monitor failure sound familiar?
- Is the manner in which we are rebuilding mons problematic, given that it
leaves them with only msgr v2, or is v2-only fine on an all-Nautilus cluster?
(A rough sketch of how we would add v1 back follows below.)
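
On that second point: if keeping v1 matters, I assume we could inject both
addresses into the monmap before the mkfs, something along these lines
(untested on our side, so purely a sketch; the IP is the one from the example
above):

# replace the rebuilt mon's entry with an addrvec carrying both protocols
monmaptool --rm a2plcephmon01 /tmp/monmap
monmaptool --addv a2plcephmon01 '[v2:10.1.1.1:3300,v1:10.1.1.1:6789]' /tmp/monmap

and then start the mon without forcing --public-addr to the v2 port only.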

Thanks for any insight.


Respectfully,

*Wes Dillingham*
w...@wesdillingham.com
LinkedIn <http://www.linkedin.com/in/wesleydillingham>
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io