[ceph-users] Re: 1/3 mons down! mon do not rejoin

2021-07-26 Thread Ansgar Jazdzewski
Yes, the empty DB told me that at this point I had no other choice than to recreate the entire mon service.
* remove broken mon
    ceph mon remove $(hostname -s)
* mon preparation done
    rm -rf /var/lib/ceph/mon/ceph-$(hostname -s)
    mkdir /var/lib/ceph/mon/ceph-$(hostname -s)
    ceph auth get mon. -o
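
The snippet is cut off; a minimal sketch of the usual manual mon re-deployment flow it appears to follow (the hostname and the /tmp output paths are illustrative assumptions, not taken from the thread):

    # remove the broken mon from the monmap (run from a node that can reach the quorum)
    ceph mon remove osd01

    # wipe and re-create the mon data directory on osd01
    rm -rf /var/lib/ceph/mon/ceph-osd01
    mkdir /var/lib/ceph/mon/ceph-osd01

    # fetch the mon keyring and the current monmap from the running cluster
    ceph auth get mon. -o /tmp/mon.keyring
    ceph mon getmap -o /tmp/monmap

    # rebuild the mon store, fix ownership, and start the daemon again
    ceph-mon -i osd01 --mkfs --monmap /tmp/monmap --keyring /tmp/mon.keyring
    chown -R ceph:ceph /var/lib/ceph/mon/ceph-osd01
    systemctl start ceph-mon@osd01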

[ceph-users] Re: 1/3 mons down! mon do not rejoin

2021-07-26 Thread Dan van der Ster
Your log ends with
> 2021-07-25 06:46:52.078 7fe065f24700  1 mon.osd01@0(leader).osd e749666 do_prune osdmap full prune enabled
So mon.osd01 was still the leader at that time. When did it leave the cluster?
> I also found that the rocksdb on osd01 is only 1MB in size and 345MB on the other
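
The size discrepancy mentioned here can be checked directly on each mon host; a small sketch, assuming the default data directory of a package-based deployment:

    # run on every mon host; the mon's rocksdb lives in store.db under its data dir
    du -sh /var/lib/ceph/mon/ceph-$(hostname -s)/store.db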

[ceph-users] Re: 1/3 mons down! mon do not rejoin

2021-07-26 Thread Ansgar Jazdzewski
Hi Dan, hi folks, this is how things started. I also found that the rocksdb on osd01 is only 1MB in size, versus 345MB on the other mons!
2021-07-25 06:46:30.029 7fe061f1c700  0 log_channel(cluster) log [DBG] : monmap e1: 3 mons at {osd01=[v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0],osd02=[v2:10
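
One way to inspect what the suspiciously small store on osd01 still contains is to dump its monmap with the offline tools; a sketch, assuming the mon is stopped first and using an illustrative output path:

    systemctl stop ceph-mon@osd01
    ceph-monstore-tool /var/lib/ceph/mon/ceph-osd01 get monmap -- --out /tmp/monmap
    monmaptool --print /tmp/monmap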

[ceph-users] Re: 1/3 mons down! mon do not rejoin

2021-07-26 Thread Dan van der Ster
Hi, do you have ceph-mon logs from when mon.osd01 first failed, before the on-call team rebooted it? They might give a clue about what happened to start this problem, which may still be happening now. This looks similar, but it was eventually found to be a network issue: https://tracker.ceph.com/issues
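
A quick way to pull that log window on osd01; the unit name, log path, and time range are assumptions based on the defaults of a package-based deployment and the timestamps quoted in the thread:

    # via the systemd journal, limited to the window around the first failure
    journalctl -u ceph-mon@osd01 --since "2021-07-25 06:00" --until "2021-07-25 07:00"

    # or straight from the default ceph-mon log file
    less /var/log/ceph/ceph-mon.osd01.log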

[ceph-users] Re: 1/3 mons down! mon do not rejoin

2021-07-25 Thread Ansgar Jazdzewski
On Sun, 25 Jul 2021 at 18:02, Dan van der Ster wrote:
> What do you have for the new global_id settings? Maybe set it to allow insecure global_id auth and see if that allows the mon to join?
auth_allow_insecure_global_id_reclaim is allowed, as we still have some VMs that have not been restarted #
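
For reference, the global_id setting discussed here can be inspected and, if needed, relaxed like this; a sketch assuming a release that already ships the global_id reclaim changes (Nautilus 14.2.20 or later):

    # check the current value and any related health warnings
    ceph config get mon auth_allow_insecure_global_id_reclaim
    ceph health detail

    # temporarily allow insecure global_id reclaim while old clients are still connected
    ceph config set mon auth_allow_insecure_global_id_reclaim true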

[ceph-users] Re: 1/3 mons down! mon do not rejoin

2021-07-25 Thread Dan van der Ster
What do you have for the new global_id settings? Maybe set it to allow insecure global_id auth and see if that allows the mon to join?
> I can try to move the /var/lib/ceph/mon/ dir and recreate it!?
I'm not sure it will help. Running the mon with --debug_ms=1 might give clues about why it's stuck prob
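
One way to get that debug output is to stop the unit and run the mon in the foreground; the extra --debug_mon level is an optional assumption on top of what Dan suggests:

    systemctl stop ceph-mon@osd01
    ceph-mon -i osd01 -d --debug_ms 1 --debug_mon 10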

[ceph-users] Re: 1/3 mons down! mon do not rejoin

2021-07-25 Thread Ansgar Jazdzewski
On Sun, 25 Jul 2021 at 17:17, Dan van der Ster wrote:
> > raise the min version to nautilus
> Are you referring to the min osd version or the min client version?
Yes, sorry, that was not written clearly.
> I don't think the latter will help.
> Are you sure that mon.osd01 can reach those oth

[ceph-users] Re: 1/3 mons down! mon do not rejoin

2021-07-25 Thread Dan van der Ster
> raise the min version to nautilus
Are you referring to the min osd version or the min client version? I don't think the latter will help.
Are you sure that mon.osd01 can reach those other mons on ports 6789 and 3300? Do you have any notable custom ceph configurations on this cluster?
.. Dan
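
Both questions can be answered from osd01 itself; a sketch in which the peer mon addresses are placeholders, since they are truncated in the thread:

    # check v1 (6789) and v2 (3300) reachability from osd01 to each peer mon
    for ip in <mon2-ip> <mon3-ip>; do
        for port in 6789 3300; do
            nc -vz -w 3 "$ip" "$port"
        done
    done

    # list any non-default settings stored in the cluster's config database
    ceph config dump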

[ceph-users] Re: 1/3 mons down! mon do not rejoin

2021-07-25 Thread Ansgar Jazdzewski
Hi Dan, hi folks, I started mon.osd01 in the foreground with debugging and basically got this loop! Maybe it would help to raise the min version to nautilus, but I'm afraid to run those commands on a cluster in the current state.
mon.osd01@0(probing).auth v0 _set_mon_num_rank num 0 rank 0
mon.osd01@0
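
For reference, "raise the min version to nautilus" usually maps to one of the following two commands; which one was meant is not stated in the thread, so this is only a sketch of the options:

    # show what releases the daemons and clients are actually running first
    ceph versions

    # option 1: raise the minimum required OSD release
    ceph osd require-osd-release nautilus

    # option 2: raise the minimum client compatibility level
    ceph osd set-require-min-compat-client nautilus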

[ceph-users] Re: 1/3 mons down! mon do not rejoin

2021-07-25 Thread Dan van der Ster
With four mons in total, only one can be down... and with mon.osd01 already down, you're at the limit. It's possible that whatever reason is preventing this mon from joining will also prevent the new mon from joining. I think you should:
1. Investigate why mon.osd01 isn't coming back into the quorum.
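
A few standard commands for that first step; a sketch that only assumes the admin socket is in its default location on osd01:

    # current quorum membership as seen by the cluster
    ceph quorum_status --format json-pretty
    ceph mon stat

    # what the stuck mon itself thinks it is doing (run on osd01)
    ceph daemon mon.osd01 mon_status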