[ceph-users] Re: Is there any way to obtain the maximum number of node failure in ceph without data loss?

2021-07-25 Thread Jerry Lee
Hello Josh, I simulated the osd.14 failure with the following steps: 1. hot-unplug the disk 2. systemctl stop ceph-osd@14 3. ceph osd out 14 The CRUSH rule used to create the EC 8+3 pool is described below: # ceph osd crush rule dump erasure_hdd_mhosts { "rule_id": 8,
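
For reference, a minimal way to check how many failures an EC 8+3 pool can tolerate; the profile and pool names below are placeholders for the ones actually in use:

    # k, m and crush-failure-domain determine the tolerance: with k=8, m=3 and
    # a host failure domain, data survives the loss of up to 3 hosts
    ceph osd erasure-code-profile get <profile-name>
    # min_size controls when PGs stop serving I/O (default k+1 for EC pools)
    ceph osd pool get <pool-name> min_size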

[ceph-users] Re: octopus garbage collector makes slow ops

2021-07-25 Thread mahnoosh shahidi
Hi Igor, Thanks for your response. This problem happens on my OSDs with HDD disks. I set bluefs_buffered_io to true just for these OSDs, but it caused my bucket index disks (which are SSD) to produce slow ops. I also tried to set bluefs_buffered_io to true on the bucket index OSDs but they filled
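
One possible way to scope bluefs_buffered_io by device class in the config database instead of per OSD id; a sketch, assuming the OSDs carry the usual hdd/ssd device classes:

    # apply the option only to OSDs whose device class is hdd
    ceph config set osd/class:hdd bluefs_buffered_io true
    # leave the SSD-backed bucket index OSDs at the default
    ceph config set osd/class:ssd bluefs_buffered_io false
    # verify what an individual OSD ends up with
    ceph config get osd.<id> bluefs_buffered_io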

[ceph-users] Re: 1/3 mons down! mon do not rejoin

2021-07-25 Thread Ansgar Jazdzewski
On Sun, 25 Jul 2021 at 18:02, Dan van der Ster wrote: > > What do you have for the new global_id settings? Maybe set it to allow > insecure global_id auth and see if that allows the mon to join? auth_allow_insecure_global_id_reclaim is allowed, as we still have some VMs that have not been restarted #
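
For reference, the global_id setting can be checked and changed through the config database; a sketch:

    # check the current value on the mons
    ceph config get mon auth_allow_insecure_global_id_reclaim
    # allow insecure reclaim until all old clients/VMs have been restarted
    ceph config set mon auth_allow_insecure_global_id_reclaim true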

[ceph-users] Re: 1/3 mons down! mon do not rejoin

2021-07-25 Thread Dan van der Ster
What do you have for the new global_id settings? Maybe set it to allow insecure global_id auth and see if that allows the mon to join? > I can try to move the /var/lib/ceph/mon/ dir and recreate it!? I'm not sure it will help. Running the mon with --debug_ms=1 might give clues why it's stuck
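
A sketch of running the stuck mon in the foreground with message debugging, assuming the default cluster name and that the mon id is osd01:

    # stop the unit first, then run the mon in the foreground with debug_ms=1
    systemctl stop ceph-mon@osd01
    ceph-mon -f --cluster ceph --id osd01 --setuser ceph --setgroup ceph --debug_ms 1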

[ceph-users] Re: 1/3 mons down! mon do not rejoin

2021-07-25 Thread Ansgar Jazdzewski
On Sun, 25 Jul 2021 at 17:17, Dan van der Ster wrote: > > > raise the min version to nautilus > > Are you referring to the min osd version or the min client version? yes, sorry, that was not written clearly > I don't think the latter will help. > > Are you sure that mon.osd01 can reach those
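
Both values can be inspected read-only before changing anything on the ailing cluster; a sketch:

    # require_osd_release and require_min_compat_client both appear in the osd map
    ceph osd dump | grep -E 'require_osd_release|min_compat_client'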

[ceph-users] Re: 1/3 mons down! mon do not rejoin

2021-07-25 Thread Dan van der Ster
> raise the min version to nautilus Are you referring to the min osd version or the min client version? I don't think the latter will help. Are you sure that mon.osd01 can reach those other mons on ports 6789 and 3300? Do you have any notable custom ceph configurations on this cluster? .. Dan
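
A quick reachability check from osd01 toward the other mons; hostnames here are placeholders:

    # v2 and v1 messenger ports of a peer mon
    nc -zv <other-mon-host> 3300
    nc -zv <other-mon-host> 6789
    # confirm the local mon is actually listening
    ss -tlnp | grep ceph-mon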

[ceph-users] Re: 1/3 mons down! mon do not rejoin

2021-07-25 Thread Ansgar Jazdzewski
hi Dan, hi folks, I started the mon on osd01 in the foreground with debugging and basically got this loop! Maybe it would help to raise the min version to nautilus, but I'm afraid to run those commands on a cluster in the current state: mon.osd01@0(probing).auth v0 _set_mon_num_rank num 0 rank 0
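
The mon's own view can be queried over its admin socket even while it is out of quorum; a sketch, assuming the mon id is osd01:

    # state, election epoch and the monmap this mon believes in
    ceph daemon mon.osd01 mon_status
    # persistent/required mon features across the monmap
    ceph mon feature ls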

[ceph-users] Re: 1/3 mons down! mon do not rejoin

2021-07-25 Thread Dan van der Ster
With four mons total, only one can be down... mon.osd01 is already down, so you're at the limit. It's possible that whatever is preventing this mon from joining will also prevent the new mon from joining. I think you should: 1. Investigate why mon.osd01 isn't coming back into the
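
Quorum needs a strict majority: with 4 mons that is floor(4/2)+1 = 3, so at most one mon may be down at a time. The current quorum membership can be confirmed with:

    # lists quorum_names and the monmap as seen by the surviving mons
    ceph quorum_status --format json-pretty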

[ceph-users] 1/3 mons down! mon do not rejoin

2021-07-25 Thread Ansgar Jazdzewski
hi folks, I have a cluster running Ceph 14.2.22 on Ubuntu 18.04. Some hours ago one of the mons stopped working and the on-call team rebooted the node; now the mon is not joining the ceph-cluster. TCP ports of the mons are open and reachable! ceph health detail HEALTH_WARN 1/3 mons down,
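
Initial triage on the rebooted node might look like this; the unit name assumes the mon id is osd01:

    ceph health detail
    # did the mon service come back up after the reboot?
    systemctl status ceph-mon@osd01
    # recent log lines from the mon unit since the incident
    journalctl -u ceph-mon@osd01 --since "2 hours ago" | tail -n 100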