[ceph-users] Re: Cluster down after network outage

Stefan Kooman Wed, 12 Jul 2023 01:07:30 -0700

On 7/12/23 09:53, Frank Schilder wrote:

Hi all,


we had a network outage tonight (power loss) and restored the network in the 
morning. All OSDs were running during this period. After restoring the network, 
peering hell broke loose and the cluster is having a hard time coming back up. 
OSDs get marked down all the time and come back later. Peering never stops.

Below is the current status. I had all OSDs shown as up for a while, but many 
were not responsive. Are there any flags that help bring things up in a 
sequence that causes less overload on the system?


osd_recovery_delay_start

We have that set to 60 seconds, so the OSD first gets some time to peer before starting recovery. That might help in this case. Worth a shot. Maybe increase it to 5 minutes or more, just to get all OSDs stable before recovery starts going?
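
As a rough illustration of the suggestion above, here is a minimal Python sketch that builds the command for raising osd_recovery_delay_start to 5 minutes. It assumes a release with the centralized config database (`ceph config set`), and the helper names `delay_start_command` and `apply` are hypothetical, made up for this example. It only prints the command unless dry_run is turned off.

```python
import shlex
import subprocess


def delay_start_command(seconds, who="osd"):
    """Build argv for: ceph config set <who> osd_recovery_delay_start <seconds>."""
    if seconds < 0:
        raise ValueError("delay must be non-negative")
    return ["ceph", "config", "set", who, "osd_recovery_delay_start", str(seconds)]


def apply(cmd, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    print(shlex.join(cmd))
    if not dry_run:
        # Assumption: the `ceph` CLI is on PATH and has admin credentials.
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # 300 seconds = the "5 minutes" mentioned above; dry run by default.
    apply(delay_start_command(300))
```

Whether a running OSD picks up the new value immediately or only after a restart is something to check on your own version before relying on it.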


Good luck!

Gr. Stefan
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
