For Luminous, you should check the corresponding _ssd config values for
osd_recovery_sleep and osd_max_backfills.
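For example, you can query the effective values on a running OSD via the
admin socket (a minimal sketch; osd.0 is just a placeholder for one of your
OSDs, and the command must be run on that OSD's host):

    ceph daemon osd.0 config get osd_recovery_sleep_ssd
    ceph daemon osd.0 config get osd_max_backfills

If you need to adjust them at runtime, injectargs works on Luminous (the
value below is only an illustration, not a recommendation):

    ceph tell osd.* injectargs '--osd_recovery_sleep_ssd 0.1'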

However, I don't think you should see a problem with the defaults on
Luminous. In fact, I had good experience with making recovery even more
aggressive than the defaults. You might want to look through the logs to see
if there are other problems, for example, peering taking very long or other
OSDs being marked down temporarily (the classic "a monitor marked me down
but I'm still running"). That could be a sign of network or CPU bottlenecks.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: huxia...@horebdata.cn <huxia...@horebdata.cn>
Sent: 25 August 2021 21:46:57
To: ceph-users
Subject: [ceph-users] How to slow down PG recovery when a failed OSD node comes back?

Dear Cephers,

I have an all-flash 3-node Ceph cluster, each node with 8 SSDs as OSDs,
running Ceph release 12.2.13. I have the following settings:
    osd_op_queue = wpq
    osd_op_queue_cut_off = high
and
    osd_recovery_sleep = 0.5
    osd_min_pg_log_entries = 3000
    osd_max_pg_log_entries = 10000
    osd_max_backfills = 1

The problem I encountered is the following: after a failed OSD node comes
back and re-joins, there is a 3-5 minute period during which the recovery
workload overwhelms the system, making user IO almost stall. After these 3-5
minutes, the recovery process seems to calm down and slows to a reasonable
level, giving priority to the user IO workload.

What happens during those crazy 3-5 minutes, and how can I reduce the
negative impact?

Any suggestions and comments are highly appreciated.

Best regards,

Samuel



huxia...@horebdata.cn
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io