[ceph-users] Re: Quincy recovery load

2022-07-06 Thread Sridhar Seshasayee
Hi Jimmy, As you rightly pointed out, the OSD recovery priority settings no longer take effect because of the switch to mClock. By default, the "high_client_ops" profile is enabled, which prioritizes client ops over recovery ops. Recovery ops will take the longest time to complete with this profile and
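A minimal sketch of working with that profile, assuming the stock Quincy built-in profiles (high_client_ops, balanced, high_recovery_ops) are in use:

# show the active mClock profile (default: high_client_ops)
ceph config get osd osd_mclock_profile
# temporarily favour recovery/backfill over client I/O
ceph config set osd osd_mclock_profile high_recovery_ops
# revert once recovery has caught up
ceph config set osd osd_mclock_profile high_client_ops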

[ceph-users] Re: Quincy recovery load

2022-07-06 Thread Anthony D'Atri
Do you mean load average as reported by `top` or `uptime`? That figure can be misleading on multi-core systems. What CPU are you using? For context, when I ran systems with 32C/64T and 24x SATA SSD, the load average could easily hit 40-60 without anything being wrong. What CPU percentages in
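Load average counts runnable (and uninterruptible) tasks rather than CPU saturation, so it has to be read against the core count. A quick way to compare the two and see actual per-core usage, assuming the sysstat package is installed for mpstat:

# 1-, 5- and 15-minute load averages vs. number of logical cores
uptime
nproc
# per-core utilisation refreshed every second (pressing '1' inside top gives a similar view)
mpstat -P ALL 1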

[ceph-users] Re: Quincy recovery load

2022-07-06 Thread Jimmy Spets
Thanks for your reply. What I meant by high load was the load reported by the top command; all the servers have a load average over 10. I added one more node to add more space. This is what I get from ceph status: cluster: id: health: HEALTH_WARN 2 failed cephadm daemon(s
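To see which daemons are behind the "failed cephadm daemon(s)" warning, the usual starting point on a cephadm-managed cluster (assumed here) is:

# expand the health warning into per-daemon detail
ceph health detail
# list all managed daemons and their current state
ceph orch ps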

[ceph-users] Re: Quincy recovery load

2022-07-06 Thread Jimmy Spets
> Do you mean load average as reported by `top` or `uptime`? Yes. > That figure can be misleading on multi-core systems. What CPU are you using? It's a 4c/4t low-end CPU. /Jimmy On Wed, Jul 6, 2022 at 4:52 PM Anthony D'Atri wrote: > Do you mean load average as reported by `top` or `uptime`? > >

[ceph-users] Re: Quincy recovery load

2022-07-11 Thread Chris Palmer
I'm seeing a similar problem on a small cluster just upgraded from Pacific 16.2.9 to Quincy 17.2.1 (non-cephadm). The cluster was only very lightly loaded during and after the upgrade. The affected OSDs are all BlueStore, HDDs sharing an NVMe device for DB/WAL, and all created on Pacific (I think). The upgra

[ceph-users] Re: Quincy recovery load

2022-07-11 Thread Chris Palmer
Correction - it is the Acting OSDs that are consuming CPU, not the UP ones. On 11/07/2022 16:17, Chris Palmer wrote: I'm seeing a similar problem on a small cluster just upgraded from Pacific 16.2.9 to Quincy 17.2.1 (non-cephadm). The cluster was only very lightly loaded during and after the upg

[ceph-users] Re: Quincy recovery load

2022-07-12 Thread Chris Palmer
I've created tracker https://tracker.ceph.com/issues/56530 for this, including info on replicating it on another cluster. On 11/07/2022 17:41, Chris Palmer wrote: Correction - it is the Acting OSDs that are consuming CPU, not the UP ones On 11/07/2022 16:17, Chris Palmer wrote: I'm seeing a s

[ceph-users] Re: Quincy recovery load

2022-07-12 Thread Sridhar Seshasayee
Hi Chris, While we look into this, I have a couple of questions: 1. Did the recovery rate stay at 1 object/sec throughout? In our tests we have seen that the rate is higher during the starting phase of recovery and eventually tapers off due to throttling by mclock. 2. Can you try speedin

[ceph-users] Re: Quincy recovery load

2022-07-19 Thread Daniel Williams
I also never had problems with backfill/rebalance/recovery, but I'm now seeing runaway CPU usage even with very conservative recovery settings after upgrading from Pacific to Quincy: osd_recovery_sleep_hdd = 0.1, osd_max_backfills = 1, osd_recovery_max_active = 1, osd_recovery_delay_start = 600. Tried: os
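For reference, settings like those above are normally applied centrally with ceph config; a minimal sketch using the quoted values, applied to all OSDs, would be:

ceph config set osd osd_recovery_sleep_hdd 0.1
ceph config set osd osd_max_backfills 1
ceph config set osd osd_recovery_max_active 1
ceph config set osd osd_recovery_delay_start 600

Note that with the mClock scheduler active, Quincy overrides several of the sleep/backfill limits internally, which is part of why such conservative values may not behave as they did on Pacific.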

[ceph-users] Re: Quincy recovery load

2022-07-19 Thread Daniel Williams
Just in case people don't know, changing osd_op_queue = "wpq" requires an OSD restart. And further to my theory about a spin lock or similar: increasing my recovery by 4-16x using wpq sees my CPU rise to 10-15% (from 3%)... but using mclock, even at very, very conservative recovery settings, sees a median
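To make the restart requirement concrete, a sketch of switching back to wpq; the systemctl line assumes a non-cephadm deployment and osd.12 is only an example id, while cephadm-managed clusters restart daemons via the orchestrator instead:

# the op queue scheduler is only read at OSD startup, so a restart is required
ceph config set osd osd_op_queue wpq
systemctl restart ceph-osd@12    # non-cephadm hosts, one OSD at a time
# or, on cephadm clusters (service name may differ):
ceph orch restart osd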

[ceph-users] Re: Quincy recovery load

2022-07-19 Thread Sridhar Seshasayee
Hi Daniel, And further to my theory about the spin lock or similar, increasing my > recovery by 4-16x using wpq sees my cpu rise to 10-15% ( from 3% )... > but using mclock, even at very very conservative recovery settings sees a > median CPU usage of some multiple of 100% (eg. a multiple of a ma

[ceph-users] Re: Quincy recovery load

2022-07-19 Thread Daniel Williams
Do you think maybe you should issue an immediate change/patch/update to Quincy to change the default to wpq, given the cluster-ending nature of the problem? On Wed, Jul 20, 2022 at 4:01 AM Sridhar Seshasayee wrote: > Hi Daniel, > > > And further to my theory about the spin lock or similar, incre

[ceph-users] Re: Quincy recovery load

2022-07-21 Thread Sridhar Seshasayee
On Wed, Jul 20, 2022 at 4:03 AM Daniel Williams wrote: > Do you think maybe you should issue an immediate change/patch/update to > quincy to change the default to wpq? Given the cluster ending nature of the > problem? > > Hi Daniel / All, The issue was root caused and the fix is currently made i

[ceph-users] Re: Quincy recovery load

2022-07-21 Thread Sridhar Seshasayee
I forgot to mention that the charts show CPU utilization when both client ops and recoveries are going on. The steep drop in CPU utilization is when client ops are stopped but recoveries are still going on.

[ceph-users] Re: Quincy recovery load

2022-07-21 Thread Sridhar Seshasayee
On Fri, Jul 22, 2022 at 12:47 AM Sridhar Seshasayee wrote: > I forgot to mention that the charts show CPU utilization when both client > ops and recoveries are going on. The steep drop in CPU utilization is when > client ops are stopped but recoveries are still going on. > It looks like the char

[ceph-users] Re: Quincy recovery load

2022-07-25 Thread Satoru Takeuchi
I'm trying to upgrade my Pacific cluster to Quincy and found this thread. Let me confirm a few things. - Does this problem not exist in Pacific and older versions? - Does this problem happen only if `osd_op_queue=mclock_scheduler`? - Do all parameters written in the OPERATIONS section not work if

[ceph-users] Re: Quincy recovery load

2022-07-25 Thread Sridhar Seshasayee
On Mon, Jul 25, 2022 at 2:05 PM Satoru Takeuchi wrote: > > - Does this problem not exist in Pacific and older versions? > This problem does not exist in Pacific and prior versions. On Pacific, the default osd_op_queue is set to 'wpq' and so this issue is not observed. - Does this problem happen
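To confirm which scheduler a cluster's OSDs are actually running, a quick check (osd.0 is just an example daemon) is:

# centrally configured / default value (mclock_scheduler on Quincy, wpq on Pacific)
ceph config get osd osd_op_queue
# value the running daemon is actually using
ceph tell osd.0 config get osd_op_queue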

[ceph-users] Re: Quincy recovery load

2022-07-25 Thread Satoru Takeuchi
On Mon, Jul 25, 2022 at 18:45, Sridhar Seshasayee wrote: > > > On Mon, Jul 25, 2022 at 2:05 PM Satoru Takeuchi > wrote: > >> >> - Does this problem not exist in Pacific and older versions? >> > This problem does not exist in Pacific and prior versions. On Pacific, the > default osd_op_queue > is set to 'wpq' an