[ceph-users] Re: The snaptrim queue of PGs has not decreased for several days.

2024-09-04 Thread Eugen Block
Another update: Giovanna agreed to switch back to mclock_scheduler and adjust osd_snap_trim_cost to 400K. It looks very promising; after a few hours the snaptrim queue was processed. @Sridhar: thanks a lot for your valuable input! Quoting Eugen Block: Quick update: we decided to switch t
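For posterity, the two settings involved would be applied roughly like this (a sketch; in a Rook cluster the commands run inside the rook-ceph-tools pod as elsewhere in this thread, and an osd_op_queue change only takes effect after an OSD restart):

    # switch back to the mClock scheduler (takes effect after OSD restart)
    ceph config set osd osd_op_queue mclock_scheduler
    # raise the assumed per-op snaptrim cost to 400 KiB, as agreed above
    ceph config set osd osd_snap_trim_cost 400K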

[ceph-users] Re: The snaptrim queue of PGs has not decreased for several days.

2024-08-30 Thread Eugen Block
Quick update: we decided to switch to wpq to see if that would confirm our suspicion, and it did. After a few hours all PGs in the snaptrim queue had been processed. We haven't looked into the average object sizes yet; maybe we'll try that approach next week or so. If you have any other ide
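The scheduler switch itself is a single setting (a sketch; the value only takes effect after the OSDs are restarted):

    # fall back to the weighted priority queue scheduler
    ceph config set osd osd_op_queue wpq
    # then restart the OSDs, e.g. one by one, for the change to apply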

[ceph-users] Re: The snaptrim queue of PGs has not decreased for several days.

2024-08-26 Thread Eugen Block
Hi, as expected, the issue is not resolved and turned up again a couple of hours later. Here's the tracker issue: https://tracker.ceph.com/issues/67702 I also attached a log snippet from one OSD with debug_osd 10 to the tracker. Let me know if you need anything else, I'll stay in touch wi
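Raising the debug level on a single OSD can be done at runtime (a sketch; osd.12 is a placeholder ID, and 1/5 is the usual default to restore, which may differ on your cluster):

    # increase OSD debug logging on one daemon
    ceph tell osd.12 config set debug_osd 10
    # ...reproduce the issue and collect the log...
    # restore the default level afterwards
    ceph tell osd.12 config set debug_osd 1/5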

[ceph-users] Re: The snaptrim queue of PGs has not decreased for several days.

2024-08-23 Thread Sridhar Seshasayee
Hi Eugen, On Fri, Aug 23, 2024 at 1:37 PM Eugen Block wrote: > Hi again, > > I have a couple of questions about this. > What exactly happened to the PGs? They were queued for snaptrimming, > but we didn't see any progress. Let's assume the average object size > in that pool was around 2 MB (I do

[ceph-users] Re: The snaptrim queue of PGs has not decreased for several days.

2024-08-23 Thread Eugen Block
Hi again, I have a couple of questions about this. What exactly happened to the PGs? They were queued for snaptrimming, but we didn't see any progress. Let's assume the average object size in that pool was around 2 MB (I don't have the actual numbers). Does that mean if osd_snap_trim_cost (
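One rough way to get the average object size of a pool, for reasoning about the cost value (a sketch; the STORED and OBJECTS columns come from ceph df detail):

    # per-pool stored bytes and object counts
    ceph df detail
    # average object size ~= STORED / OBJECTS for the pool in question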

[ceph-users] Re: The snaptrim queue of PGs has not decreased for several days.

2024-08-22 Thread Eugen Block
Oh yeah, I think I stumbled upon that as well, but then it slipped my mind again. Thanks for pointing that out, I appreciate it! Quoting Sridhar Seshasayee: Hi Eugen, There was a PR (https://github.com/ceph/ceph/pull/55040) related to mClock and snaptrim that was backported and available

[ceph-users] Re: The snaptrim queue of PGs has not decreased for several days.

2024-08-22 Thread Sridhar Seshasayee
Hi Eugen, There was a PR (https://github.com/ceph/ceph/pull/55040) related to mClock and snaptrim that was backported and is available from v18.2.4. The fix more accurately determines the cost (instead of the priority used with wpq) of a snaptrim operation depending on the average size of the objects in the PG.

[ceph-users] Re: The snaptrim queue of PGs has not decreased for several days.

2024-08-22 Thread Eugen Block
I know, I know, but since the rest seemed to work well I didn't want to change it yet but rather analyze what else was going on. And since we found a few things, it was worth it. :-) Quoting Joachim Kraftmayer: Hi Eugen, the first thing that came to my mind was to replace mclock with wpq. Joachi

[ceph-users] Re: The snaptrim queue of PGs has not decreased for several days.

2024-08-22 Thread Joachim Kraftmayer
Hi Eugen, the first thing that came to my mind was to replace mclock with wpq. Joachim Eugen Block wrote on Thu, 22 Aug 2024, 14:31: > Just a quick update on this topic. I assisted Giovanna directly off > list. For now the issue seems resolved, although I don't think we > really fixed anything but r

[ceph-users] Re: The snaptrim queue of PGs has not decreased for several days.

2024-08-22 Thread Eugen Block
Just a quick update on this topic. I assisted Giovanna directly off list. For now the issue seems resolved, although I don't think we really fixed anything but rather got rid of the current symptoms. A couple of findings for posterity: - There's a k8s pod creating new snap-schedules every co

[ceph-users] Re: The snaptrim queue of PGs has not decreased for several days.

2024-08-21 Thread Giovanna Ratini
Hello Eugen, Hi (please don't drop the ML from your responses), Sorry, I didn't pay attention. I will. All PGs of the cephfs pool are affected and they are on all OSDs then just pick a random one and check if anything stands out. I'm not sure if you mentioned it already, did you also try rest

[ceph-users] Re: The snaptrim queue of PGs has not decreased for several days.

2024-08-20 Thread Eugen Block
Hi (please don't drop the ML from your responses), All PGs of the cephfs pool are affected and they are on all OSDs then just pick a random one and check if anything stands out. I'm not sure if you mentioned it already, did you also try restarting OSDs? Oh, not yesterday. I'll do it now, then I

[ceph-users] Re: The snaptrim queue of PGs has not decreased for several days.

2024-08-20 Thread Giovanna Ratini
PS, here is the command: [rook@rook-ceph-tools-5459f7cb5b-p55np /]$ ceph pg dump | grep snaptrim | grep -v 'snaptrim_wait' | awk '{print $18}' | sort | uniq dumped all 0 10 11 2 8 9 On 20.08.2024 at 10:25, MARTEL Arnaud wrote: I had this problem once in the past and found that it was related to

[ceph-users] Re: The snaptrim queue of PGs has not decreased for several days.

2024-08-20 Thread Giovanna Ratini
Hello Arnaud, I have all 6 OSDs in the list :-(. Thanks for the idea, maybe it could help other users. Regards, Giovanna On 20.08.2024 at 10:25, MARTEL Arnaud wrote: ceph pg dump | grep snaptrim | grep -v 'snaptrim_wait' -- Giovanna Ratini Mail:rat...@dbvis.inf.uni-konstanz.de Phone: +49 (0) 7

[ceph-users] Re: The snaptrim queue of PGs has not decreased for several days.

2024-08-20 Thread MARTEL Arnaud
I had this problem once in the past and found that it was related to a particular OSD. To identify it, I ran the command "ceph pg dump | grep snaptrim | grep -v 'snaptrim_wait'" and found that the OSD displayed in the "UP_PRIMARY" column was almost always the same. So I restarted this OSD and t
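Cleaned up, the check would look roughly like this (a sketch; the column position of UP_PRIMARY in ceph pg dump can vary between releases, so verify against the header line):

    # list PGs actively snaptrimming, together with their primary OSD
    ceph pg dump | grep snaptrim | grep -v snaptrim_wait
    # if one OSD dominates the UP_PRIMARY column, restarting it may help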

[ceph-users] Re: The snaptrim queue of PGs has not decreased for several days.

2024-08-20 Thread Eugen Block
Did you reduce the default values I mentioned? You could also look into the historic_ops of the primary OSD for one affected PG: ceph tell osd. dump_historic_ops_by_duration But I'm not sure if that can actually help here. There are plenty of places to look at; you could turn on debug logs o
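Spelled out with a placeholder ID (osd.3 here is hypothetical; substitute the primary OSD of an affected PG):

    # show the slowest recent ops on the primary OSD of an affected PG
    ceph tell osd.3 dump_historic_ops_by_duration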

[ceph-users] Re: The snaptrim queue of PGs has not decreased for several days.

2024-08-19 Thread Giovanna Ratini
Hello Eugen, yesterday, after stopping and restarting snaptrim, the queue decreased a little and then remained blocked. It didn't grow and didn't decrease. Is that good or bad? On 19.08.2024 at 15:43, Eugen Block wrote: There's a lengthy thread [0] where several approaches are proposed. The worst is a

[ceph-users] Re: The snaptrim queue of PGs has not decreased for several days.

2024-08-19 Thread Giovanna Ratini
Hello Eugen, root@kube-master02:~# k ceph config get osd osd_pg_max_concurrent_snap_trims Info: running 'ceph' command with args: [config get osd osd_pg_max_concurrent_snap_trims] 2 root@kube-master02:~# k ceph config get osd osd_max_trimming_pgs Info: running 'ceph' command with args: [config

[ceph-users] Re: The snaptrim queue of PGs has not decreased for several days.

2024-08-19 Thread Eugen Block
There's a lengthy thread [0] where several approaches are proposed. The worst is an OSD recreation, but that's the last resort, of course. What are the current values for these configs? ceph config get osd osd_pg_max_concurrent_snap_trims ceph config get osd osd_max_trimming_pgs Maybe decrea
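Decreasing them would look like this (a sketch; 1 is just an assumed value, the default for both options is 2):

    # reduce concurrent snaptrim work per OSD
    ceph config set osd osd_pg_max_concurrent_snap_trims 1
    ceph config set osd osd_max_trimming_pgs 1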

[ceph-users] Re: The snaptrim queue of PGs has not decreased for several days.

2024-08-19 Thread Giovanna Ratini
Hello Eugen, yes, the load is not too high for now. I stopped the snaptrim and this is the output now. No changes in the queue. root@kube-master02:~# k ceph -s Info: running 'ceph' command with args: [-s] cluster: id: 3a35629a-6129-4daf-9db6-36e0eda637c7 health: HEALTH_WARN n

[ceph-users] Re: The snaptrim queue of PGs has not decreased for several days.

2024-08-19 Thread Eugen Block
What happens when you disable snaptrimming entirely? ceph osd set nosnaptrim So the load on your cluster seems low, but are the OSDs heavily utilized? Have you checked iostat? Quoting Giovanna Ratini: Hello Eugen, *root@kube-master02:~# k ceph -s* Info: running 'ceph' command with arg
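The flag is toggled cluster-wide (a sketch; iostat is part of the sysstat package and runs on the OSD hosts, not via the ceph CLI):

    # pause all snaptrim activity
    ceph osd set nosnaptrim
    # ...observe queue and load...
    ceph osd unset nosnaptrim
    # on each OSD node, watch device utilization
    iostat -xmt 5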

[ceph-users] Re: The snaptrim queue of PGs has not decreased for several days.

2024-08-19 Thread Giovanna Ratini
Hello Eugen, *root@kube-master02:~# k ceph -s* Info: running 'ceph' command with args: [-s] cluster: id: 3a35629a-6129-4daf-9db6-36e0eda637c7 health: HEALTH_WARN 32 pgs not deep-scrubbed in time 32 pgs not scrubbed in time services: mon: 3 daemons, qu

[ceph-users] Re: The snaptrim queue of PGs has not decreased for several days.

2024-08-18 Thread Eugen Block
Can you share the current ceph status? Are the OSDs reporting anything suspicious? How is the disk utilization? Quoting Giovanna Ratini: More information: the snaptrim takes a lot of time but the objects_trimmed are "0": "objects_trimmed": 0, "snaptrim_duration": 500.5807601752, I

[ceph-users] Re: The snaptrim queue of PGs has not decreased for several days.

2024-08-17 Thread Giovanna Ratini
More information: the snaptrim takes a lot of time but the objects_trimmed are "0": "objects_trimmed": 0, "snaptrim_duration": 500.5807601752, That could explain why the queue is growing. On 17.08.2024 at 14:37, Giovanna Ratini wrote: Hello again, I checked the pgs dump. Snapshot
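Those counters can be pulled straight from a PG query (a sketch; PG 3.12 is the one quoted in this thread, and grep on the JSON output is just a quick filter):

    # inspect snaptrim progress for one PG
    ceph pg 3.12 query | grep -E 'snap_trimq_len|objects_trimmed|snaptrim_duration'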

[ceph-users] Re: The snaptrim queue of PGs has not decreased for several days.

2024-08-17 Thread Giovanna Ratini
Hello again, I checked the pgs dump. The snapshot queue keeps growing. Query for PG 3.12: { "snap_trimq": "[5b974~3b,5cc3a~1,5cc3c~1,5cc3e~1,5cc40~1,5cd83~1,5cd85~1,5cd87~1,5cd89~1,5cecc~1,5cece~4,5ced3~2,5cf72~1,5cf74~4,5cf79~a2,5d0b8~1,5d0bb~1,5d0bd~a5,5d1f9~2,5d204~a5,5d349~a7,5d48e~3,5d493~a4,5d5d7~a7,5

[ceph-users] Re: The snaptrim queue of PGs has not decreased for several days.

2024-08-17 Thread Giovanna Ratini
Hello Eugen, thank you for your answer. I restarted all the kube-ceph nodes one after the other. Nothing has changed. OK, I deactivated the snapshots: ceph fs snap-schedule deactivate / Is there a way to see how many snapshots will be deleted per hour? Regards, Gio On 17.08.2024 at 10:
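The snap-schedule module can also show what is configured before and after deactivating (a sketch; the path / matches the command above):

    # inspect existing schedules for the root path
    ceph fs snap-schedule list /
    ceph fs snap-schedule status /
    # then deactivate, as done above
    ceph fs snap-schedule deactivate /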

[ceph-users] Re: The snaptrim queue of PGs has not decreased for several days.

2024-08-17 Thread Eugen Block
Hi, have you tried failing the mgr? Sometimes the PG stats are not correct. You could also temporarily disable snapshots to see if things settle down. Quoting Giovanna Ratini: Hello all, We use Ceph (v18.2.2) and Rook (1.14.3) as the CSI for a Kubernetes environment. Last week, we h
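Failing over the active mgr is non-disruptive (a sketch; recent releases fail the currently active daemon when no name is given):

    # force a mgr failover so the PG stats get recomputed
    ceph mgr fail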