[ceph-users] Scrubbing?

2024-01-22 Thread Jan Marek
Hello,

last week our Ceph cluster reached HEALTH_OK, so I started
upgrading the firmware in the network cards.

When I had upgraded the sixth card of nine (one by one), that
server didn't start correctly and our Proxmox had problems
accessing disk images on Ceph.

rbd ls pool

was OK, but:

rbd ls pool -l

didn't work, and our virtual servers had trouble working with
their disks.

After I resolved the network problem on the OSD server,
everything returned to normal.

But I found that every OSD node had very high activity: 'iotop'
showed around 180 MB/s read and 20 MB/s write, even though the
cluster was in the HEALTH_OK state at the time. It turned out
there was massive scrubbing activity...

After a few days, the OSD nodes still show around 90 MB/s read
and 70 MB/s write, while 'ceph -s' reports client I/O of only
2.5 MB/s read and 50 MB/s write.

In the log file of our mon server I found many lines about
scrubs starting, but many of them announce a scrub of the same
PG. I've grepped syslog for some of them and attached the result
to this e-mail (see the excerpt below).

Is this activity normal? Why does Ceph start scrubbing the same
PG over and over again?

And another question: is scrubbing handled by the mClock scheduler?
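
For anyone who wants to check the same thing, a minimal sketch
(assuming a release that ships the mClock scheduler, i.e. Quincy
or newer; only standard 'ceph config' / 'ceph pg' commands are
used, and the PG id is the one from the log excerpt below):

ceph config get osd osd_op_queue        # "mclock_scheduler" here means mClock is active
ceph config get osd osd_mclock_profile  # e.g. balanced, high_client_ops, high_recovery_ops
ceph pg 1.15e query | grep -i scrub     # per-PG scrub state and last-scrub timestamps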

Many thanks for any explanation.

Sincerely
Jan Marek
-- 
Ing. Jan Marek
University of South Bohemia
Academic Computer Centre
Phone: +420389032080
http://www.gnu.org/philosophy/no-word-attachments.cs.html
Jan 22 08:50:38 mon1 ceph-mon[1649]: 1.15e deep-scrub starts
Jan 22 08:50:42 mon1 ceph-mon[1649]: 1.15e scrub starts
Jan 22 08:50:44 mon1 ceph-mon[1649]: 1.15e scrub starts
Jan 22 08:50:46 mon1 ceph-mon[1649]: 1.15e scrub starts
Jan 22 08:50:47 mon1 ceph-mon[1649]: 1.15e scrub starts
Jan 22 08:50:48 mon1 ceph-mon[1649]: 1.15e deep-scrub starts
Jan 22 08:50:57 mon1 ceph-mon[1649]: 1.15e deep-scrub starts
Jan 22 08:50:58 mon1 ceph-mon[1649]: 1.15e scrub starts
Jan 22 08:51:00 mon1 ceph-mon[1649]: 1.15e scrub starts
Jan 22 08:51:05 mon1 ceph-mon[1649]: 1.15e scrub starts
Jan 22 08:51:09 mon1 ceph-mon[1649]: 1.15e scrub starts
Jan 22 08:51:11 mon1 ceph-mon[1649]: 1.15e scrub starts
Jan 22 08:51:14 mon1 ceph-mon[1649]: 1.15e scrub starts
Jan 22 08:51:15 mon1 ceph-mon[1649]: 1.15e scrub starts
Jan 22 08:51:17 mon1 ceph-mon[1649]: 1.15e scrub starts
Jan 22 08:51:18 mon1 ceph-mon[1649]: 1.15e scrub starts
Jan 22 08:51:22 mon1 ceph-mon[1649]: 1.15e scrub starts
Jan 22 08:51:24 mon1 ceph-mon[1649]: 1.15e scrub starts
Jan 22 08:51:25 mon1 ceph-mon[1649]: 1.15e scrub starts
Jan 22 08:51:26 mon1 ceph-mon[1649]: 1.15e scrub starts
Jan 22 08:51:27 mon1 ceph-mon[1649]: 1.15e scrub starts
Jan 22 08:51:39 mon1 ceph-mon[1649]: 1.15e scrub starts
Jan 22 08:51:50 mon1 ceph-mon[1649]: 1.15e scrub starts
Jan 22 08:51:52 mon1 ceph-mon[1649]: 1.15e scrub starts
Jan 22 08:51:55 mon1 ceph-mon[1649]: 1.15e scrub starts
Jan 22 08:51:56 mon1 ceph-mon[1649]: 1.15e scrub starts
Jan 22 08:51:57 mon1 ceph-mon[1649]: 1.15e scrub starts
Jan 22 08:51:58 mon1 ceph-mon[1649]: 1.15e scrub starts
Jan 22 08:52:04 mon1 ceph-mon[1649]: 1.15e scrub starts
Jan 22 08:52:07 mon1 ceph-mon[1649]: 1.15e scrub starts
Jan 22 08:52:09 mon1 ceph-mon[1649]: 1.15e deep-scrub starts
Jan 22 08:52:11 mon1 ceph-mon[1649]: 1.15e scrub starts
Jan 22 08:52:13 mon1 ceph-mon[1649]: 1.15e scrub starts
Jan 22 08:52:14 mon1 ceph-mon[1649]: 1.15e scrub starts
Jan 22 08:52:16 mon1 ceph-mon[1649]: 1.15e scrub starts
Jan 22 08:52:19 mon1 ceph-mon[1649]: 1.15e scrub starts
Jan 22 08:52:22 mon1 ceph-mon[1649]: 1.15e scrub starts
Jan 22 08:52:25 mon1 ceph-mon[1649]: 1.15e scrub starts
Jan 22 08:52:26 mon1 ceph-mon[1649]: 1.15e scrub starts
Jan 22 08:52:27 mon1 ceph-mon[1649]: 1.15e scrub starts
Jan 22 08:52:33 mon1 ceph-mon[1649]: 1.15e scrub starts
Jan 22 08:52:37 mon1 ceph-mon[1649]: 1.15e scrub starts
Jan 22 08:52:41 mon1 ceph-mon[1649]: 1.15e scrub starts
Jan 22 08:52:42 mon1 ceph-mon[1649]: 1.15e scrub starts
Jan 22 08:52:43 mon1 ceph-mon[1649]: 1.15e scrub starts
Jan 22 08:52:49 mon1 ceph-mon[1649]: 1.15e scrub starts
Jan 22 08:52:50 mon1 ceph-mon[1649]: 1.15e scrub starts
Jan 22 08:52:52 mon1 ceph-mon[1649]: 1.15e scrub starts
Jan 22 08:52:54 mon1 ceph-mon[1649]: 1.15e scrub starts
Jan 22 08:52:55 mon1 ceph-mon[1649]: 1.15e scrub starts
Jan 22 08:52:58 mon1 ceph-mon[1649]: 1.15e scrub starts
Jan 22 08:53:10 mon1 ceph-mon[1649]: 1.15e scrub starts
Jan 22 08:53:18 mon1 ceph-mon[1649]: 1.15e deep-scrub starts
Jan 22 08:53:19 mon1 ceph-mon[1649]: 1.15e scrub starts
Jan 22 08:53:20 mon1 ceph-mon[1649]: 1.15e scrub starts
Jan 22 08:53:22 mon1 ceph-mon[1649]: 1.15e scrub starts
Jan 22 08:53:28 mon1 ceph-mon[1649]: 1.15e scrub starts
Jan 22 08:53:29 mon1 ceph-mon[1649]: 1.15e scrub starts
Jan 22 08:53:33 mon1 ceph-mon[1649]: 1.15e scrub starts
Jan 22 08:53:36 mon1 ceph-mon[1649]: 1.15e scrub starts
Jan 22 08:53:38 mon1 ceph-mon[1649]: 1.15e scrub starts
Jan 22 08:53:39 mon1 ceph-mon[1649]: 1.15e scrub starts
Jan 22 08:53:42 mon1 ceph-mon[1649]: 1.15e scrub starts
Jan 22 08:53:44 mon1

[ceph-users] Scrubbing

2022-03-09 Thread Ray Cunningham
Hi Everyone,

We have a 900 OSD cluster and our PG scrubs aren't keeping up. We are always 
behind and have tried to tweak some of the scrub config settings to allow a 
higher priority and faster scrubbing, but it doesn't seem to make any 
difference. Does anyone have any suggestions for increasing scrub throughput?
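
The settings usually mentioned in this context are the per-OSD scrub limits and 
the load/time restrictions. A sketch of how they can be inspected and raised 
(values are illustrative, not recommendations; the option names are standard 
Ceph config keys):

ceph config get osd osd_scrub_begin_hour       # are scrubs confined to a time window?
ceph config get osd osd_scrub_end_hour
ceph config get osd osd_scrub_sleep            # pause injected between scrub chunks
ceph config set osd osd_max_scrubs 2           # allow more concurrent scrubs per OSD
ceph config set osd osd_scrub_load_threshold 5 # let scrubs run under a higher load average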

Thank you,
Ray Cunningham

Systems Engineering and Services Manager
keepertechnology
(571) 223-7242




[ceph-users] Scrubbing - osd down

2020-12-11 Thread Miroslav Boháč
Hi,

I have a problem with crashing OSD daemons in our Ceph 15.2.6 cluster. The
problem was temporarily resolved by disabling scrub and deep-scrub; all PGs
are active+clean. After a few days I tried to enable scrubbing again, but
the problem persists: OSDs with high latencies, laggy PGs, OSDs not
responding, OSDs marked down, and the cluster is not usable. The problem
appeared after a failover in which 10 OSDs were marked out.

I would appreciate any help or advice on how to resolve this problem.
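
For reference, scrubbing was disabled with the usual cluster-wide flags; below
is a sketch of how I intend to re-enable it more gently (the osd_scrub_sleep
value is only an illustration, not a tested recommendation):

# scrubbing is currently disabled via the cluster-wide flags:
ceph osd set noscrub
ceph osd set nodeep-scrub

# before re-enabling, throttle scrub activity:
ceph config set osd osd_max_scrubs 1      # keep per-OSD scrub concurrency at the minimum
ceph config set osd osd_scrub_sleep 0.2   # add a pause between scrub chunks

# then re-enable step by step:
ceph osd unset noscrub
ceph osd unset nodeep-scrub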

Regards,
Miroslav





2020-12-10T21:39:57.721883+0100 osd.1 (osd.1) 5 : cluster [DBG] 18.11
deep-scrub starts

2020-12-10T21:39:57.880861+0100 osd.1 (osd.1) 6 : cluster [DBG] 18.11
deep-scrub ok

2020-12-10T21:39:58.713422+0100 osd.1 (osd.1) 7 : cluster [DBG] 43.5 scrub
starts

2020-12-10T21:39:58.719372+0100 osd.1 (osd.1) 8 : cluster [DBG] 43.5 scrub
ok

2020-12-10T21:39:59.296962+0100 mgr.pve2-prg2a (mgr.91575377) 118746 :
cluster [DBG] pgmap v119000: 1737 pgs: 3 active+clean+laggy, 3
active+clean+scrubbing+deep, 1731 active+clean; 3.3 TiB data, 7.8 TiB used,
21 TiB / 29 TiB avail; 117 KiB/s rd, 2.5 MiB/s wr, 269 op/s

2020-12-10T21:40:00.088421+0100 osd.29 (osd.29) 74 : cluster [DBG] 1.13b
deep-scrub starts

2020-12-10T21:40:01.300373+0100 mgr.pve2-prg2a (mgr.91575377) 118747 :
cluster [DBG] pgmap v119001: 1737 pgs: 3 active+clean+laggy, 3
active+clean+scrubbing+deep, 1731 active+clean; 3.3 TiB data, 7.8 TiB used,
21 TiB / 29 TiB avail; 101 KiB/s rd, 1.9 MiB/s wr, 202 op/s

2020-12-10T21:40:02.681058+0100 osd.34 (osd.34) 13 : cluster [DBG] 1.a2
deep-scrub ok

2020-12-10T21:40:03.304009+0100 mgr.pve2-prg2a (mgr.91575377) 118749 :
cluster [DBG] pgmap v119002: 1737 pgs: 3 active+clean+laggy, 3
active+clean+scrubbing+deep, 1731 active+clean; 3.3 TiB data, 7.8 TiB used,
21 TiB / 29 TiB avail; 101 KiB/s rd, 1.9 MiB/s wr, 198 op/s

2020-12-10T21:40:05.316233+0100 mgr.pve2-prg2a (mgr.91575377) 118750 :
cluster [DBG] pgmap v119003: 1737 pgs: 6 active+clean+laggy, 3
active+clean+scrubbing+deep, 1728 active+clean; 3.3 TiB data, 7.8 TiB used,
21 TiB / 29 TiB avail; 150 KiB/s rd, 3.0 MiB/s wr, 249 op/s

2020-12-10T21:40:07.319643+0100 mgr.pve2-prg2a (mgr.91575377) 118751 :
cluster [DBG] pgmap v119004: 1737 pgs: 6 active+clean+laggy, 3
active+clean+scrubbing+deep, 1728 active+clean; 3.3 TiB data, 7.8 TiB used,
21 TiB / 29 TiB avail; 142 KiB/s rd, 2.3 MiB/s wr, 212 op/s





2020-12-10T21:40:15.523134+0100 mon.pve1-prg2a (mon.0) 125943 : cluster
[DBG] osd.4 reported failed by osd.24

2020-12-10T21:40:15.523325+0100 mon.pve1-prg2a (mon.0) 125944 : cluster
[DBG] osd.39 reported failed by osd.24

2020-12-10T21:40:16.112299+0100 mon.pve1-prg2a (mon.0) 125946 : cluster
[WRN] Health check failed: 0 slow ops, oldest one blocked for 32 sec, osd.8
has slow ops (SLOW_OPS)

2020-12-10T21:40:16.202867+0100 mon.pve1-prg2a (mon.0) 125947 : cluster
[DBG] osd.4 reported failed by osd.34

2020-12-10T21:40:16.202986+0100 mon.pve1-prg2a (mon.0) 125948 : cluster
[INF] osd.4 failed (root=default,host=pve1-prg2a) (2 reporters from
different host after 24.000267 >= grace 22.361677)

2020-12-10T21:40:16.373925+0100 mon.pve1-prg2a (mon.0) 125949 : cluster
[DBG] osd.39 reported failed by osd.6

2020-12-10T21:40:16.865608+0100 mon.pve1-prg2a (mon.0) 125951 : cluster
[DBG] osd.39 reported failed by osd.8

2020-12-10T21:40:17.125917+0100 mon.pve1-prg2a (mon.0) 125952 : cluster
[WRN] Health check failed: 1 osds down (OSD_DOWN)

2020-12-10T21:40:17.139006+0100 mon.pve1-prg2a (mon.0) 125953 : cluster
[DBG] osdmap e12819: 40 total, 39 up, 30 in

2020-12-10T21:40:17.140248+0100 mon.pve1-prg2a (mon.0) 125954 : cluster
[DBG] osd.39 reported failed by osd.21

2020-12-10T21:40:17.344244+0100 mgr.pve2-prg2a (mgr.91575377) 118757 :
cluster [DBG] pgmap v119012: 1737 pgs: 9 peering, 61 stale+active+clean, 1
active+clean+scrubbing+deep, 7 active+clean+laggy, 1659 active+clean; 3.3
TiB data, 7.8 TiB used, 14 TiB / 22 TiB avail; 44 KiB/s rd, 2.5 MiB/s wr,
107 op/s

2020-12-10T21:40:17.378069+0100 mon.pve1-prg2a (mon.0) 125955 : cluster
[DBG] osd.39 reported failed by osd.26

2020-12-10T21:40:17.424429+0100 mon.pve1-prg2a (mon.0) 125956 : cluster
[DBG] osd.39 reported failed by osd.18

2020-12-10T21:40:17.829447+0100 mon.pve1-prg2a (mon.0) 125957 : cluster
[DBG] osd.39 reported failed by osd.36

2020-12-10T21:40:17.847373+0100 mon.pve1-prg2a (mon.0) 125958 : cluster
[DBG] osd.39 reported failed by osd.1

2020-12-10T21:40:17.858371+0100 mon.pve1-prg2a (mon.0) 125959 : cluster
[DBG] osd.39 reported failed by osd.17

2020-12-10T21:40:17.915755+0100 mon.pve1-prg2a (mon.0) 125960 : cluster
[DBG] osd.39 reported failed by osd.28





2020-12-10T21:40:24.151192+0100 mon.pve1-prg2a (mon.0) 125986 : cluster
[INF] Health check cleared: PG_AVAILABILITY (was: Reduced data
availability: 1 pg peering)

2020-12-10T21:40:24.608038+0100 mon.pve1-prg2a (mon.0) 125987 : cluster
[WRN] Health check update: 0 slow ops, oldest one blocked for 37 sec, osd.8
has slow ops (SLOW_OPS)

2020-12-10T21:40:25.37

[ceph-users] scrubbing+deep+repair PGs since Upgrade

2022-06-26 Thread Marcus Müller
Hi all,

we recently upgraded from Ceph Luminous (12.x) to Ceph Octopus (15.x) (of 
course with Mimic and Nautilus in between). Since this upgrade we see a 
constant number of active+clean+scrubbing+deep+repair PGs. We never had this in 
the past; now there are always around 10 or 20 PGs at the same time with the 
+repair flag. 

Does anyone know how to debug this in more detail? 
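
A possible starting point, as a sketch only (standard Ceph CLI commands;
<pgid> is a placeholder for one of the affected PGs):

ceph health detail                         # any scrub errors or damaged PGs reported?
ceph pg dump pgs | grep repair             # which PGs currently carry the repair state
ceph config get osd osd_scrub_auto_repair  # if true, deep scrubs repair errors automatically
rados list-inconsistent-obj <pgid>         # inspect inconsistent objects in one of those PGs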


Regards,
Marcus