Hello list,
We have a Ceph cluster (17.2.6 Quincy) with 2 admin nodes and 6 storage nodes,
each storage node connected to a JBOD enclosure. Each enclosure houses 28 HDDs
of 18 TB each, for a total of 168 OSDs. The pool that holds the majority of
the data is erasure-coded (4+2). We recently had one disk failure, which
brought one OSD down:
# ceph osd tree | grep down
2 hdd 16.49579 osd.2 down 0 1.00000
This OSD is out of the cluster, but we haven't replaced it physically yet. The
problem that we are facing is that the cluster was not in the best shape when
this OSD failed. Currently we have the following:
################################################
  cluster:
    id:     <redacted>
    health: HEALTH_ERR
            1026 scrub errors
            Possible data damage: 18 pgs inconsistent
            2137 pgs not deep-scrubbed in time
            2137 pgs not scrubbed in time

  services:
    mon: 5 daemons, quorum xyz-admin1,xyz-admin2,xyz-osd1,xyz-osd2,xyz-osd3 (age 17M)
    mgr: xyz-admin2.sipadf(active, since 17M), standbys: xyz-admin1.nwaovh
    mds: 2/2 daemons up, 2 standby
    osd: 168 osds: 167 up (since 44h), 167 in (since 6w); 220 remapped pgs

  data:
    volumes: 2/2 healthy
    pools:   9 pools, 2137 pgs
    objects: 448.54M objects, 1.0 PiB
    usage:   1.6 PiB used, 1.1 PiB / 2.7 PiB avail
    pgs:     134404830/2676514497 objects misplaced (5.022%)
             1902 active+clean
             191  active+remapped+backfilling
             26   active+remapped+backfill_wait
             15   active+clean+inconsistent
             2    active+remapped+inconsistent+backfilling
             1    active+remapped+inconsistent+backfill_wait

  io:
    recovery: 597 MiB/s, 252 objects/s

  progress:
    Global Recovery Event (6w)
      [=========================...] (remaining: 5d)
################################################
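As a rough sanity check on the "remaining: 5d" estimate, here is a
back-of-the-envelope calculation from the numbers in the status output. It is
only a sketch: it assumes the recovery rate stays near 252 objects/s and that
this rate counts the same units as the misplaced counter, which I am not sure
of. The function names are just for illustration.

```python
# Figures copied from the status output in this message.
MISPLACED = 134_404_830      # "objects misplaced" numerator
TOTAL = 2_676_514_497        # "objects misplaced" denominator
RATE_OBJ_PER_S = 252         # "recovery: ... 252 objects/s"

def misplaced_percent(misplaced, total):
    """Share of misplaced object instances, in percent."""
    return 100.0 * misplaced / total

def naive_eta_days(misplaced, rate_obj_per_s):
    """Days to drain the misplaced count at a constant rate (assumption)."""
    return misplaced / rate_obj_per_s / 86_400

print(f"misplaced: {misplaced_percent(MISPLACED, TOTAL):.3f}%")           # 5.022%
print(f"naive ETA: {naive_eta_days(MISPLACED, RATE_OBJ_PER_S):.1f} days")  # 6.2 days
```

So at the current rate the naive figure is roughly 6 days, which is in the
same range as the 5d reported by the progress module.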
I have noticed the number of active+clean PGs increasing (it was ~1750 two days
ago) and the misplaced object count decreasing very slowly. My question is:
should I wait until recovery is complete, then repair the 18 inconsistent PGs,
and only then replace the disk? My thinking is that replacing the disk will
trigger more backfilling, which will slow down the recovery even more.
Another question: should I disable scrubbing until the recovery has
finished?
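To be concrete about what I mean by disabling scrubbing: as far as I
understand, it would be the cluster-wide noscrub and nodeep-scrub OSD flags.
Below is a minimal sketch of how I would toggle them. The helper names are
hypothetical, and it only prints the commands unless dry_run is turned off.

```python
import shlex
import subprocess

# Cluster-wide OSD flags that pause regular and deep scrubs.
SCRUB_FLAGS = ("noscrub", "nodeep-scrub")

def scrub_flag_commands(disable):
    """Build the `ceph osd set/unset` commands for both scrub flags."""
    action = "set" if disable else "unset"
    return [["ceph", "osd", action, flag] for flag in SCRUB_FLAGS]

def apply(commands, dry_run=True):
    """Print each command; run it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)

# Dry run: shows what would be executed to pause scrubbing.
apply(scrub_flag_commands(disable=True))
```

Calling scrub_flag_commands(disable=False) gives the matching unset commands
for after the recovery.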
Thank you for any insights you may be able to provide!
-
Gustavo
_______________________________________________
ceph-users mailing list