56-OSD, 6-node Ceph 12.2.5 cluster on Proxmox

We had multiple drives fail (about 30% of them) within a few days of each
other, likely faster than the cluster could recover.

After the dust settled, we were left with 2 out of 896 PGs stuck inactive.
The failed drives are completely inaccessible, so I can't mount them and
attempt to export the PGs.

Do I have any options besides just considering them lost -- and if not, how
do I tell Ceph they are lost so I can get the cluster back to a healthy
state? I have already reduced min_size from 9 to 8 and can't reduce it any
further. The OSDs in "down_osds_we_would_probe" have all already been
marked as lost (ceph osd lost xx).
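
For reference, this is roughly what has already been run (the OSD id below
is a placeholder; the lost command was repeated for each entry in
"down_osds_we_would_probe"):

  ceph osd pool set ec84-hdd-zm min_size 8
  ceph osd lost <osd-id> --yes-i-really-mean-it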

ceph health detail:
<snip>
PG_AVAILABILITY Reduced data availability: 2 pgs inactive, 2 pgs incomplete
    pg 18.c is incomplete, acting [32,48,58,40,13,44,61,59,30,27,43,37]
(reducing pool ec84-hdd-zm min_size from 8 may help; search ceph.com/docs
for 'incomplete')
    pg 18.1e is incomplete, acting [50,49,41,58,60,46,52,37,34,63,57,16]
(reducing pool ec84-hdd-zm min_size from 8 may help; search ceph.com/docs
for 'incomplete')
<snip>

root@pve4:~# ceph osd erasure-code-profile get ec-84-hdd
crush-device-class=
crush-failure-domain=host
crush-root=default
k=8
m=4
plugin=isa
technique=reed_sol_van
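
For context on the min_size limit mentioned above: with this profile each PG
has k + m = 8 + 4 = 12 shards, and a PG needs at least k = 8 shards available
to serve I/O, so min_size on this pool can't go below 8.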

Results of ceph pg 18.c query: https://pastebin.com/V8nByRF6
Results of ceph pg 18.1e query: https://pastebin.com/rBWwPYUn

Thanks

Dan