56 OSDs, 6-node 12.2.5 cluster on Proxmox. We had multiple drives fail (about 30%) within a few days of each other, likely faster than the cluster could recover.
After the dust settled, we have 2 out of 896 PGs stuck inactive. The failed drives are completely inaccessible, so I can't mount them and attempt to export the PGs. Do I have any options besides just considering them lost -- and if so, how do I tell Ceph they are lost so that I can get my cluster back to normal? I already reduced min_size from 9 to 8 and can't reduce it any further. The OSDs listed in "down_osds_we_would_probe" have all already been marked as lost (ceph osd lost xx).

ceph health detail:

<snip>
PG_AVAILABILITY Reduced data availability: 2 pgs inactive, 2 pgs incomplete
    pg 18.c is incomplete, acting [32,48,58,40,13,44,61,59,30,27,43,37] (reducing pool ec84-hdd-zm min_size from 8 may help; search ceph.com/docs for 'incomplete')
    pg 18.1e is incomplete, acting [50,49,41,58,60,46,52,37,34,63,57,16] (reducing pool ec84-hdd-zm min_size from 8 may help; search ceph.com/docs for 'incomplete')
<snip>

root@pve4:~# ceph osd erasure-code-profile get ec-84-hdd
crush-device-class=
crush-failure-domain=host
crush-root=default
k=8
m=4
plugin=isa
technique=reed_sol_van

Results of ceph pg 18.c query: https://pastebin.com/V8nByRF6
Results of ceph pg 18.1e query: https://pastebin.com/rBWwPYUn

Thanks
Dan
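For reference, the steps already taken amount to the following (a recap of what's described above, not new commands; "xx" is a placeholder for each failed OSD's id, as in the original text):

```shell
# Lower min_size on the EC pool from 9 to 8 (with k=8 it cannot go lower,
# since an erasure-coded object needs at least k shards to be readable)
ceph osd pool set ec84-hdd-zm min_size 8

# Mark each dead OSD as lost -- repeated once per failed OSD id
ceph osd lost xx --yes-i-really-mean-it
```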
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com