Thanks for the suggestions. I've tried both -- setting osd_find_best_info_ignore_history_les = true and restarting all OSDs, as well as running 'ceph osd force-create-pg' -- but both PGs still show as incomplete.
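
For reference, this is roughly what I ran (just a sketch -- the ceph.conf edit plus a systemd-wide restart is how I apply settings on this Proxmox cluster, and the OSD id in the 'lost' step is a placeholder rather than the real ids):

    # ceph.conf, [osd] section -- temporary, reverted afterwards as Alexandre noted
    osd_find_best_info_ignore_history_les = true

    # restart the OSDs on each node so the flag is used during peering
    systemctl restart ceph-osd.target

    # mark the unrecoverable OSDs listed in down_osds_we_would_probe as lost
    ceph osd lost <osd-id> --yes-i-really-mean-it

    # try to reset the two stuck PGs
    ceph osd force-create-pg 18.c
    ceph osd force-create-pg 18.1e

    # re-capture the pg queries linked below
    ceph pg 18.c query
    ceph pg 18.1e query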
Cluster health still shows:

    PG_AVAILABILITY Reduced data availability: 2 pgs inactive, 2 pgs incomplete
        pg 18.c is incomplete, acting [32,48,58,40,13,44,61,59,30,27,43,37] (reducing pool ec84-hdd-zm min_size from 8 may help; search ceph.com/docs for 'incomplete')
        pg 18.1e is incomplete, acting [50,49,41,58,60,46,52,37,34,63,57,16] (reducing pool ec84-hdd-zm min_size from 8 may help; search ceph.com/docs for 'incomplete')

The OSDs in down_osds_we_would_probe have already been marked lost.

When I ran the force-create-pg command, the PGs went to peering for a few seconds, but then went back to incomplete.

Updated ceph pg 18.1e query: https://pastebin.com/XgZHvJXu
Updated ceph pg 18.c query: https://pastebin.com/N7xdQnhX

Any other suggestions?

Thanks again,

Daniel

On Sat, Mar 2, 2019 at 3:44 PM Paul Emmerich <paul.emmer...@croit.io> wrote:
> On Sat, Mar 2, 2019 at 5:49 PM Alexandre Marangone
> <a.marang...@gmail.com> wrote:
> >
> > If you have no way to recover the drives, you can try to reboot the OSDs
> > with `osd_find_best_info_ignore_history_les = true` (revert it afterwards);
> > you'll lose data. If after this the PGs are down, you can mark the OSDs
> > blocking the PGs from becoming active as lost.
>
> This should work for PG 18.1e, but not for 18.c. Try running "ceph osd
> force-create-pg <pgid>" to reset the PGs instead.
> Data will obviously be lost afterwards.
>
> Paul
>
> > On Sat, Mar 2, 2019 at 6:08 AM Daniel K <satha...@gmail.com> wrote:
> >>
> >> They all just started having read errors. Bus resets. Slow reads. Which
> >> is one of the reasons the cluster didn't recover fast enough to compensate.
> >>
> >> I tried to be mindful of the drive type and specifically avoided the
> >> larger-capacity Seagates that are SMR. Used 1 SM863 for every 6 drives for
> >> the WAL.
> >>
> >> Not sure why they failed. The data isn't critical at this point, just
> >> need to get the cluster back to normal.
> >>
> >> On Sat, Mar 2, 2019, 9:00 AM <jes...@krogh.cc> wrote:
> >>>
> >>> Did they break, or did something go wrong trying to replace them?
> >>>
> >>> Jesper
> >>>
> >>> Sent from myMail for iOS
> >>>
> >>> Saturday, 2 March 2019, 14.34 +0100 from Daniel K <satha...@gmail.com>:
> >>>
> >>> I bought the wrong drives trying to be cheap. They were 2TB WD Blue
> >>> 5400rpm 2.5-inch laptop drives.
> >>>
> >>> They've been replaced now with HGST 10K 1.8TB SAS drives.
> >>>
> >>> On Sat, Mar 2, 2019, 12:04 AM <jes...@krogh.cc> wrote:
> >>>
> >>> Saturday, 2 March 2019, 04.20 +0100 from satha...@gmail.com <satha...@gmail.com>:
> >>>
> >>> 56 OSD, 6-node 12.2.5 cluster on Proxmox
> >>>
> >>> We had multiple drives fail (about 30%) within a few days of each
> >>> other, likely faster than the cluster could recover.
> >>>
> >>> How did so many drives break?
> >>>
> >>> Jesper
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com