Thanks for the suggestions.

I've tried both -- setting osd_find_best_info_ignore_history_les = true and
restarting all OSDs, as well as 'ceph osd force-create-pg' -- but both PGs
still show as incomplete.
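
For reference, roughly what I did (just a sketch -- the option went into
ceph.conf on the OSD hosts before the restart, and the PG ids are the two
incomplete ones from the health output below):

    # added to ceph.conf on each OSD host, then restarted every OSD
    [osd]
    osd_find_best_info_ignore_history_les = true

    # and, as a separate attempt, resetting both PGs
    ceph osd force-create-pg 18.c
    ceph osd force-create-pg 18.1e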

PG_AVAILABILITY Reduced data availability: 2 pgs inactive, 2 pgs incomplete
    pg 18.c is incomplete, acting [32,48,58,40,13,44,61,59,30,27,43,37]
(reducing pool ec84-hdd-zm min_size from 8 may help; search ceph.com/docs
for 'incomplete')
    pg 18.1e is incomplete, acting [50,49,41,58,60,46,52,37,34,63,57,16]
(reducing pool ec84-hdd-zm min_size from 8 may help; search ceph.com/docs
for 'incomplete')
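
The health output suggests lowering min_size, but ec84-hdd-zm looks like an
8+4 pool (12 OSDs in each acting set), so min_size should already be at k=8.
These are the commands I'd use to double-check (the profile name is a
placeholder):

    ceph osd pool get ec84-hdd-zm min_size
    ceph osd pool get ec84-hdd-zm erasure_code_profile
    ceph osd erasure-code-profile get <profile-name>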


The OSDs listed in down_osds_we_would_probe have already been marked lost.
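
(I pulled the list from the pg query output and marked each one lost, along
these lines -- the OSD id here is just a placeholder:)

    ceph pg 18.c query | grep -A 20 down_osds_we_would_probe
    ceph osd lost <osd-id> --yes-i-really-mean-it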

When I ran the force-create-pg command, the PGs went to peering for a few
seconds, but then went back to incomplete.
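
(I was watching the state with something like the following while the command
ran; it flipped to peering briefly and then settled back to incomplete:)

    ceph pg 18.c query | grep '"state"'
    ceph pg dump pgs_brief | grep '^18\.'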

Updated ceph pg 18.1e query https://pastebin.com/XgZHvJXu
Updated ceph pg 18.c query https://pastebin.com/N7xdQnhX

Any other suggestions?



Thanks again,

Daniel



On Sat, Mar 2, 2019 at 3:44 PM Paul Emmerich <paul.emmer...@croit.io> wrote:

> On Sat, Mar 2, 2019 at 5:49 PM Alexandre Marangone
> <a.marang...@gmail.com> wrote:
> >
> > If you have no way to recover the drives, you can try to reboot the OSDs
> > with `osd_find_best_info_ignore_history_les = true` (revert it afterwards);
> > you'll lose data. If after this, the PGs are down, you can mark the OSDs
> > that are blocking the PGs from becoming active as lost.
>
> this should work for PG 18.1e, but not for 18.c. Try running "ceph osd
> force-create-pg <pgid>" to reset the PGs instead.
> Data will obviously be lost afterwards.
>
> Paul
>
> >
> > On Sat, Mar 2, 2019 at 6:08 AM Daniel K <satha...@gmail.com> wrote:
> >>
> >> They all just started having read errors. Bus resets. Slow reads. Which
> >> is one of the reasons the cluster didn't recover fast enough to compensate.
> >>
> >> I tried to be mindful of the drive type and specifically avoided the
> >> larger capacity Seagates that are SMR. Used 1 SM863 for every 6 drives for
> >> the WAL.
> >>
> >> Not sure why they failed. The data isn't critical at this point, just
> >> need to get the cluster back to normal.
> >>
> >> On Sat, Mar 2, 2019, 9:00 AM <jes...@krogh.cc> wrote:
> >>>
> >>> Did they break, or did something go wrong trying to replace them?
> >>>
> >>> Jesper
> >>>
> >>>
> >>>
> >>> Sent from myMail for iOS
> >>>
> >>>
> >>> Saturday, 2 March 2019, 14.34 +0100 from Daniel K <satha...@gmail.com>:
> >>>
> >>> I bought the wrong drives trying to be cheap. They were 2TB WD Blue
> >>> 5400rpm 2.5 inch laptop drives.
> >>>
> >>> They've been replaced now with HGST 10K 1.8TB SAS drives.
> >>>
> >>>
> >>>
> >>> On Sat, Mar 2, 2019, 12:04 AM <jes...@krogh.cc> wrote:
> >>>
> >>>
> >>>
> >>> Saturday, 2 March 2019, 04.20 +0100 from satha...@gmail.com <satha...@gmail.com>:
> >>>
> >>> 56 OSD, 6-node 12.2.5 cluster on Proxmox
> >>>
> >>> We had multiple drives fail (about 30%) within a few days of each
> >>> other, likely faster than the cluster could recover.
> >>>
> >>>
> >>> How did so many drives break?
> >>>
> >>> Jesper
> >>
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
