Thanks,
I was able to get things back into a good state. I had to restart a few OSDs,
and at one point I noticed that all of the PGs preventing full recovery
involved osd.8. I removed that OSD and things moved forward. I reviewed the
RAID controller logs for that OSD, and although the disk was still listed as
healthy, I found errors in the controller log that must have been causing
problems reading some of the data.
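
In case it is useful to anyone hitting this later, the removal followed the
usual Luminous out-and-remove sequence, roughly the following (the OSD id and
systemd unit name are specific to our deployment):

    # mark the OSD out, stop its daemon on the host, then remove it from the cluster
    ceph osd out 8
    systemctl stop ceph-osd@8
    ceph osd crush remove osd.8
    ceph auth del osd.8
    ceph osd rm 8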

Thanks again.

Shain

On 7/23/21, 3:35 PM, "dhils...@performair.com" <dhils...@performair.com> wrote:

    Shain;

    These lines look bad:
    14 scrub errors
    Reduced data availability: 2 pgs inactive
    Possible data damage: 8 pgs inconsistent
    osd.95 (root=default,host=hqosd8) is down
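
    For the scrub errors / inconsistent PGs, once peering settles you can inspect
    and repair them one PG at a time. The usual Luminous workflow is roughly the
    following (substitute each inconsistent PG id from your health detail output):

        # show what deep-scrub flagged for one of the inconsistent PGs
        rados list-inconsistent-obj 3.1ca --format=json-pretty
        # then ask the primary to repair that PG
        ceph pg repair 3.1ca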

    I suspect you ran into a hardware issue with one or more drives in some of the servers that did not go offline.

    osd.95 is offline; you need to resolve this.
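
    Assuming a systemd deployment, a quick check-and-restart on its host (hqosd8)
    would look something like:

        # see why the daemon stopped, then try to bring it back up
        systemctl status ceph-osd@95
        journalctl -u ceph-osd@95 --since "1 hour ago"
        systemctl restart ceph-osd@95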

    You should fix your tunables when you can (probably not part of your current issues).
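
    When you do get to the tunables, it is a single command, but raising the
    profile will trigger a substantial rebalance, so plan for the data movement:

        # move CRUSH tunables to the current optimal profile (expect heavy backfill)
        ceph osd crush tunables optimal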

    Thank you,

    Dominic L. Hilsbos, MBA 
    Vice President – Information Technology 
    Perform Air International Inc.
    dhils...@performair.com 
    
    http://www.PerformAir.com
 

    -----Original Message-----
    From: Shain Miley [mailto:smi...@npr.org] 
    Sent: Friday, July 23, 2021 10:48 AM
    To: ceph-users@ceph.io
    Subject: [ceph-users] Luminous won't fully recover

    We recently had a few Ceph nodes go offline, which required a reboot. I have been able to get the cluster back to the state listed below; however, it does not seem to progress past the point of 23473/287823588 objects misplaced.



    Yesterday about 13% of the data was misplaced; this morning it has gotten down to 0.008%, but it has not moved past this point in about an hour.



    Does anyone see anything in the output below that points to the problem, and/or are there any suggestions I can follow to figure out why the cluster health is not moving beyond this point?





    ---------------------------------------------------

    root@rbd1:~# ceph -s

    cluster:

        id:     504b5794-34bd-44e7-a8c3-0494cf800c23

        health: HEALTH_ERR

                crush map has legacy tunables (require argonaut, min is firefly)

                23473/287823588 objects misplaced (0.008%)

                14 scrub errors

                Reduced data availability: 2 pgs inactive

                Possible data damage: 8 pgs inconsistent



      services:

        mon: 3 daemons, quorum hqceph1,hqceph2,hqceph3

        mgr: hqceph2(active), standbys: hqceph3

        osd: 288 osds: 270 up, 270 in; 2 remapped pgs

        rgw: 1 daemon active



      data:

        pools:   17 pools, 9411 pgs

        objects: 95.95M objects, 309TiB

        usage:   936TiB used, 627TiB / 1.53PiB avail

        pgs:     0.021% pgs not active

                 23473/287823588 objects misplaced (0.008%)

                 9369 active+clean

                 30   active+clean+scrubbing+deep

                 8    active+clean+inconsistent

                 2    activating+remapped

                 2    active+clean+scrubbing



      io:

        client:   1000B/s rd, 0B/s wr, 0op/s rd, 0op/s wr



    root@rbd1:~# ceph health detail

    HEALTH_ERR crush map has legacy tunables (require argonaut, min is firefly); 1 osds down; 23473/287823588 objects misplaced (0.008%); 14 scrub errors; Reduced data availability: 3 pgs inactive, 13 pgs peering; Possible data damage: 8 pgs inconsistent; Degraded data redundancy: 408658/287823588 objects degraded (0.142%), 38 pgs degraded

    OLD_CRUSH_TUNABLES crush map has legacy tunables (require argonaut, min is firefly)

        see http://docs.ceph.com/docs/master/rados/operations/crush-map/#tunables

    OSD_DOWN 1 osds down

        osd.95 (root=default,host=hqosd8) is down

    OBJECT_MISPLACED 23473/287823588 objects misplaced (0.008%)

    OSD_SCRUB_ERRORS 14 scrub errors

    PG_AVAILABILITY Reduced data availability: 3 pgs inactive, 13 pgs peering

        pg 3.b41 is stuck peering for 106.682058, current state peering, last acting [204,190]

        pg 3.c33 is stuck peering for 103.403643, current state peering, last acting [228,274]

        pg 3.d15 is stuck peering for 128.537454, current state peering, last acting [286,24]

        pg 3.fa9 is stuck peering for 106.526146, current state peering, last acting [286,47]

        pg 3.fb7 is stuck peering for 105.878878, current state peering, last acting [62,97]

        pg 3.13a2 is stuck peering for 106.491138, current state peering, last acting [270,219]

        pg 3.1521 is stuck inactive for 170180.165265, current state activating+remapped, last acting [94,186,188]

        pg 3.1565 is stuck peering for 106.782784, current state peering, last acting [121,60]

        pg 3.157c is stuck peering for 128.557448, current state peering, last acting [128,268]

        pg 3.1744 is stuck peering for 106.639603, current state peering, last acting [192,142]

        pg 3.1ac8 is stuck peering for 127.839550, current state peering, last acting [221,190]

        pg 3.1e24 is stuck peering for 128.201670, current state peering, last acting [118,158]

        pg 3.1e46 is stuck inactive for 169121.764376, current state activating+remapped, last acting [87,199,170]

        pg 18.36 is stuck peering for 128.554121, current state peering, last acting [204]

        pg 21.1ce is stuck peering for 106.582584, current state peering, last acting [266,192]

    PG_DAMAGED Possible data damage: 8 pgs inconsistent

        pg 3.1ca is active+clean+inconsistent, acting [201,8,180]

        pg 3.56a is active+clean+inconsistent, acting [148,240,8]

        pg 3.b0f is active+clean+inconsistent, acting [148,260,8]

        pg 3.b56 is active+clean+inconsistent, acting [218,8,240]

        pg 3.10ff is active+clean+inconsistent, acting [262,8,211]

        pg 3.1192 is active+clean+inconsistent, acting [192,8,187]

        pg 3.124a is active+clean+inconsistent, acting [123,8,222]

        pg 3.1c55 is active+clean+inconsistent, acting [180,8,287]

    PG_DEGRADED Degraded data redundancy: 408658/287823588 objects degraded (0.142%), 38 pgs degraded

        pg 3.8f is active+undersized+degraded, acting [163,149]

        pg 3.ba is active+undersized+degraded, acting [68,280]

        pg 3.1aa is active+undersized+degraded, acting [176,211]

        pg 3.29e is active+undersized+degraded, acting [241,194]

        pg 3.323 is active+undersized+degraded, acting [78,194]

        pg 3.343 is active+undersized+degraded, acting [242,144]

        pg 3.4ae is active+undersized+degraded, acting [153,237]

        pg 3.524 is active+undersized+degraded, acting [252,222]

        pg 3.5c9 is active+undersized+degraded, acting [272,252]

        pg 3.713 is active+undersized+degraded, acting [273,80]

        pg 3.730 is active+undersized+degraded, acting [235,212]

        pg 3.88f is active+undersized+degraded, acting [222,285]

        pg 3.8cb is active+undersized+degraded, acting [285,20]

        pg 3.9a0 is active+undersized+degraded, acting [240,200]

        pg 3.c19 is active+undersized+degraded, acting [165,276]

        pg 3.ec8 is active+undersized+degraded, acting [158,40]

        pg 3.1025 is active+undersized+degraded, acting [258,274]

        pg 3.1058 is active+undersized+degraded, acting [38,68]

        pg 3.14e4 is active+undersized+degraded, acting [185,39]

        pg 3.150c is active+undersized+degraded, acting [138,140]

        pg 3.1545 is active+undersized+degraded, acting [222,55]

        pg 3.15a6 is active+undersized+degraded, acting [242,272]

        pg 3.1620 is active+undersized+degraded, acting [200,164]

        pg 3.1710 is active+undersized+degraded, acting [176,285]

        pg 3.1792 is active+undersized+degraded, acting [190,11]

        pg 3.17bd is active+undersized+degraded, acting [207,15]

        pg 3.17da is active+undersized+degraded, acting [5,160]

        pg 3.183e is active+undersized+degraded, acting [273,136]

        pg 3.197d is active+undersized+degraded, acting [241,139]

        pg 3.1a3d is active+undersized+degraded, acting [184,121]

        pg 3.1ba6 is active+undersized+degraded, acting [47,249]

        pg 3.1c2b is active+undersized+degraded, acting [268,80]

        pg 3.1ca2 is active+undersized+degraded, acting [280,152]

        pg 3.1cd4 is active+undersized+degraded, acting [2,129]

        pg 3.1e13 is active+undersized+degraded, acting [247,114]

        pg 12.56 is active+undersized+degraded, acting [54]

        pg 18.8 is undersized+degraded+peered, acting [260]

        pg 21.9f is active+undersized+degraded, acting [215,201]
    
--------------------------------------------------------------------------------------------------


    Thanks,
    Shain

    Shain Miley | Director of Platform and Infrastructure | Digital Media | smi...@npr.org
    _______________________________________________
    ceph-users mailing list -- ceph-users@ceph.io
    To unsubscribe send an email to ceph-users-le...@ceph.io

