Wait, first just restart the leader mon.

See: https://tracker.ceph.com/issues/47380 for a related issue.
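
In case it helps, here is a rough sketch of one way to do that, assuming
systemd-managed mons (the quorum_leader_name field shows which mon is the
leader; substitute your actual mon host):

    # find the current quorum leader
    ceph quorum_status -f json-pretty | grep quorum_leader_name
    # restart that mon on its host, e.g. if ceph-01 is the leader:
    systemctl restart ceph-mon@ceph-01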

-- dan

On Mon, May 3, 2021 at 2:55 PM Vladimir Sigunov
<vladimir.sigu...@gmail.com> wrote:
>
> Hi Frank,
> Yes, I would purge the OSD. The cluster looks absolutely healthy except for
> this osd.580. The purge will probably help the cluster forget this faulty
> one. I would also restart the monitors.
> With the amount of data you maintain in your cluster, I don't think your
> ceph.conf contains any settings for individual OSDs, but if it does, don't
> forget to remove the osd.580 section from ceph.conf.
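> 
> A rough sketch of the purge, assuming osd.580 is confirmed down and out
> (purge removes the OSD from the CRUSH map and deletes its auth key and id):
> 
>     ceph osd purge 580 --yes-i-really-mean-it
>     # then restart the monitors one at a time, e.g.:
>     systemctl restart ceph-mon@ceph-01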
>
> ________________________________
> From: Frank Schilder <fr...@dtu.dk>
> Sent: Monday, May 3, 2021 8:37:09 AM
> To: Vladimir Sigunov <vladimir.sigu...@gmail.com>; ceph-users@ceph.io 
> <ceph-users@ceph.io>
> Subject: Re: OSD slow ops warning not clearing after OSD down
>
> Hi Vladimir,
>
> thanks for your reply. I did check; apart from the stuck slow-ops warning,
> the cluster is healthy:
>
> [root@gnosis ~]# ceph status
>   cluster:
>     id:     ---
>     health: HEALTH_WARN
>             430 slow ops, oldest one blocked for 36 sec, osd.580 has slow ops
>
>   services:
>     mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
>     mgr: ceph-01(active), standbys: ceph-02, ceph-03
>     mds: con-fs2-2/2/2 up  {0=ceph-08=up:active,1=ceph-12=up:active}, 2 up:standby
>     osd: 584 osds: 578 up, 578 in
>
>   data:
>     pools:   11 pools, 3215 pgs
>     objects: 610.3 M objects, 1.2 PiB
>     usage:   1.5 PiB used, 4.6 PiB / 6.0 PiB avail
>     pgs:     3191 active+clean
>              13   active+clean+scrubbing+deep
>              9    active+clean+snaptrim_wait
>              2    active+clean+snaptrim
>
>   io:
>     client:   358 MiB/s rd, 56 MiB/s wr, 2.35 kop/s rd, 1.32 kop/s wr
>
> [root@gnosis ~]# ceph health detail
> HEALTH_WARN 430 slow ops, oldest one blocked for 36 sec, osd.580 has slow ops
> SLOW_OPS 430 slow ops, oldest one blocked for 36 sec, osd.580 has slow ops
>
> OSD 580 is down+out and the message does not even increment the seconds. It's
> probably stuck in some part of the health checking that tries to query osd.580
> and doesn't understand that a down OSD cannot have any ops in flight.
>
> I tried to restart the OSD on this disk, but the disk seems completely dead.
> The iDRAC log on the server says that the disk was removed during operation,
> possibly due to a physical connection failure on the SAS lanes. I somehow need
> to get rid of this message and am wondering whether purging the OSD would help.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Vladimir Sigunov <vladimir.sigu...@gmail.com>
> Sent: 03 May 2021 13:45:19
> To: ceph-users@ceph.io; Frank Schilder
> Subject: Re: OSD slow ops warning not clearing after OSD down
>
> Hi Frank.
> Check your cluster for inactive/incomplete placement groups. I saw similar
> behavior on Octopus when some PGs were stuck in an incomplete, inactive, or
> peering state.
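> 
> A quick way to check, as a sketch:
> 
>     ceph pg dump_stuck inactive
>     ceph pg ls incomplete
>     ceph pg ls peering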
>
> ________________________________
> From: Frank Schilder <fr...@dtu.dk>
> Sent: Monday, May 3, 2021 3:42:48 AM
> To: ceph-users@ceph.io <ceph-users@ceph.io>
> Subject: [ceph-users] OSD slow ops warning not clearing after OSD down
>
> Dear cephers,
>
> I have a strange problem. An OSD went down and recovery finished. For some 
> reason, I have a slow ops warning for the failed OSD stuck in the system:
>
>     health: HEALTH_WARN
>             430 slow ops, oldest one blocked for 36 sec, osd.580 has slow ops
>
> The OSD is auto-out:
>
> | 580 | ceph-22 |    0  |    0  |    0   |     0   |    0   |     0   | autoout,exists |
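> 
> (For reference, a sketch of a couple of commands to double-check the OSD's
> state before removing it:)
> 
>     ceph osd dump | grep '^osd.580 '
>     ceph osd find 580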
>
> It is probably a warning dating back to just before the failure. How can I
> clear it?
>
> Thanks and best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
