[ceph-users] Re: OSD slow ops warning not clearing after OSD down

2023-01-16 Thread Christian Rohmann

Hello,

On 04/05/2021 09:49, Frank Schilder wrote:

I created a ticket: https://tracker.ceph.com/issues/50637


We just observed this very issue on Pacific (16.2.10), and I have also
commented on the ticket.
I wonder whether this case is really so rare: first some issues causing
slow ops, and then a total failure of the OSD?



It would be nice to fix this though, so the warning status is not "blocked" by
something that's not actually a warning.
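As a stop-gap (not a fix), and assuming the stuck check is the SLOW_OPS code shown
by "ceph health detail", releases that support health mutes (Octopus and later, if
I remember correctly) can at least silence it while the tracker issue is open:

# confirm the health check code that is stuck
ceph health detail

# mute the stuck SLOW_OPS warning for one week (the mute expires after the TTL)
ceph health mute SLOW_OPS 1w

# lift the mute once the warning clears properly
ceph health unmute SLOW_OPS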




Regards


Christian
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD slow ops warning not clearing after OSD down

2021-05-04 Thread Frank Schilder
I created a ticket: https://tracker.ceph.com/issues/50637

Hope a purge will do the trick.
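For reference, a minimal sketch of what the purge could look like for osd.580 once
the disk is given up on (commands from memory, double-check before running):

# make sure the OSD really is down and out
ceph osd tree down | grep 'osd.580'

# remove it from the CRUSH map and drop its auth key and OSD entry in one go
ceph osd purge 580 --yes-i-really-mean-it

# check that nothing still refers to osd.580
ceph osd tree | grep 'osd.580'
grep -n 'osd.580' /etc/ceph/ceph.conf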

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Frank Schilder 
Sent: 03 May 2021 15:21:38
To: Dan van der Ster; Vladimir Sigunov
Cc: ceph-users@ceph.io
Subject: [ceph-users] Re: OSD slow ops warning not clearing after OSD down

Hi Dan,

just restarted all MONs, no change though :(

Thanks for looking at this. I will wait until tomorrow. My plan is to get the
disk up again with the same OSD ID, and I would expect that this will eventually
allow the message to be cleared.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Dan van der Ster 
Sent: 03 May 2021 15:08:03
To: Vladimir Sigunov
Cc: ceph-users@ceph.io; Frank Schilder
Subject: Re: [ceph-users] Re: OSD slow ops warning not clearing after OSD down

Wait, first just restart the leader mon.

See: https://tracker.ceph.com/issues/47380 for a related issue.

-- dan

On Mon, May 3, 2021 at 2:55 PM Vladimir Sigunov
 wrote:
>
> Hi Frank,
> Yes, I would purge the OSD. The cluster looks absolutely healthy except for
> this osd.580. The purge will probably help the cluster to forget this faulty
> one. Also, I would restart the monitors.
> With the amount of data you maintain in your cluster, I don't think your
> ceph.conf contains any information about particular OSDs, but if it
> does, don't forget to remove the configuration of osd.580 from ceph.conf.
>
> Get Outlook for Android<https://aka.ms/ghei36>
>
> 
> From: Frank Schilder 
> Sent: Monday, May 3, 2021 8:37:09 AM
> To: Vladimir Sigunov ; ceph-users@ceph.io 
> 
> Subject: Re: OSD slow ops warning not clearing after OSD down
>
> Hi Vladimir,
>
> thanks for your reply. I did, the cluster is healthy:
>
> [root@gnosis ~]# ceph status
>   cluster:
> id: ---
> health: HEALTH_WARN
> 430 slow ops, oldest one blocked for 36 sec, osd.580 has slow ops
>
>   services:
> mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
> mgr: ceph-01(active), standbys: ceph-02, ceph-03
> mds: con-fs2-2/2/2 up  {0=ceph-08=up:active,1=ceph-12=up:active}, 2 
> up:standby
> osd: 584 osds: 578 up, 578 in
>
>   data:
> pools:   11 pools, 3215 pgs
> objects: 610.3 M objects, 1.2 PiB
> usage:   1.5 PiB used, 4.6 PiB / 6.0 PiB avail
> pgs: 3191 active+clean
>  13   active+clean+scrubbing+deep
>  9active+clean+snaptrim_wait
>  2active+clean+snaptrim
>
>   io:
> client:   358 MiB/s rd, 56 MiB/s wr, 2.35 kop/s rd, 1.32 kop/s wr
>
> [root@gnosis ~]# ceph health detail
> HEALTH_WARN 430 slow ops, oldest one blocked for 36 sec, osd.580 has slow ops
> SLOW_OPS 430 slow ops, oldest one blocked for 36 sec, osd.580 has slow ops
>
> OSD 580 is down+out and the message does not even increment the seconds. It's
> probably stuck in some part of the health checking that tries to query 580
> and doesn't understand that the OSD being down means there are no ops.
>
> I tried to restart the OSD on this disk, but it seems completely dead. The
> iDRAC log on the server says that the disk was removed during operation,
> possibly due to a physical connection failure on the SAS lanes. I somehow need
> to get rid of this message and am wondering if purging the OSD would help.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Vladimir Sigunov 
> Sent: 03 May 2021 13:45:19
> To: ceph-users@ceph.io; Frank Schilder
> Subject: Re: OSD slow ops warning not clearing after OSD down
>
> Hi Frank.
> Check your cluster for inactive/incomplete placement groups. I saw similar
> behavior on Octopus when some PGs were stuck in an incomplete/inactive or
> peering state.
>
> 
> From: Frank Schilder 
> Sent: Monday, May 3, 2021 3:42:48 AM
> To: ceph-users@ceph.io 
> Subject: [ceph-users] OSD slow ops warning not clearing after OSD down
>
> Dear cephers,
>
> I have a strange problem. An OSD went down and recovery finished. For some 
> reason, I have a slow ops warning for the failed OSD stuck in the system:
>
> health: HEALTH_WARN
> 430 slow ops, oldest one blocked for 36 sec, osd.580 has slow ops
>
> The OSD is auto-out:
>
> | 580 | ceph-22 | 0 | 0 | 0 | 0 | 0 | 0 | autoout,exists |
>
> It is probably a warning dating back to just before the failure. How can I clear
> it?
>
> Thanks and best regards,

[ceph-users] Re: OSD slow ops warning not clearing after OSD down

2021-05-03 Thread Frank Schilder
Hi Dan,

just restarted all MONs, no change though :(

Thanks for looking at this. I will wait until tomorrow. My plan is to get the
disk up again with the same OSD ID, and I would expect that this will eventually
allow the message to be cleared.
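For completeness, one way to redeploy the replacement disk under the same OSD ID
(a sketch for a plain ceph-volume setup; /dev/sdX is a placeholder for the new
device):

# keep the ID and CRUSH position reserved, but mark the old OSD as destroyed
ceph osd destroy 580 --yes-i-really-mean-it

# wipe the replacement disk and create the new OSD, reusing ID 580
ceph-volume lvm zap /dev/sdX --destroy
ceph-volume lvm create --osd-id 580 --data /dev/sdX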

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Dan van der Ster 
Sent: 03 May 2021 15:08:03
To: Vladimir Sigunov
Cc: ceph-users@ceph.io; Frank Schilder
Subject: Re: [ceph-users] Re: OSD slow ops warning not clearing after OSD down

Wait, first just restart the leader mon.

See: https://tracker.ceph.com/issues/47380 for a related issue.

-- dan

On Mon, May 3, 2021 at 2:55 PM Vladimir Sigunov
 wrote:
>
> Hi Frank,
> Yes, I would purge the OSD. The cluster looks absolutely healthy except for
> this osd.580. The purge will probably help the cluster to forget this faulty
> one. Also, I would restart the monitors.
> With the amount of data you maintain in your cluster, I don't think your
> ceph.conf contains any information about particular OSDs, but if it
> does, don't forget to remove the configuration of osd.580 from ceph.conf.
>
> Get Outlook for Android<https://aka.ms/ghei36>
>
> 
> From: Frank Schilder 
> Sent: Monday, May 3, 2021 8:37:09 AM
> To: Vladimir Sigunov ; ceph-users@ceph.io 
> 
> Subject: Re: OSD slow ops warning not clearing after OSD down
>
> Hi Vladimir,
>
> thanks for your reply. I did, the cluster is healthy:
>
> [root@gnosis ~]# ceph status
>   cluster:
> id: ---
> health: HEALTH_WARN
> 430 slow ops, oldest one blocked for 36 sec, osd.580 has slow ops
>
>   services:
> mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
> mgr: ceph-01(active), standbys: ceph-02, ceph-03
> mds: con-fs2-2/2/2 up  {0=ceph-08=up:active,1=ceph-12=up:active}, 2 
> up:standby
> osd: 584 osds: 578 up, 578 in
>
>   data:
> pools:   11 pools, 3215 pgs
> objects: 610.3 M objects, 1.2 PiB
> usage:   1.5 PiB used, 4.6 PiB / 6.0 PiB avail
> pgs: 3191 active+clean
>  13   active+clean+scrubbing+deep
>  9active+clean+snaptrim_wait
>  2active+clean+snaptrim
>
>   io:
> client:   358 MiB/s rd, 56 MiB/s wr, 2.35 kop/s rd, 1.32 kop/s wr
>
> [root@gnosis ~]# ceph health detail
> HEALTH_WARN 430 slow ops, oldest one blocked for 36 sec, osd.580 has slow ops
> SLOW_OPS 430 slow ops, oldest one blocked for 36 sec, osd.580 has slow ops
>
> OSD 580 is down+out and the message does not even increment the seconds. It's
> probably stuck in some part of the health checking that tries to query 580
> and doesn't understand that the OSD being down means there are no ops.
>
> I tried to restart the OSD on this disk, but it seems completely dead. The
> iDRAC log on the server says that the disk was removed during operation,
> possibly due to a physical connection failure on the SAS lanes. I somehow need
> to get rid of this message and am wondering if purging the OSD would help.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Vladimir Sigunov 
> Sent: 03 May 2021 13:45:19
> To: ceph-users@ceph.io; Frank Schilder
> Subject: Re: OSD slow ops warning not clearing after OSD down
>
> Hi Frank.
> Check your cluster for inactive/incomplete placement groups. I saw similar
> behavior on Octopus when some PGs were stuck in an incomplete/inactive or
> peering state.
>
> 
> From: Frank Schilder 
> Sent: Monday, May 3, 2021 3:42:48 AM
> To: ceph-users@ceph.io 
> Subject: [ceph-users] OSD slow ops warning not clearing after OSD down
>
> Dear cephers,
>
> I have a strange problem. An OSD went down and recovery finished. For some 
> reason, I have a slow ops warning for the failed OSD stuck in the system:
>
> health: HEALTH_WARN
> 430 slow ops, oldest one blocked for 36 sec, osd.580 has slow ops
>
> The OSD is auto-out:
>
> | 580 | ceph-22 | 0 | 0 | 0 | 0 | 0 | 0 | autoout,exists |
>
> It is probably a warning dating back to just before the failure. How can I clear
> it?
>
> Thanks and best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD slow ops warning not clearing after OSD down

2021-05-03 Thread Frank Schilder
Hi Vladimir,

thanks for your reply. I did, the cluster is healthy:

[root@gnosis ~]# ceph status
  cluster:
id: ---
health: HEALTH_WARN
430 slow ops, oldest one blocked for 36 sec, osd.580 has slow ops

  services:
mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
mgr: ceph-01(active), standbys: ceph-02, ceph-03
mds: con-fs2-2/2/2 up  {0=ceph-08=up:active,1=ceph-12=up:active}, 2 
up:standby
osd: 584 osds: 578 up, 578 in

  data:
pools:   11 pools, 3215 pgs
objects: 610.3 M objects, 1.2 PiB
usage:   1.5 PiB used, 4.6 PiB / 6.0 PiB avail
pgs: 3191 active+clean
 13   active+clean+scrubbing+deep
 9active+clean+snaptrim_wait
 2active+clean+snaptrim

  io:
client:   358 MiB/s rd, 56 MiB/s wr, 2.35 kop/s rd, 1.32 kop/s wr

[root@gnosis ~]# ceph health detail
HEALTH_WARN 430 slow ops, oldest one blocked for 36 sec, osd.580 has slow ops
SLOW_OPS 430 slow ops, oldest one blocked for 36 sec, osd.580 has slow ops

OSD 580 is down+out and the message does not even increment the seconds. It's
probably stuck in some part of the health checking that tries to query 580 and
doesn't understand that the OSD being down means there are no ops.

I tried to restart the OSD on this disk, but it seems completely dead. The
iDRAC log on the server says that the disk was removed during operation,
possibly due to a physical connection failure on the SAS lanes. I somehow need to
get rid of this message and am wondering if purging the OSD would help.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Vladimir Sigunov 
Sent: 03 May 2021 13:45:19
To: ceph-users@ceph.io; Frank Schilder
Subject: Re: OSD slow ops warning not clearing after OSD down

Hi Frank.
Check your cluster for inactive/incomplete placement groups. I saw similar
behavior on Octopus when some PGs were stuck in an incomplete/inactive or peering state.


From: Frank Schilder 
Sent: Monday, May 3, 2021 3:42:48 AM
To: ceph-users@ceph.io 
Subject: [ceph-users] OSD slow ops warning not clearing after OSD down

Dear cephers,

I have a strange problem. An OSD went down and recovery finished. For some 
reason, I have a slow ops warning for the failed OSD stuck in the system:

health: HEALTH_WARN
430 slow ops, oldest one blocked for 36 sec, osd.580 has slow ops

The OSD is auto-out:

| 580 | ceph-22 | 0 | 0 | 0 | 0 | 0 | 0 | autoout,exists |

It is probably a warning dating back to just before the failure. How can I clear
it?

Thanks and best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD slow ops warning not clearing after OSD down

2021-05-03 Thread Dan van der Ster
Wait, first just restart the leader mon.

See: https://tracker.ceph.com/issues/47380 for a related issue.

-- dan
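A minimal sketch of finding and restarting the current leader mon, assuming
systemd-managed monitors (the mon name below is just an example from this thread):

# the leader is reported in the quorum status
ceph quorum_status -f json-pretty | grep quorum_leader_name

# restart that mon daemon on its host, e.g. for mon.ceph-01
systemctl restart ceph-mon@ceph-01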

On Mon, May 3, 2021 at 2:55 PM Vladimir Sigunov
 wrote:
>
> Hi Frank,
> Yes, I would purge the OSD. The cluster looks absolutely healthy except for
> this osd.580. The purge will probably help the cluster to forget this faulty
> one. Also, I would restart the monitors.
> With the amount of data you maintain in your cluster, I don't think your
> ceph.conf contains any information about particular OSDs, but if it
> does, don't forget to remove the configuration of osd.580 from ceph.conf.
>
> Get Outlook for Android
>
> 
> From: Frank Schilder 
> Sent: Monday, May 3, 2021 8:37:09 AM
> To: Vladimir Sigunov ; ceph-users@ceph.io 
> 
> Subject: Re: OSD slow ops warning not clearing after OSD down
>
> Hi Vladimir,
>
> thanks for your reply. I did, the cluster is healthy:
>
> [root@gnosis ~]# ceph status
>   cluster:
> id: ---
> health: HEALTH_WARN
> 430 slow ops, oldest one blocked for 36 sec, osd.580 has slow ops
>
>   services:
> mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
> mgr: ceph-01(active), standbys: ceph-02, ceph-03
> mds: con-fs2-2/2/2 up  {0=ceph-08=up:active,1=ceph-12=up:active}, 2 
> up:standby
> osd: 584 osds: 578 up, 578 in
>
>   data:
> pools:   11 pools, 3215 pgs
> objects: 610.3 M objects, 1.2 PiB
> usage:   1.5 PiB used, 4.6 PiB / 6.0 PiB avail
> pgs: 3191 active+clean
>  13   active+clean+scrubbing+deep
>  9active+clean+snaptrim_wait
>  2active+clean+snaptrim
>
>   io:
> client:   358 MiB/s rd, 56 MiB/s wr, 2.35 kop/s rd, 1.32 kop/s wr
>
> [root@gnosis ~]# ceph health detail
> HEALTH_WARN 430 slow ops, oldest one blocked for 36 sec, osd.580 has slow ops
> SLOW_OPS 430 slow ops, oldest one blocked for 36 sec, osd.580 has slow ops
>
> OSD 580 is down+out and the message does not even increment the seconds. It's
> probably stuck in some part of the health checking that tries to query 580
> and doesn't understand that the OSD being down means there are no ops.
>
> I tried to restart the OSD on this disk, but it seems completely dead. The
> iDRAC log on the server says that the disk was removed during operation,
> possibly due to a physical connection failure on the SAS lanes. I somehow need
> to get rid of this message and am wondering if purging the OSD would help.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Vladimir Sigunov 
> Sent: 03 May 2021 13:45:19
> To: ceph-users@ceph.io; Frank Schilder
> Subject: Re: OSD slow ops warning not clearing after OSD down
>
> Hi Frank.
> Check your cluster for inactive/incomplete placement groups. I saw similar
> behavior on Octopus when some PGs were stuck in an incomplete/inactive or
> peering state.
>
> 
> From: Frank Schilder 
> Sent: Monday, May 3, 2021 3:42:48 AM
> To: ceph-users@ceph.io 
> Subject: [ceph-users] OSD slow ops warning not clearing after OSD down
>
> Dear cephers,
>
> I have a strange problem. An OSD went down and recovery finished. For some 
> reason, I have a slow ops warning for the failed OSD stuck in the system:
>
> health: HEALTH_WARN
> 430 slow ops, oldest one blocked for 36 sec, osd.580 has slow ops
>
> The OSD is auto-out:
>
> | 580 | ceph-22 | 0 | 0 | 0 | 0 | 0 | 0 | autoout,exists |
>
> It is probably a warning dating back to just before the failure. How can I clear
> it?
>
> Thanks and best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD slow ops warning not clearing after OSD down

2021-05-03 Thread Vladimir Sigunov
Hi Frank,
Yes, I would purge the OSD. The cluster looks absolutely healthy except for this
osd.580. The purge will probably help the cluster to forget this faulty one.
Also, I would restart the monitors.
With the amount of data you maintain in your cluster, I don't think your
ceph.conf contains any information about particular OSDs, but if it does,
don't forget to remove the configuration of osd.580 from ceph.conf.

Get Outlook for Android
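
A quick way to check whether any per-OSD configuration is actually lying around,
as suggested above (a sketch; on Mimic and later the centralized config store is
worth checking as well):

# look for an [osd.580] section or other references in ceph.conf
grep -n 'osd.580' /etc/ceph/ceph.conf

# check the centralized configuration database, too
ceph config dump | grep 'osd.580'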


From: Frank Schilder 
Sent: Monday, May 3, 2021 8:37:09 AM
To: Vladimir Sigunov ; ceph-users@ceph.io 

Subject: Re: OSD slow ops warning not clearing after OSD down

Hi Vladimir,

thanks for your reply. I did, the cluster is healthy:

[root@gnosis ~]# ceph status
  cluster:
id: ---
health: HEALTH_WARN
430 slow ops, oldest one blocked for 36 sec, osd.580 has slow ops

  services:
mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
mgr: ceph-01(active), standbys: ceph-02, ceph-03
mds: con-fs2-2/2/2 up  {0=ceph-08=up:active,1=ceph-12=up:active}, 2 
up:standby
osd: 584 osds: 578 up, 578 in

  data:
pools:   11 pools, 3215 pgs
objects: 610.3 M objects, 1.2 PiB
usage:   1.5 PiB used, 4.6 PiB / 6.0 PiB avail
pgs: 3191 active+clean
 13   active+clean+scrubbing+deep
 9active+clean+snaptrim_wait
 2active+clean+snaptrim

  io:
client:   358 MiB/s rd, 56 MiB/s wr, 2.35 kop/s rd, 1.32 kop/s wr

[root@gnosis ~]# ceph health detail
HEALTH_WARN 430 slow ops, oldest one blocked for 36 sec, osd.580 has slow ops
SLOW_OPS 430 slow ops, oldest one blocked for 36 sec, osd.580 has slow ops

OSD 580 is down+out and the message does not even increment the seconds. It's
probably stuck in some part of the health checking that tries to query 580 and
doesn't understand that the OSD being down means there are no ops.

I tried to restart the OSD on this disk, but it seems completely dead. The
iDRAC log on the server says that the disk was removed during operation,
possibly due to a physical connection failure on the SAS lanes. I somehow need to
get rid of this message and am wondering if purging the OSD would help.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Vladimir Sigunov 
Sent: 03 May 2021 13:45:19
To: ceph-users@ceph.io; Frank Schilder
Subject: Re: OSD slow ops warning not clearing after OSD down

Hi Frank.
Check your cluster for inactive/incomplete placement groups. I saw similar
behavior on Octopus when some PGs were stuck in an incomplete/inactive or peering state.


From: Frank Schilder 
Sent: Monday, May 3, 2021 3:42:48 AM
To: ceph-users@ceph.io 
Subject: [ceph-users] OSD slow ops warning not clearing after OSD down

Dear cephers,

I have a strange problem. An OSD went down and recovery finished. For some 
reason, I have a slow ops warning for the failed OSD stuck in the system:

health: HEALTH_WARN
430 slow ops, oldest one blocked for 36 sec, osd.580 has slow ops

The OSD is auto-out:

| 580 | ceph-22 | 0 | 0 | 0 | 0 | 0 | 0 | autoout,exists |

It is probably a warning dating back to just before the failure. How can I clear
it?

Thanks and best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD slow ops warning not clearing after OSD down

2021-05-03 Thread Vladimir Sigunov
Hi Frank.
Check your cluster for inactive/incomplete placement groups. I saw similar
behavior on Octopus when some PGs were stuck in an incomplete/inactive or peering state.
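
A couple of quick checks for that (a sketch):

# list PGs that are not active+clean
ceph pg dump_stuck inactive unclean

# or look at specific states directly
ceph pg ls incomplete
ceph pg ls peering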


From: Frank Schilder 
Sent: Monday, May 3, 2021 3:42:48 AM
To: ceph-users@ceph.io 
Subject: [ceph-users] OSD slow ops warning not clearing after OSD down

Dear cephers,

I have a strange problem. An OSD went down and recovery finished. For some 
reason, I have a slow ops warning for the failed OSD stuck in the system:

health: HEALTH_WARN
430 slow ops, oldest one blocked for 36 sec, osd.580 has slow ops

The OSD is auto-out:

| 580 | ceph-22 |0  |0  |0   | 0   |0   | 0   | 
autoout,exists |

It is probably a warning dating back to just before the fail. How can I clear 
it?

Thanks and best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io