[ceph-users] Re: OSD_TOO_MANY_REPAIRS on random OSDs causing clients to hang

2023-04-26 Thread Thomas Hukkelberg
Hi!

There are no kernel log messages indicating read errors on the disk, and the error is not tied to one specific OSD. The errors have so far occurred on 7 different OSDs, and when we restart the OSD with errors, they reappear on one of the other OSDs in the same PG. As you can see below, after restarting osd.34 the errors continue to appear on osd.284, which holds the same PG:

HEALTH_WARN Too many repaired reads on 2 OSDs; 1 slow ops, oldest one blocked for 9138 sec, osd.284 has slow ops
[WRN] OSD_TOO_MANY_REPAIRS: Too many repaired reads on 2 OSDs
osd.284 had 172635 reads repaired
osd.34 had 26907 reads repaired
[WRN] SLOW_OPS: 1 slow ops, oldest one blocked for 9138 sec, osd.284 has slow ops
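
For reference, this is roughly how we map the PG to its OSDs and restart the primary (a sketch; it assumes a systemd-managed, non-cephadm install, with osd.34 used only as the example from above):

$ ceph pg map 42.e2                # show the up/acting OSD set for the PG
$ systemctl restart ceph-osd@34    # on the host that carries osd.34
$ ceph health detail               # the warning then typically reappears on another acting OSD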

Also, the curious thing is that it only occurs in pool id 42...
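
A quick way to see whether any other PGs in that pool are unhealthy (sketch; qa-cephfs_data being the pool name from the osd.34 log quoted further down):

$ ceph osd pool ls detail | grep '^pool 42 '                  # confirm which pool has id 42
$ ceph pg ls-by-pool qa-cephfs_data | grep -v 'active+clean'  # list PGs in the pool that are not clean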


The only error we saw on the node where we replaced the motherboard:
[Sat Mar 25 21:49:31 2023] mce: [Hardware Error]: Machine check events logged
[Tue Mar 28 20:00:28 2023] mce: [Hardware Error]: Machine check events logged
[Wed Apr 19 01:50:41 2023] mce: [Hardware Error]: Machine check events logged

As we understand it, mce: [Hardware Error] suggests a memory error or some other type of hardware fault.
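
If rasdaemon or the EDAC utilities happen to be installed (an assumption on our part, not verified on these hosts), the memory error counters could be inspected with something like:

$ ras-mc-ctl --summary    # rasdaemon: summary of logged memory controller / MCE events
$ edac-util -v            # per-DIMM corrected/uncorrected error counters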


--thomas



> On 26 Apr 2023, at 13:55, Robert Sander wrote:
> 
> On 26.04.23 13:24, Thomas Hukkelberg wrote:
> 
>> [WRN] OSD_TOO_MANY_REPAIRS: Too many repaired reads on 1 OSDs
>> osd.34 had 9936 reads repaired
> 
> Are there any messages in the kernel log that indicate this device has read 
> errors? Have you considered replacing the disk?
> 
> Regards
> -- 
> Robert Sander
> Heinlein Consulting GmbH
> Schwedter Str. 8/9b, 10119 Berlin
> 
> https://www.heinlein-support.de
> 
> Tel: 030 / 405051-43
> Fax: 030 / 405051-19
> 
> Amtsgericht Berlin-Charlottenburg - HRB 220009 B
> Geschäftsführer: Peer Heinlein - Sitz: Berlin

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD_TOO_MANY_REPAIRS on random OSDs causing clients to hang

2023-04-26 Thread Joachim Kraftmayer - ceph ambassador

Hello Thomas,
I would strongly recommend that you read the messages on this mailing list regarding Ceph versions 16.2.11, 16.2.12 and 16.2.13.
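
As a quick sanity check of what is actually running, the per-daemon versions can be listed with:

$ ceph versions    # counts of mon/mgr/osd/mds daemons per running version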


Joachim

___
ceph ambassador DACH
ceph consultant since 2012

Clyso GmbH - Premier Ceph Foundation Member

https://www.clyso.com/

On 26.04.23 at 13:24, Thomas Hukkelberg wrote:

Hi all,

Over the last 2 weeks we have experienced several OSD_TOO_MANY_REPAIRS errors that we struggle to handle in a non-intrusive manner. Restarting the MDS plus the hypervisor that accessed the object in question seems to be the only way we can clear the error so we can repair the PG and recover access. Any pointers on how to handle this issue more gently than rebooting the hypervisor and failing the MDS would be welcome!
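
One gentler approach we are considering (a sketch only, not something we have verified end-to-end) would be to evict just the offending CephFS client session instead of rebooting the whole hypervisor:

$ ceph tell mds.hk-cephnode-65 session ls                  # locate the session for client_id 9534859837
$ ceph tell mds.hk-cephnode-65 client evict id=9534859837  # note: eviction blocklists the client by default
$ ceph pg repair 42.e2                                     # then retry the repair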


The problem seems to affect only one specific pool (id 42), which is used for cephfs_data. This pool is our second cephfs data pool in this cluster. The data in the pool is accessed via Samba from an LXC container that has the cephfs filesystem bind-mounted from the hypervisor.

Ceph was recently updated to version 16.2.11 (pacific) -- the kernel version is 5.13.19-6-pve on the OSD hosts/Samba containers and 5.19.17-2-pve on the MDS hosts.


The following warnings are issued:
$ ceph health detail
HEALTH_WARN 1 clients failing to respond to capability release; Too many repaired reads on 1 OSDs; Degraded data redundancy: 1/2648430090 objects degraded (0.000%), 1 pg degraded; 1 slow ops, oldest one blocked for 608 sec, osd.34 has slow ops
[WRN] MDS_CLIENT_LATE_RELEASE: 1 clients failing to respond to capability release
 mds.hk-cephnode-65(mds.0): Client hk-cephnode-56 failing to respond to capability release client_id: 9534859837
[WRN] OSD_TOO_MANY_REPAIRS: Too many repaired reads on 1 OSDs
 osd.34 had 9936 reads repaired
[WRN] PG_DEGRADED: Degraded data redundancy: 1/2648430090 objects degraded (0.000%), 1 pg degraded
 pg 42.e2 is active+recovering+degraded+repair, acting [34,275,284]
[WRN] SLOW_OPS: 1 slow ops, oldest one blocked for 608 sec, osd.34 has slow ops
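
To see what the blocked op actually is and which client it belongs to, something like this should work (sketch; daemon and client ids taken from the output above):

$ ceph daemon osd.34 dump_ops_in_flight      # run locally on the host carrying osd.34
$ ceph daemon osd.34 dump_historic_ops       # recently completed and slow ops
$ ceph tell mds.hk-cephnode-65 session ls    # match client_id 9534859837 to a host/mount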



The logs for OSD.34 are flooded with these messages:
root@hk-cephnode-53:~# tail /var/log/ceph/ceph-osd.34.log
2023-04-26T11:41:00.760+0200 7f03921f3700 -1 log_channel(cluster) log [ERR] : 42.e2 missing primary copy of 42:4703efac:::10003d86a99.0001:head, will try copies on 275,284
2023-04-26T11:41:00.784+0200 7f03921f3700 -1 log_channel(cluster) log [ERR] : 42.e2 full-object read crc 0xebd673ed != expected 0x on 42:4703efac:::10003d86a99.0001:head
2023-04-26T11:41:00.812+0200 7f03921f3700 -1 log_channel(cluster) log [ERR] : 42.e2 full-object read crc 0xebd673ed != expected 0x on 42:4703efac:::10003d86a99.0001:head
2023-04-26T11:41:00.812+0200 7f03921f3700 -1 log_channel(cluster) log [ERR] : 42.e2 missing primary copy of 42:4703efac:::10003d86a99.0001:head, will try copies on 275,284
2023-04-26T11:41:00.824+0200 7f03a821f700 -1 osd.34 1352563 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.9534859837.0:20412906 42.e2 42:4703efac:::10003d86a99.0001:head [read 0~1048576 [307@0] out=1048576b] snapc 0=[] RETRY=5 ondisk+retry+read+known_if_redirected e1352553)
2023-04-26T11:41:00.824+0200 7f03a821f700  0 log_channel(cluster) log [WRN] : 1 slow requests (by type [ 'delayed' : 1 ] most affected pool [ 'qa-cephfs_data' : 1 ])
2023-04-26T11:41:00.840+0200 7f03921f3700 -1 log_channel(cluster) log [ERR] : 42.e2 full-object read crc 0xebd673ed != expected 0x on 42:4703efac:::10003d86a99.0001:head
2023-04-26T11:41:00.864+0200 7f03921f3700 -1 log_channel(cluster) log [ERR] : 42.e2 full-object read crc 0xebd673ed != expected 0x on 42:4703efac:::10003d86a99.0001:head
2023-04-26T11:41:00.864+0200 7f03921f3700 -1 log_channel(cluster) log [ERR] : 42.e2 missing primary copy of 42:4703efac:::10003d86a99.0001:head, will try copies on 275,284
2023-04-26T11:41:00.888+0200 7f03921f3700 -1 log_channel(cluster) log [ERR] : 42.e2 full-object read crc 0xebd673ed != expected 0x on 42:4703efac:::10003d86a99.0001:head
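
Given the crc mismatches, it might also be worth asking the cluster which objects it considers inconsistent in that PG (sketch; requires scrub information for the PG to be available):

$ rados list-inconsistent-obj 42.e2 --format=json-pretty
$ ceph pg deep-scrub 42.e2    # if the command above reports that no scrub information is available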



We have tried the following:
  - Restarting the OSD in question clears the error for a few seconds, but then we also get OSD_TOO_MANY_REPAIRS on the other OSDs whose PGs hold the object with blocked I/O.

  - Trying to repair the PG seems to restart every 10 seconds without actually making any progress. (Is there a way to check repair progress? See the sketch after this list.)

  - Restarting the MDS and the hypervisor clears the error (the hypervisor hangs for several minutes before timing out). However, if the object is requested again the error reoccurs. If we don't access the object we are eventually able to repair the PG.

  - Occasionally, setting the primary-affinity to 0 for the primary OSD in the PG clears the error after restarting all affected OSDs; we are then able to repair the PG (unless the object is accessed during recovery), and access to the object is OK afterwards.

  - Finding and d
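
As for checking repair progress (the question in the second item above), the closest we know of is watching the PG state while the repair runs (sketch):

$ ceph pg 42.e2 query | grep '"state"'      # shows e.g. ...+repair while a repair is in progress
$ ceph pg dump pgs_brief | grep '^42.e2'    # one-line view of the PG state and acting set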

[ceph-users] Re: OSD_TOO_MANY_REPAIRS on random OSDs causing clients to hang

2023-04-26 Thread Robert Sander

On 26.04.23 13:24, Thomas Hukkelberg wrote:


[WRN] OSD_TOO_MANY_REPAIRS: Too many repaired reads on 1 OSDs
 osd.34 had 9936 reads repaired


Are there any messages in the kernel log that indicate this device has 
read errors? Have you considered replacing the disk?


Regards
--
Robert Sander
Heinlein Consulting GmbH
Schwedter Str. 8/9b, 10119 Berlin

https://www.heinlein-support.de

Tel: 030 / 405051-43
Fax: 030 / 405051-19

Amtsgericht Berlin-Charlottenburg - HRB 220009 B
Geschäftsführer: Peer Heinlein - Sitz: Berlin
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io