On 12/04/2024 at 12:56:12+0200, Frédéric Nass wrote
> 
Hi, 

> 
> Have you checked the hardware status of the involved drives other than with 
> smartctl? For example with the manufacturer's tools / WebUI (iDRAC / perccli 
> for Dell hardware).

Yes, all my disks are «under» periodic checks with smartctl + icinga. 

> If these tools don't report any media error (that is, bad blocks on disks) then 
> you might just be facing the bit rot phenomenon. But this is very rare and 
> should happen in a sysadmin's lifetime as often as a Royal Flush hand in a 
> professional poker player's lifetime. ;-)
> 
> If no media error is reported, then you might want to check and update the 
> firmware of all drives.

You're perfectly right. 

It's just a newbie error: I checked the «main» OSD of the PG (meaning the
first in the list) but forgot to check the others. 
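
To avoid looking only at the primary OSD again, here is a minimal Python
sketch that lists every OSD whose shard reports an error. It assumes the
input is the JSON printed by `rados list-inconsistent-obj <pgid>
--format=json`, and that this JSON has an "inconsistents" list whose entries
carry a "shards" list with "osd" and "errors" fields. The function name and
the sample data are hypothetical (the sample is hand-written, not real
cluster output).

```python
import json


def osds_with_shard_errors(report_json: str) -> dict:
    """Map OSD id -> list of shard errors, for every shard (not only the primary)."""
    report = json.loads(report_json)
    bad = {}
    for obj in report.get("inconsistents", []):
        for shard in obj.get("shards", []):
            errors = shard.get("errors", [])
            if errors:
                # Collect errors per OSD, whatever its position in the acting set.
                bad.setdefault(shard["osd"], []).extend(errors)
    return bad


# Hand-written sample, shaped like the assumed layout described above.
SAMPLE = json.dumps({
    "epoch": 1,
    "inconsistents": [{
        "object": {"name": "obj1"},
        "errors": [],
        "union_shard_errors": ["read_error"],
        "shards": [
            {"osd": 2, "primary": True, "errors": []},
            {"osd": 5, "primary": False, "errors": ["read_error"]},
        ],
    }],
})

if __name__ == "__main__":
    print(osds_with_shard_errors(SAMPLE))
```

In this sample the primary (osd 2) is clean and the error sits on a
secondary (osd 5), which is the situation I missed.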

On one server I do indeed get some errors on a disk.

But strangely, smartctl reports nothing. I will add a check with dmesg. 
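
A rough sketch of what such a dmesg check could look like in Python. The
patterns are an assumption about how the kernel words disk errors ("I/O
error", "Medium Error", "Unrecovered read error"); they are not an
exhaustive list and would need tuning for a given kernel and controller.
The function name and the sample lines are hypothetical.

```python
import re

# Assumed error wordings; adjust to what your kernel actually logs.
ERROR = re.compile(r"I/O error|medium error|unrecovered read error", re.IGNORECASE)
# Device name either as "dev sdc" or as "[sdc]" in SCSI messages.
DEVICE = re.compile(r"\bdev (\w+)|\[(sd\w+)\]")


def disk_errors(dmesg_text: str) -> dict:
    """Count suspicious kernel log lines per block device."""
    hits = {}
    for line in dmesg_text.splitlines():
        if not ERROR.search(line):
            continue
        match = DEVICE.search(line)
        dev = (match.group(1) or match.group(2)) if match else "unknown"
        hits[dev] = hits.get(dev, 0) + 1
    return hits


# Hand-written sample lines, not taken from a real server.
SAMPLE = """\
[  100.000000] EXT4-fs (sda1): mounted filesystem
[  123.400000] blk_update_request: I/O error, dev sdc, sector 1234
[  123.500000] sd 0:0:2:0: [sdc] tag#0 Sense Key : Medium Error [current]
"""

if __name__ == "__main__":
    print(disk_errors(SAMPLE))
```

In a real check the text would come from the output of `dmesg`, and a
non-empty result would be turned into a WARNING or CRITICAL state for
icinga.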

> 
> Once you've figured it out, you may enable osd_scrub_auto_repair=true to have 
> these inconsistencies repaired automatically on deep-scrubbing, but make sure 
> you're using the alert module [1] so as to at least get informed about the 
> scrub errors.
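
For reference, a small Python sketch of the two steps suggested above:
setting osd_scrub_auto_repair for all OSDs and enabling the alerts manager
module. The helper names are hypothetical, and by default the script only
prints the commands (dry run) instead of running them. The alerts module
also needs its SMTP settings configured before it can send anything, which
is not shown here.

```python
import subprocess


def scrub_repair_commands() -> list:
    """Commands to enable automatic scrub repair and the alerts mgr module."""
    return [
        ["ceph", "config", "set", "osd", "osd_scrub_auto_repair", "true"],
        ["ceph", "mgr", "module", "enable", "alerts"],
    ]


def apply(commands: list, dry_run: bool = True) -> None:
    """Print each command, or run it when dry_run is False."""
    for argv in commands:
        if dry_run:
            print(" ".join(argv))
        else:
            # check=True raises CalledProcessError if the ceph CLI fails.
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    apply(scrub_repair_commands())
```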

Thanks. I will look into it; we already have icinga2 on site, so I use
icinga2 to check the cluster. 

Is there a list of what the alert module checks?


Regards

JAS
-- 
Albert SHIH 🦫 🐸
France
Local time:
Fri. 12 April 2024 15:13:13 CEST
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
