[ceph-users] Re: [ERR] OSD_SCRUB_ERRORS: 2 scrub errors

2023-01-11 Thread Konstantin Shalygin
Hi,
> On 10 Jan 2023, at 07:10, David Orman wrote:
> We ship all of this to our centralized monitoring system (and a lot more) and have dashboards/proactive monitoring/alerting with 100PiB+ of Ceph. If you're running Ceph in production, I believe host-level monitoring is critical,

[ceph-users] Re: [ERR] OSD_SCRUB_ERRORS: 2 scrub errors

2023-01-09 Thread David Orman
We ship all of this to our centralized monitoring system (and a lot more) and have dashboards/proactive monitoring/alerting with 100PiB+ of Ceph. If you're running Ceph in production, I believe host-level monitoring is critical, above and beyond Ceph level. Things like inlet/outlet temperature,
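As a rough illustration of the host-level collection described above, here is a minimal Python sketch (not from the original mail) that pulls per-drive error counters out of smartctl's JSON output and writes them in Prometheus textfile format. The device glob, the metric name and the output path are assumptions for the example.

#!/usr/bin/env python3
# Hedged sketch: export per-drive SMART error counters for a textfile collector.
import glob
import json
import subprocess

TEXTFILE = "/var/lib/node_exporter/textfile/smart_errors.prom"  # assumed path

def smart_json(dev):
    # smartctl >= 7.0 can emit JSON with -j; -x includes the extended logs
    out = subprocess.run(["smartctl", "-x", "-j", dev],
                         capture_output=True, text=True)
    return json.loads(out.stdout or "{}")

def error_counters(data):
    # SAS drives report corrected/uncorrected totals per operation ...
    counters = {}
    for op, stats in data.get("scsi_error_counter_log", {}).items():
        counters[f"{op}_uncorrected"] = stats.get("total_uncorrected_errors", 0)
        counters[f"{op}_corrected"] = stats.get("total_errors_corrected", 0)
    # ... while SATA drives expose comparable counters as SMART attributes
    for attr in data.get("ata_smart_attributes", {}).get("table", []):
        if attr.get("name") in ("Reported_Uncorrect", "UDMA_CRC_Error_Count",
                                "Reallocated_Sector_Ct"):
            counters[attr["name"].lower()] = attr["raw"]["value"]
    return counters

lines = []
for dev in sorted(glob.glob("/dev/sd?")):
    for name, value in error_counters(smart_json(dev)).items():
        lines.append(f'smart_error_count{{device="{dev}",counter="{name}"}} {value}')

with open(TEXTFILE, "w") as fh:
    fh.write("\n".join(lines) + "\n")

Run from cron or a systemd timer on each host, the resulting series make the kind of per-drive dashboards and alerts described above straightforward.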

[ceph-users] Re: [ERR] OSD_SCRUB_ERRORS: 2 scrub errors

2023-01-09 Thread Erik Lindahl
Hi, Good points; however, given that Ceph already collects all these statistics, isn't there any way to set (?) reasonable thresholds and actually have Ceph detect the number of read errors and suggest that a given drive should be replaced? It seems a bit strange that we all should have to
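Ceph's devicehealth manager module does scrape SMART data from the OSD hosts, so a threshold check can be layered on top of what it already stores. A minimal sketch along those lines; the JSON layouts of 'ceph device ls' and 'ceph device get-health-metrics' are assumed to be roughly as commented below, and the threshold is arbitrary.

#!/usr/bin/env python3
# Hedged sketch: flag drives with high uncorrected-error counts using the
# health metrics Ceph already collects. Field names and the threshold are
# assumptions for illustration.
import json
import subprocess

UNCORRECTED_THRESHOLD = 10  # assumed limit before suggesting replacement

def ceph(*args):
    out = subprocess.run(["ceph", *args, "--format", "json"],
                         capture_output=True, text=True, check=True)
    return json.loads(out.stdout)

for dev in ceph("device", "ls"):
    metrics = ceph("device", "get-health-metrics", dev["devid"])
    if not metrics:
        continue
    latest = metrics[max(metrics)]  # assumed: entries keyed by scrape timestamp
    uncorrected = sum(op.get("total_uncorrected_errors", 0)
                      for op in latest.get("scsi_error_counter_log", {}).values())
    if uncorrected > UNCORRECTED_THRESHOLD:
        daemons = ",".join(dev.get("daemons", []))
        print(f"{dev['devid']} ({daemons}): {uncorrected} uncorrected errors, "
              f"consider replacing this drive")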

[ceph-users] Re: [ERR] OSD_SCRUB_ERRORS: 2 scrub errors

2023-01-09 Thread Anthony D'Atri
> On Jan 9, 2023, at 17:46, David Orman wrote:
> It's important to note we do not suggest using the SMART "OK" indicator as the drive being valid. We monitor correctable/uncorrectable error counts, as you can see a dramatic rise when the drives start to fail. 'OK' will be reported

[ceph-users] Re: [ERR] OSD_SCRUB_ERRORS: 2 scrub errors

2023-01-09 Thread David Orman
It's important to note we do not suggest using the SMART "OK" indicator as the drive being valid. We monitor correctable/uncorrectable error counts, as you can see a dramatic rise when the drives start to fail. 'OK' will be reported for SMART health long after the drive is throwing many
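To act on that "dramatic rise" rather than the pass/fail verdict, one option is to alert on the delta of the uncorrected-error counters between runs. A small sketch of that idea; the state-file path, the device glob and the alert threshold are assumptions.

#!/usr/bin/env python3
# Hedged sketch: alert when uncorrected-error counters *grow*, independent of
# the overall SMART health verdict (which may still say PASSED/OK).
import glob
import json
import pathlib
import subprocess

STATE = pathlib.Path("/var/tmp/smart_error_state.json")  # assumed location
DELTA_ALERT = 5  # assumed: alert if a drive gained more than this since last run

def uncorrected(dev):
    data = json.loads(subprocess.run(["smartctl", "-x", "-j", dev],
                                     capture_output=True, text=True).stdout or "{}")
    return sum(op.get("total_uncorrected_errors", 0)
               for op in data.get("scsi_error_counter_log", {}).values())

previous = json.loads(STATE.read_text()) if STATE.exists() else {}
current = {dev: uncorrected(dev) for dev in sorted(glob.glob("/dev/sd?"))}

for dev, count in current.items():
    delta = count - previous.get(dev, 0)
    if delta > DELTA_ALERT:
        print(f"ALERT: {dev} gained {delta} uncorrected errors since the last run")

STATE.write_text(json.dumps(current))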

[ceph-users] Re: [ERR] OSD_SCRUB_ERRORS: 2 scrub errors

2023-01-09 Thread Erik Lindahl
Hi, We too kept seeing this until a few months ago in a cluster with ~400 HDDs, while all the drive SMART statistics were always A-OK. Since we use erasure coding, each PG involves up to 10 HDDs. It took us a while to realize we shouldn't expect scrub errors on healthy drives, but eventually we

[ceph-users] Re: [ERR] OSD_SCRUB_ERRORS: 2 scrub errors

2023-01-09 Thread David Orman
"dmesg" on all the linux hosts and look for signs of failing drives. Look at smart data, your HBAs/disk controllers, OOB management logs, and so forth. If you're seeing scrub errors, it's probably a bad disk backing an OSD or OSDs. Is there a common OSD in the PGs you've run the repairs on? On

[ceph-users] Re: [ERR] OSD_SCRUB_ERRORS: 2 scrub errors

2023-01-09 Thread Kuhring, Mathias
Hey all, I'd like to pick up on this topic, since we have also been seeing regular scrub errors recently. Roughly one per week for around six weeks now. It's always a different PG and the repair command always helps after a while. But the regular recurrence seems a bit unsettling. How to best

[ceph-users] Re: [ERR] OSD_SCRUB_ERRORS: 2 scrub errors

2021-04-01 Thread Szabo, Istvan (Agoda)
Forgot the bare-minimum entries after the scrub is done:
2021-04-01T11:37:43.559539+0700 osd.39 (osd.39) 50 : cluster [DBG] 20.19 repair starts
2021-04-01T11:37:43.889909+0700 osd.39 (osd.39) 51 : cluster [ERR] 20.19 soid 20:990258ea:::.dir.9213182a-14ba-48ad-bde9-289a1c0c0de8.17263260.1.237:head
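For a PG like 20.19 above, 'rados list-inconsistent-obj' reports which replica of the flagged object the scrub disagreed about. A short sketch; the JSON field names ('inconsistents', 'shards', 'errors') reflect my understanding of that output, so treat it as illustrative.

#!/usr/bin/env python3
# Hedged sketch: show which shard/OSD holds the bad copy of the objects a
# deep-scrub flagged in a given PG, before or after 'ceph pg repair'.
import json
import subprocess

PGID = "20.19"  # PG from the repair log above

out = subprocess.run(["rados", "list-inconsistent-obj", PGID, "--format=json"],
                     capture_output=True, text=True).stdout
report = json.loads(out or '{"inconsistents": []}')

for item in report.get("inconsistents", []):
    name = item.get("object", {}).get("name", "?")
    print(f"object {name}: errors={item.get('errors', [])}")
    for shard in item.get("shards", []):
        print(f"  osd.{shard.get('osd')} errors={shard.get('errors', [])}")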