Hi,
> On 10 Jan 2023, at 07:10, David Orman wrote:
>
> We ship all of this to our centralized monitoring system (and a lot more) and
> have dashboards/proactive monitoring/alerting with 100PiB+ of Ceph. If you're
> running Ceph in production, I believe host-level monitoring is critical,
>
We ship all of this to our centralized monitoring system (and a lot more) and
have dashboards/proactive monitoring/alerting with 100PiB+ of Ceph. If you're
running Ceph in production, I believe host-level monitoring is critical, above
and beyond the Ceph level. Things like inlet/outlet temperature,
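
For the temperature side of this, for illustration only (not necessarily the setup described above), a BMC query is usually all that's needed. A minimal sketch, assuming ipmitool is installed and the numbers are handed to whatever agent ships metrics off the host:

  ipmitool sdr type Temperature   # inlet/outlet/CPU temperatures as reported by the BMC
  ipmitool sensor                 # full sensor dump, including fan speeds and voltages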
Hi,
Good points; however, given that Ceph already collects all of these statistics,
isn't there any way to set reasonable thresholds and actually have Ceph detect
the number of read errors and suggest that a given drive should be replaced?
It seems a bit strange that we all should have to build this monitoring
ourselves.
> On Jan 9, 2023, at 17:46, David Orman wrote:
>
> It's important to note we do not suggest using the SMART "OK" indicator as
> a sign that the drive is healthy. We monitor correctable/uncorrectable error
> counts, as you can see a dramatic rise when the drives start to fail. 'OK'
> will be reported
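
Regarding the question above about Ceph detecting this itself: recent releases do ship a mgr "devicehealth" module that scrapes SMART data from the hosts and can raise health warnings from it. A minimal sketch of poking at it (the device id is a placeholder; use whatever "ceph device ls" prints):

  ceph device monitoring on                             # have the mgr scrape SMART data periodically
  ceph device ls                                        # device ids and which daemon uses each drive
  ceph device get-health-metrics <devid>                # raw SMART records collected for one device
  ceph config set mgr mgr/devicehealth/self_heal true   # optionally let Ceph mark failing devices out

Whether the built-in warnings trigger early enough to catch the rising error counts discussed below is a separate question.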
It's important to note we do not suggest using the SMART "OK" indicator as a
sign that the drive is healthy. We monitor correctable/uncorrectable error
counts, as you can see a dramatic rise when the drives start to fail. 'OK' will
be reported for SMART health long after the drive is throwing many errors.
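
As a concrete illustration of those counters (device paths are placeholders, and the exact fields differ between SATA, SAS and NVMe drives):

  smartctl -A /dev/sda        # SATA: attributes such as Reallocated_Sector_Ct, Reported_Uncorrect
  smartctl -l error /dev/sda  # SATA: the drive's ATA error log
  smartctl -a /dev/sdb        # SAS: the "Error counter log" table, corrected vs uncorrected read/write errors
  smartctl -a /dev/nvme0      # NVMe: "Media and Data Integrity Errors" in the SMART/Health log

It is these per-drive error counters, not the overall PASSED/OK verdict, that tend to climb before a failure.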
Hi,
We too kept seeing this until a few months ago in a cluster with ~400 HDDs,
while all the drives' SMART statistics were always A-OK. Since we use erasure
coding, each PG involves up to 10 HDDs.
It took us a while to realize we shouldn't expect scrub errors on healthy
drives, but eventually we tracked the errors down to failing drives.
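
To make that concrete, the EC profile and the PG-to-OSD mapping can be checked directly; a small sketch with placeholder names:

  ceph osd erasure-code-profile get <profile>   # k and m, e.g. k=8 m=2 means 10 OSDs per PG
  ceph pg map <pgid>                            # the up/acting OSD set behind one PG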
"dmesg" on all the linux hosts and look for signs of failing drives. Look at
smart data, your HBAs/disk controllers, OOB management logs, and so forth. If
you're seeing scrub errors, it's probably a bad disk backing an OSD or OSDs.
Is there a common OSD in the PGs you've run the repairs on?
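
A rough sketch of that triage, assuming the inconsistent PGs are still visible in "ceph health detail" and using placeholder ids:

  dmesg -T | grep -iE 'i/o error|medium error|uncorrect'   # kernel-level signs of a dying disk, per host
  ceph health detail | grep inconsistent                    # which PGs are currently flagged
  ceph pg map <pgid>                                        # up/acting OSD set; a repeat offender points at its disk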
Hey all,
I'd like to pick up on this topic, since we have also been seeing regular
scrub errors recently: roughly one per week for around six weeks now.
It's always a different PG, and the repair command always helps after a
while, but the regular recurrence seems a bit unsettling.
How best to investigate this?
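
For reference, one way to look at such a PG before (or instead of) blindly repairing it, with placeholder pool and PG names:

  rados list-inconsistent-pg <pool>                           # PGs in that pool with scrub errors
  rados list-inconsistent-obj <pgid> --format=json-pretty     # which object/shard and which OSD reported the error
  ceph pg repair <pgid>                                       # once it's clear which copy is bad

The JSON from list-inconsistent-obj usually names the OSD whose shard had the read error, which is the disk worth checking first.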
I forgot to include the bare-minimum log entries from after the scrub was done:
2021-04-01T11:37:43.559539+0700 osd.39 (osd.39) 50 : cluster [DBG] 20.19 repair starts
2021-04-01T11:37:43.889909+0700 osd.39 (osd.39) 51 : cluster [ERR] 20.19 soid 20:990258ea:::.dir.9213182a-14ba-48ad-bde9-289a1c0c0de8.17263260.1.237:head
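
Since the cluster log records which OSD reported each error, one quick way to look for a common culprit across weeks of these (on a mon host, assuming the default log path) is:

  grep -E '\[ERR\].*(scrub|repair|soid)' /var/log/ceph/ceph.log | awk '{print $2}' | sort | uniq -c | sort -rn

Note the reporting OSD is the PG's primary, so it is not necessarily the one holding the bad disk; rados list-inconsistent-obj on the affected PG shows which shard actually failed.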