We ship all of this (and a lot more) to our centralized monitoring system and 
have dashboards, proactive monitoring, and alerting across 100PiB+ of Ceph. If 
you're running Ceph in production, I believe host-level monitoring is critical, 
above and beyond Ceph-level monitoring. Things like inlet/outlet temperature, 
the hardware state of various components, and similar details are probably 
best served by monitoring external to Ceph itself.
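
If you're starting from scratch, something along these lines is the general 
idea. This is just a rough sketch, not our actual tooling: the metric name is 
made up for illustration, and ipmitool's column layout varies by vendor, so 
verify it against your own hardware.

    #!/usr/bin/env python3
    # Rough sketch only: read chassis temperature sensors with ipmitool and
    # emit them in Prometheus node_exporter textfile-collector format.
    # The metric name is made up and the ipmitool output layout varies by
    # vendor, so check this against your hardware before relying on it.
    import re
    import subprocess

    def ipmi_temperatures():
        out = subprocess.run(["ipmitool", "sdr", "type", "Temperature"],
                             capture_output=True, text=True).stdout
        for line in out.splitlines():
            # A typical line: "Inlet Temp | 04h | ok | 7.1 | 25 degrees C"
            fields = [f.strip() for f in line.split("|")]
            if len(fields) < 5:
                continue
            match = re.match(r"(-?\d+)", fields[4])
            if match:
                yield fields[0], int(match.group(1))

    if __name__ == "__main__":
        for sensor, celsius in ipmi_temperatures():
            # Write these lines to a .prom file for node_exporter's
            # textfile collector, or ship them however you prefer.
            print(f'host_ipmi_temperature_celsius{{sensor="{sensor}"}} {celsius}')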

I took a quick look and didn't see this data (OSD read/write errors) exposed 
by the Pacific version of Ceph's Prometheus-style exporter, but I may have 
overlooked it. It would be nice to have if it doesn't already exist.
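
If anyone wants to check their own cluster, a quick way is to pull the mgr 
prometheus endpoint and look at the metric names. This assumes the prometheus 
module is enabled on its default port (9283); I'm not claiming any particular 
metric names exist, it just lists whatever mentions "error":

    #!/usr/bin/env python3
    # Quick check of what the ceph-mgr prometheus module currently exposes.
    # Assumes the module is enabled on its default port (9283); point URL
    # at an active mgr in your cluster.
    import re
    import urllib.request

    URL = "http://localhost:9283/metrics"

    with urllib.request.urlopen(URL) as resp:
        text = resp.read().decode()

    # Distinct metric names containing "error", one per line.
    names = sorted({m.group(1)
                    for m in re.finditer(r"^([A-Za-z_:][A-Za-z0-9_:]*)", text, re.M)
                    if "error" in m.group(1).lower()})
    for name in names:
        print(name)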

We collect drive counters at the host level and alert at thresholds below the 
point of general impact. A failing drive can cause frustrating latency spikes 
before it starts returning errors to the host; while it's still hitting 
correctable errors, the OSD will not see anything other than longer latency on 
operations. Watching the SMART counters for changes at a high rate, or above 
thresholds you define, is most certainly something I would suggest covering in 
whatever host-level monitoring you're already performing for production usage.
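
As a rough sketch of the rate/threshold idea (again, not our production 
tooling): the field names below follow smartctl --json output for SATA drives, 
NVMe and SAS expose different structures, and the attribute IDs and thresholds 
are illustrative only.

    #!/usr/bin/env python3
    # Rough sketch: compare SMART raw counters against the previous run and
    # flag drives whose counters exceed a threshold or are climbing quickly.
    # Field names follow `smartctl --json` for SATA; adjust for NVMe/SAS.
    # Attribute IDs and thresholds are examples, not recommendations.
    import json
    import subprocess
    import sys

    WATCHED = {5: "Reallocated_Sector_Ct", 187: "Reported_Uncorrect",
               197: "Current_Pending_Sector", 199: "UDMA_CRC_Error_Count"}
    ABS_THRESHOLD = 10    # alert if a raw value exceeds this
    RATE_THRESHOLD = 5    # alert if it grew by more than this since last run
    STATE_PATH = "/var/tmp/smart_prev.json"

    def read_counters(dev):
        out = subprocess.run(["smartctl", "-a", "-j", dev],
                             capture_output=True, text=True).stdout
        table = json.loads(out).get("ata_smart_attributes", {}).get("table", [])
        return {a["id"]: a["raw"]["value"] for a in table if a["id"] in WATCHED}

    if __name__ == "__main__":
        dev = sys.argv[1]                   # e.g. /dev/sda
        try:
            with open(STATE_PATH) as f:
                state = json.load(f)
        except (FileNotFoundError, ValueError):
            state = {}
        previous = {int(k): v for k, v in state.get(dev, {}).items()}

        current = read_counters(dev)
        for attr_id, value in current.items():
            delta = value - previous.get(attr_id, 0)
            if value > ABS_THRESHOLD or delta > RATE_THRESHOLD:
                print(f"ALERT {dev} {WATCHED[attr_id]}: raw={value} delta={delta}")

        state[dev] = current
        with open(STATE_PATH, "w") as f:
            json.dump(state, f)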

David

On Mon, Jan 9, 2023, at 17:46, Erik Lindahl wrote:
> Hi,
> 
> Good points; however, given that Ceph already collects all these statistics, 
> isn't there some way to set reasonable thresholds and actually have Ceph 
> detect the number of read errors and suggest that a given drive should be 
> replaced?
> 
> It seems a bit strange that we should all have to wait for a PG read error, 
> then log into each node to check the number of read errors for each device, 
> and keep track of this ourselves. Of course it's possible to write scripts 
> for everything, but there must be numerous Ceph sites with hundreds of OSD 
> nodes, so I'm a bit surprised this isn't more automated...
> 
> Cheers,
> 
> Erik
> 
> --
> Erik Lindahl <erik.lind...@gmail.com>
> On 10 Jan 2023 at 00:09 +0100, Anthony D'Atri <a...@dreamsnake.net>, wrote:
> >
> >
> > > On Jan 9, 2023, at 17:46, David Orman <orma...@corenode.com> wrote:
> > >
> > > It's important to note that we do not suggest treating the SMART "OK" 
> > > status as an indication that a drive is healthy. We monitor 
> > > correctable/uncorrectable error counts, as you can see a dramatic rise 
> > > when a drive starts to fail. 'OK' will be reported for SMART health long 
> > > after the drive is throwing many uncorrectable errors and needs 
> > > replacement. You have to look at the actual counters themselves.
> >
> > I strongly agree, especially given personal experience with SSD firmware 
> > design flaws.
> >
> > Also, examining UDMA / CRC error rates led to the discovery that certain 
> > aftermarket drive carriers had lower tolerances than those from the chassis 
> > vendor, resulting in drives that were silently slow. Reseating in most 
> > cases restored performance.
> >
> > — aad
> >
> 
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
