I've been graphing disk latency, OSD
latency, and RGW latency. It's a bit tricky to pull out of
"ceph --admin-daemon ceph-osd.0.asok perf dump" though. perf dump
only gives you running totals: the total ops and the total op time.
You have to track the delta of those two values between samples,
then divide the time delta by the ops delta to get the average
latency over your sample interval.
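A minimal sketch of that delta calculation in Python. The function name and the (total_ops, total_op_time) tuple layout are made up for illustration; how you extract those two numbers from the perf dump output depends on which counter you graph and on your Ceph version, so that part is left out.

```python
def interval_latency(prev, curr):
    """Average latency between two samples.

    prev and curr are (total_ops, total_op_time) pairs taken from two
    consecutive perf dump samples. This layout is an assumption for the
    sketch, not the perf dump format itself.
    """
    delta_ops = curr[0] - prev[0]
    delta_time = curr[1] - prev[1]

    # No ops in the interval, or the counters went backwards
    # (for example after a daemon restart): no meaningful average.
    if delta_ops <= 0 or delta_time < 0:
        return None

    return delta_time / delta_ops


# Example: 500 ops took 7.5 seconds in total during the interval.
previous_sample = (1000, 12.5)
current_sample = (1500, 20.0)
latency = interval_latency(previous_sample, current_sample)
```

Returning None for an idle interval or a counter reset keeps those samples out of the graph instead of plotting a bogus zero or a negative spike.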
I had some alerting on those values, but it was too noisy. The graphs are helpful though, especially the per-node graphs that put all of a single node's disks on one graph and all of its OSDs on a second. Viewing both graphs helped me identify several problems, including a failing disk and a bad write cache battery. I'm not getting much out of the RGW latency graph though. It's pretty much just the sum of all the OSD latency graphs over the same sample interval.