Everything else I would be looking at would already show a link speed failure. In the two cases of network shenanigans I've had that effectively broke Ceph, the link speed was always correct. That leads me to distrust link speed as a reliable source of truth. It's also testing a proxy for what you care about, not the thing itself. Put another way, I don't care what speed the link negotiates, I care how fast it actually moves packets.
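To make the negotiated-versus-actual distinction concrete, here is a rough sketch, assuming a Linux host that exposes the usual sysfs files (`/sys/class/net/<iface>/speed` and `/sys/class/net/<iface>/statistics/{rx,tx}_bytes`). The function names and the `eth0` interface in the usage comment are made up for illustration. Note that byte counters only show what the link is moving right now, not what it could move; that would need an active test between two hosts.

```python
import time
from pathlib import Path

# Assumption: Linux sysfs layout; other platforms expose these differently.
SYSFS_NET = Path("/sys/class/net")


def read_counter(iface, name, root=SYSFS_NET):
    """Read one integer counter (e.g. rx_bytes) for an interface."""
    return int((root / iface / "statistics" / name).read_text())


def read_link_speed_mbps(iface, root=SYSFS_NET):
    """Negotiated link speed in Mb/s as reported by the driver.

    Some virtual or down interfaces report -1 or fail to read at all.
    """
    return int((root / iface / "speed").read_text())


def throughput_mbps(bytes_before, bytes_after, interval_s):
    """Convert a byte-counter delta over interval_s seconds into Mb/s."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return (bytes_after - bytes_before) * 8 / interval_s / 1_000_000


def sample_throughput(iface, interval_s=1.0, root=SYSFS_NET):
    """Return (rx_mbps, tx_mbps) actually moved during one interval."""
    rx0 = read_counter(iface, "rx_bytes", root)
    tx0 = read_counter(iface, "tx_bytes", root)
    time.sleep(interval_s)
    rx1 = read_counter(iface, "rx_bytes", root)
    tx1 = read_counter(iface, "tx_bytes", root)
    return (throughput_mbps(rx0, rx1, interval_s),
            throughput_mbps(tx0, tx1, interval_s))


# Hypothetical usage on a host with an interface named eth0:
#   negotiated = read_link_speed_mbps("eth0")
#   rx, tx = sample_throughput("eth0", interval_s=5.0)
#   print(f"negotiated {negotiated} Mb/s, moving rx={rx:.1f} tx={tx:.1f} Mb/s")
```

A link can report the correct negotiated speed and still move far less than that under load, which is the failure mode the negotiated number alone would miss.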
Link usage seems like a much more interesting metric to add, but I would be concerned about generating a lot of false positives. If my network is usually 20% utilized, but then spikes to 100% for a while because of some legitimate activity, I don't want to get an alarm for that. Maybe a rule with a very long normalization period would work, so it only alarms if the link is pegged for multiple hours. Even then, I would expect other problems to be evident already because of some pathological state on the network. I don't care if my network is being fully utilized, I care if it is being utilized to the point that it's causing IO wait in VMs, and looking at utilization alone won't tell me that. On the other hand, it could be a good canary for anomaly detection if your normal state leaves a lot of headroom. I'm torn on this one.

Also, while we're on the subject, if anyone isn't doing any kind of metric collection on their Ceph networks, I highly recommend installing Ganglia. It's dead simple to get going and creates all sorts of useful system-level graphs that are helpful in locating and identifying trends and problems.

QH

On Mon, Aug 3, 2015 at 9:45 AM, Antonio Messina <antonio.mess...@uzh.ch> wrote:

> On Mon, Aug 3, 2015 at 5:10 PM, Quentin Hartman
> <qhart...@direwolfdigital.com> wrote:
> > The problem with this kind of monitoring is that there are so many possible
> > metrics to watch and so many possible ways to watch them. For myself, I'm
> > working on implementing a couple of things:
> > - Watching error counters on servers
> > - Watching error counters on switches
> > - Watching performance
>
> I would also check:
>
> - link speed (on both servers and switches)
> - link usage (over 80% issue a warning)
>
> .a.
>
> --
> antonio.mess...@uzh.ch
> S3IT: Services and Support for Science IT   http://www.s3it.uzh.ch/
> University of Zurich                        Y12 F 84
> Winterthurerstrasse 190
> CH-8057 Zurich Switzerland
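For the long-window utilization rule discussed above (only alarm if the link stays pegged for hours), one possible shape is sketched below. This is a minimal illustration, not a tested monitoring check: the function name, the 95% threshold, and the two-hour window are all assumptions chosen for the example, and the samples are assumed to arrive in time order from whatever collector is in use.

```python
def pegged_long_enough(samples, threshold_pct=95.0, min_duration_s=2 * 3600):
    """Return True if the link has been continuously at or above
    threshold_pct for at least min_duration_s, up to the latest sample.

    samples: iterable of (timestamp_s, utilization_pct), in time order.
    threshold_pct and min_duration_s are illustrative defaults only.
    """
    run_start = None  # timestamp where the current saturated run began
    last_ts = None    # timestamp of the most recent sample seen
    for ts, util in samples:
        if util >= threshold_pct:
            if run_start is None:
                run_start = ts
        else:
            # Any sample below the threshold resets the run, so short
            # legitimate bursts never accumulate into an alarm.
            run_start = None
        last_ts = ts
    return run_start is not None and last_ts - run_start >= min_duration_s


# Hypothetical usage with 10-minute samples:
#   burst = [(0, 20.0), (600, 100.0), (1200, 100.0), (1800, 20.0)]
#   pegged_long_enough(burst)   # a 10-20 minute burst does not alarm
```

Resetting on the first sample below the threshold is deliberately strict. A real rule might tolerate brief dips, at the cost of being slower to clear.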
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com