All of the other things that I would be looking at would show a link speed
failure. In the two cases of network shenanigans I've had that effectively
broke ceph, the link speed was always correct. That leads me to distrust
link speed as a reliable source of truth. It's also testing a proxy for
what you actually care about, not the thing itself. Put another way, I
don't care what speed the link negotiates, I care how fast it actually
moves packets.
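
If you want to measure that directly, iperf is the right tool, but even a
dumb probe tells you more than the negotiated speed. Here's a rough sketch
(port, chunk size, and duration are just placeholder assumptions, not
anything blessed):

    #!/usr/bin/env python3
    """Rough throughput probe between two hosts -- a minimal sketch, not a
    replacement for iperf. Port, chunk size, and duration are assumptions."""
    import socket, sys, time

    PAYLOAD = b"x" * 65536        # 64 KiB chunks
    DURATION = 10                 # seconds to send

    def serve(port=5001):
        # Receiver: accept one connection and count bytes until the peer closes.
        with socket.socket() as srv:
            srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            srv.bind(("", port))
            srv.listen(1)
            conn, addr = srv.accept()
            total, start = 0, time.time()
            while True:
                data = conn.recv(65536)
                if not data:
                    break
                total += len(data)
            elapsed = time.time() - start
            print(f"{addr[0]}: {total * 8 / elapsed / 1e9:.2f} Gbit/s")

    def send(host, port=5001):
        # Sender: push data as fast as possible for DURATION seconds.
        with socket.create_connection((host, port)) as conn:
            end = time.time() + DURATION
            while time.time() < end:
                conn.sendall(PAYLOAD)

    if __name__ == "__main__":
        serve() if sys.argv[1] == "server" else send(sys.argv[2])

Run it as "server" on one OSD node and "client <host>" on another; if the
number it prints is way under what the link claims to be, that's the thing
worth alarming on.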

Link usage seems like a much more interesting metric to add to me, but I
would be concerned about generating a lot of false positives. If my network
is usually 20% utilized, but then spikes to 100% for a while because of
some legit activity, I don't want to get an alarm for that. Maybe having a
rule with a very long normalization period, so it only alarms if the link
is pegged for multiple hours, would help. But again, I would think that in
that case there would be other problems that are evident because of some
pathological state on the network. Again, I don't care if my network is
being fully utilized; I care if the network is being utilized to the point
that it's causing IO wait in VMs. Looking at utilization alone won't tell
me that. But it could be a good canary for anomaly detection if your normal
state leaves a lot of headroom. I'm torn on this one.
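
For what it's worth, a dumb version of that "pegged for hours" rule is easy
to sketch. Something like the following, where the interface name, link
speed, thresholds, and sustain window are all assumptions you'd tune for
your own gear:

    #!/usr/bin/env python3
    """Sketch of the 'only alarm if it's pegged for hours' idea. Interface,
    link speed, and thresholds are assumptions; tune them for your network."""
    import time

    IFACE = "eth2"                 # hypothetical cluster-network interface
    LINK_BITS = 10 * 1e9           # assumed 10 Gbit/s link
    THRESHOLD = 0.90               # "pegged" means >90% utilized
    SUSTAIN = 2 * 3600             # stay pegged this many seconds before alarming
    INTERVAL = 60                  # sample once a minute

    def tx_bytes(iface):
        # Read the kernel's transmit byte counter for the interface.
        with open(f"/sys/class/net/{iface}/statistics/tx_bytes") as f:
            return int(f.read())

    pegged_since = None
    prev = tx_bytes(IFACE)
    while True:
        time.sleep(INTERVAL)
        cur = tx_bytes(IFACE)
        util = (cur - prev) * 8 / INTERVAL / LINK_BITS
        prev = cur
        if util >= THRESHOLD:
            pegged_since = pegged_since or time.time()
            if time.time() - pegged_since >= SUSTAIN:
                print(f"ALARM: {IFACE} over {THRESHOLD:.0%} for {SUSTAIN/3600:.0f}h")
        else:
            pegged_since = None

The point of the long sustain window is exactly the false-positive concern
above: a legit burst resets nothing downstream, but a link that sits pinned
for hours is probably worth a human looking at it.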

Also, while we're on the subject, if anyone isn't doing any kind of metric
collection on their ceph networks, I highly recommend installing ganglia.
It's dead simple to get going and creates all sorts of useful system-level
graphs that are helpful in locating and identifying trends and problems.

QH

On Mon, Aug 3, 2015 at 9:45 AM, Antonio Messina <antonio.mess...@uzh.ch>
wrote:

> On Mon, Aug 3, 2015 at 5:10 PM, Quentin Hartman
> <qhart...@direwolfdigital.com> wrote:
> > The problem with this kind of monitoring is that there are so many possible
> > metrics to watch and so many possible ways to watch them. For myself, I'm
> > working on implementing a couple of things:
> > - Watching error counters on servers
> > - Watching error counters on switches
> > - Watching performance
>
> I would also check:
>
> - link speed (on both servers and switches)
> - link usage (over 80% issue a warning)
>
> .a.
>
> --
> antonio.mess...@uzh.ch
> S3IT: Services and Support for Science IT        http://www.s3it.uzh.ch/
> University of Zurich                             Y12 F 84
> Winterthurerstrasse 190
> CH-8057 Zurich Switzerland
>