On Tue, 14 Jan 2014, Lamont Granquist wrote:
On Mon, 13 Jan 2014, Kelvin Ku wrote:
You can eliminate a lot of active checks if you watch the logs for normal
activity (you can even set up your alerts so that, instead of just calling a
person, they first run a monitoring probe in case the traffic had just dropped
off).
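
A minimal sketch of that "probe before paging" idea, in Python. The names
(should_page, max_silence_s, probe) are made up for illustration and do not
come from any particular monitoring tool; probe is assumed to be any callable
that returns True when the service answers.

```python
def should_page(last_log_ts, now, max_silence_s, probe):
    """Decide whether a quiet log is worth waking someone up for.

    last_log_ts   -- timestamp (seconds) of the most recent normal log entry
    now           -- current timestamp (seconds)
    max_silence_s -- how long the logs may be quiet before we get suspicious
    probe         -- hypothetical callable; returns True if the service answers
    """
    if now - last_log_ts <= max_silence_s:
        # Normal activity was seen recently, so no active check is needed.
        return False
    # The logs went quiet: run one active probe before calling a person,
    # in case traffic has simply dropped off.
    return not probe()
```

With this shape the active probe only runs when the passive signal (log
activity) disappears, which is what removes most of the routine checks.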
One thing to remember: your load balancer's test is not testing to see if the
product works, just that the webserver works. You need other tests to make
sure that all the web hits you are getting aren't just generating a 'database
error, try again later' response ;-)
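
As a rough illustration of the difference, here is a Python sketch of a check
that looks at what came back rather than only at whether something came back.
The marker strings and the function name are assumptions for the example; a
real application would have its own error pages to look for.

```python
# Hypothetical phrases that show up on the application's error page.
ERROR_MARKERS = ("database error", "try again later")


def response_is_healthy(status, body):
    """Return True only if the response looks like the product working.

    A load-balancer style check stops at the status code; this one also
    rejects a 200 whose body is really an error page.
    """
    if status != 200:
        return False
    text = body.lower()
    return not any(marker in text for marker in ERROR_MARKERS)
```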
Yes, the best active checks you have of a webserver or a database are the
clients of that service. If they wrap all their calls in timers and report
success, failure, and 99th-percentile (perc99) times, and you are not getting
any failures and the perc99 times are within your SLAs, then the
webserver/database is probably up -- in fact that is probably the definition
of up or down. Those are the alerts that should be wired up to page people
into action at 3am.
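
A small Python sketch of what "wrapping calls in timers" could look like on
the client side. The CallStats class and its method names are invented for
this example, the percentile uses the simple nearest-rank method, and a real
client would ship these numbers to a metrics system instead of keeping them in
a list.

```python
import time


class CallStats:
    """Collects durations and failures for calls to one backend service."""

    def __init__(self):
        self.durations = []  # seconds, one entry per call
        self.failures = 0

    def timed(self, fn, *args, **kwargs):
        """Run fn, recording how long it took and whether it raised."""
        start = time.monotonic()
        try:
            return fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            raise
        finally:
            self.durations.append(time.monotonic() - start)

    def p99(self):
        """99th-percentile duration (nearest rank), or None with no data."""
        if not self.durations:
            return None
        ordered = sorted(self.durations)
        rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n), integer math
        return ordered[rank - 1]

    def is_up(self, sla_seconds):
        """'Up' here means: no failures and p99 within the SLA.

        With no calls recorded we cannot tell, so this reports False.
        """
        p = self.p99()
        return self.failures == 0 and p is not None and p <= sla_seconds
```

Usage would be along the lines of stats.timed(db.query, sql) for every call
(db.query being whatever the client already calls), with the pager wired to
is_up() rather than to a separate synthetic check.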
Then there's trending of resources like disk space and other things that will
become problems if they aren't addressed, but those should be yellow alerts or
should flap yellow/green long enough that they can be caught and addressed
during normal business hours before they cause an impact.
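
One way to turn a trend into a yellow alert is to extrapolate it. The Python
sketch below fits a straight line (ordinary least squares) to recent disk
usage samples and goes yellow when the projected time to full drops below a
warning window. The function names, the 14-day default, and the assumption
that growth is roughly linear are all illustrative choices, not a
recommendation.

```python
def days_until_full(samples, capacity):
    """Estimate days until the disk is full from (day, used) samples.

    samples  -- list of (day_number, used_amount), at least two distinct days
    capacity -- total size, in the same unit as used_amount
    Returns None when usage is flat or shrinking.
    """
    n = len(samples)
    mean_x = sum(day for day, _ in samples) / n
    mean_y = sum(used for _, used in samples) / n
    sxx = sum((day - mean_x) ** 2 for day, _ in samples)
    sxy = sum((day - mean_x) * (used - mean_y) for day, used in samples)
    slope = sxy / sxx  # growth per day
    if slope <= 0:
        return None
    _, last_used = samples[-1]
    return (capacity - last_used) / slope


def disk_status(samples, capacity, warn_days=14):
    """Return 'yellow' if the disk is projected to fill within warn_days."""
    remaining = days_until_full(samples, capacity)
    if remaining is not None and remaining <= warn_days:
        return "yellow"
    return "green"
```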
In a large enough site, monitoring stuff like CPU utilization and wiring it up
to pagers becomes a tedious job of dealing with false alerts. Often those go
off for services that are designed to grind CPU and there's no impact to
SLAs. I've generally wound up only displaying CPU-grinding hosts (useful
information when trying to find the root cause of an outage) but not alerting
on CPU and only alerting on actual app performance/availability metrics.
It's important to point out that you should alert on abnormal behavior, not just
high utilization.
The example I like to give is that there is some load level that you would want
to alert on at 3am Sunday morning because it is so high, but that same level of
load should also cause an alert at 10am Monday morning, but at that point you
want the alert because traffic is so _low_ that that load level means something
is very broken.
Any threshold that's just "above a given level" is of very limited use. It's a
good first step, but it's important to realize that this is only a first step.
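
To make the 3am/10am example concrete, here is a Python sketch that compares
load against an expected value for that time of week and alerts on deviation
in either direction. The baseline numbers, the 50% tolerance, and the names
are invented for the example; in practice the baseline would come from
historical data.

```python
# Hypothetical expected load (requests/second) by (day, hour).
BASELINE = {
    ("Sun", 3): 100.0,
    ("Mon", 10): 1000.0,
}


def is_abnormal(load, expected, tolerance=0.5):
    """True when load is more than `tolerance` (a fraction) off expected."""
    low = expected * (1 - tolerance)
    high = expected * (1 + tolerance)
    return load < low or load > high


def check(day, hour, load):
    """Compare the observed load with the baseline for that time slot."""
    return is_abnormal(load, BASELINE[(day, hour)])
```

With these made-up numbers, a load of 300 is flagged in both slots: too high
for Sunday 3am, too low for Monday 10am. A load of 900 on Monday morning is
not flagged, although a fixed "above 250" threshold would fire on it.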
David Lang
_______________________________________________
Discuss mailing list
https://lists.lopsa.org/cgi-bin/mailman/listinfo/discuss
This list provided by the League of Professional System Administrators
http://lopsa.org/