Re: [lopsa-discuss] Metrics vs Monitoring...

David Lang Tue, 14 Jan 2014 16:02:05 -0800

On Tue, 14 Jan 2014, Lamont Granquist wrote:

On Mon, 13 Jan 2014, Kelvin Ku wrote:
You can eliminate a lot of active checks if you watch the logs for normal activity (you can even set up your alerts so that, instead of just calling a person, it first does a monitoring probe in case the traffic had just dropped off).
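A minimal sketch of the "probe before paging" idea, assuming a hypothetical `probe` callable that performs the active check and a log watcher that can report how long it has been since the last normal log entry. The names and the three return values are illustrative only, not part of any particular monitoring tool.

```python
def check_quiet_service(seconds_since_last_log, quiet_after, probe):
    """Decide what to do when a service's logs go quiet.

    seconds_since_last_log: age of the newest "normal activity" log line.
    quiet_after: how many seconds of silence count as suspicious (assumed value).
    probe: zero-argument callable; returns True if an active check succeeds.
    """
    if seconds_since_last_log < quiet_after:
        # Normal activity is visible in the logs, so no active check is needed.
        return "ok"
    if probe():
        # The service answers correctly; traffic has simply dropped off.
        return "quiet"
    # No log activity and the probe failed: this is worth calling a person.
    return "page"
```

The probe only runs when the logs have gone silent, so the active check costs nothing while traffic is flowing.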
One thing to remember: your load balancer's test is not testing whether the product works, just that the webserver works. You need other tests to make sure that all the web hits you are getting aren't just generating a 'database error, try again later' response ;-)
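As a rough illustration, a check that looks at the response body as well as the status code might look like the sketch below. The function name, the `expected_text` parameter, and the default error marker are assumptions made up for this example; a real check would use whatever text your own pages and error handlers produce.

```python
def response_is_healthy(status, body, expected_text,
                        error_markers=("database error",)):
    """Return True only if the page looks like a real, working response.

    A 200 status alone only shows that the webserver answered, so the body
    is also checked for known error text and for content that a working
    page is expected to contain.
    """
    if status != 200:
        return False
    lowered = body.lower()
    if any(marker.lower() in lowered for marker in error_markers):
        # The webserver is up, but the product behind it is not working.
        return False
    return expected_text.lower() in lowered
```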
Yes, the best active checks you have of a webserver or a database are the clients of that service. If they are wrapping all their calls in timers and reporting success, failure, and perc99 times, and you are not getting any failures and the perc99 times are within your SLAs, then the webserver/database is probably up. In fact, that is probably the definition of up or down. Those are the alerts that should be wired up to page people into action at 3am.
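A minimal sketch of what "wrapping calls in timers" could look like on the client side. The `CallStats` class, its method names, and the nearest-rank percentile are assumptions for illustration; a real deployment would normally ship these numbers to a metrics system rather than keep them in a list.

```python
import time
from dataclasses import dataclass, field


@dataclass
class CallStats:
    """Latencies and failure count for calls to one downstream service."""
    latencies: list = field(default_factory=list)
    failures: int = 0

    def timed(self, func, *args, **kwargs):
        """Call func, recording how long it took and whether it raised."""
        start = time.monotonic()
        try:
            return func(*args, **kwargs)
        except Exception:
            self.failures += 1
            raise
        finally:
            self.latencies.append(time.monotonic() - start)

    def perc99(self):
        """Nearest-rank 99th percentile latency in seconds, or None if empty."""
        if not self.latencies:
            return None
        ordered = sorted(self.latencies)
        # Integer form of ceil(0.99 * n), avoiding float rounding.
        rank = (99 * len(ordered) + 99) // 100
        return ordered[rank - 1]

    def should_page(self, sla_seconds):
        """Page only on failures or a perc99 that is outside the SLA."""
        p99 = self.perc99()
        return self.failures > 0 or (p99 is not None and p99 > sla_seconds)
```

A client would call something like `stats.timed(fetch_user, user_id)` (with `fetch_user` standing in for any real call) and evaluate `should_page` over each reporting window.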
Then there's trending of resources like disk space, and other conditions that will become problems if they aren't addressed. Those should be yellow alerts, or should flap yellow/green long enough that they can be caught and addressed during normal business hours before they cause an impact.
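One way to turn disk-space trending into a yellow alert is to project when the disk will fill and warn while there is still time to act during business hours. The sketch below assumes hourly `(hour, used_bytes)` samples and a hypothetical 72-hour warning window; the function names and the straight-line fit are illustrative choices, not a recommendation of a specific tool.

```python
def hours_until_full(samples, capacity_bytes):
    """Estimate hours until the disk is full from (hour, used_bytes) samples.

    Fits a least-squares line to get the growth rate, then projects forward
    from the most recent sample. Returns None if usage is flat or shrinking,
    or if there is too little data to fit a line.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None
    _, last_used = samples[-1]
    return max(0.0, (capacity_bytes - last_used) / slope)


def disk_alert_level(samples, capacity_bytes, warn_hours=72):
    """Return 'yellow' if the disk is projected to fill within warn_hours."""
    remaining = hours_until_full(samples, capacity_bytes)
    if remaining is not None and remaining <= warn_hours:
        return "yellow"
    return "green"
```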
In a large enough site, monitoring stuff like CPU utilization and wiring it up to pagers becomes a tedious job of dealing with false alerts. Often those go off for services that are designed to grind CPU, and there's no impact to SLAs. I've generally wound up only displaying CPU-grinding hosts (useful information when trying to find the root cause of an outage) but not alerting on CPU, and only alerting on actual app performance/availability metrics.

It's important to point out that you should alert on abnormal behavior, not just high utilization.

The example I like to give is that there is some load level that you would want to alert on at 3am Sunday morning because it is so high. That same level of load should also cause an alert at 10am Monday morning, but at that point you want the alert because the load is so _low_ that it means something is very broken.

Any threshold that's just "above a given level" is of very limited use. It's a good first step, but it's important to realize that this is only a first step.
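A rough sketch of the difference, assuming you keep a history of load samples per (weekday, hour) slot. The `classify_load` name, the baseline numbers, and the three-standard-deviation tolerance are all made up for illustration; the point is only that the same reading can be "high" in one slot and "low" in another.

```python
from datetime import datetime
from statistics import mean, stdev


def classify_load(when, load, baseline, tolerance=3.0):
    """Compare load against what is normal for this weekday and hour.

    baseline maps (weekday, hour) to past load samples, with Monday as
    weekday 0. Returns "high", "low", or None when the load is within
    tolerance standard deviations of the slot's mean (or there is too
    little history to judge).
    """
    history = baseline.get((when.weekday(), when.hour), [])
    if len(history) < 2:
        return None
    mu, sigma = mean(history), stdev(history)
    if abs(load - mu) <= tolerance * sigma:
        return None
    return "high" if load > mu else "low"


# Hypothetical history: quiet at 3am Sunday, busy at 10am Monday.
BASELINE = {
    (6, 3): [10, 12, 11, 9],
    (0, 10): [480, 510, 495, 505],
}

sunday_3am = datetime(2014, 1, 12, 3)
monday_10am = datetime(2014, 1, 13, 10)
```

With this baseline, a load of 100 is flagged as "high" at `sunday_3am` and as "low" at `monday_10am`, while a load of 500 at `monday_10am` raises nothing.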


David Lang
_______________________________________________
Discuss mailing list
[email protected]
https://lists.lopsa.org/cgi-bin/mailman/listinfo/discuss
This list provided by the League of Professional System Administrators
http://lopsa.org/
