On Tue, 14 Jan 2014, Lamont Granquist wrote:

On Mon, 13 Jan 2014, Kelvin Ku wrote:
You can eliminate a lot of active checks if you watch the logs for normal activity (you can even set up your alerts so that, instead of just calling a person, they first run a monitoring probe in case the traffic has just dropped off).
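A rough sketch of that "probe before you page" idea, in Python. The function name, the quiet limit, and the probe callable are all made up for illustration; nothing here is tied to a particular monitoring tool.

```python
def handle_quiet_logs(seconds_since_last_log, quiet_limit, probe):
    """Decide what to do when the logs have gone quiet.

    probe is any callable returning True if the service answers an
    active check (hypothetical; supply your own).
    """
    if seconds_since_last_log <= quiet_limit:
        return "ok"            # normal log activity, no active check needed
    if probe():
        return "quiet"         # service answers; traffic just dropped off
    return "page"              # quiet logs and a failed probe: wake someone


# Example: logs silent for 10 minutes, limit is 5, and the probe fails.
print(handle_quiet_logs(600, 300, lambda: False))
```

The point of the ordering is that the active probe only runs when the passive signal (log activity) is missing, so most of the time no probe traffic is generated at all.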

One thing to remember: your load balancer's test is not testing whether the product works, just that the webserver works. You need other tests to make sure that all the web hits you are getting aren't just generating a 'database error, try again later' response ;-)
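As an illustration only, a content-aware check might look something like this in Python. The marker strings and the required text are assumptions; a real check would use whatever your application actually prints.

```python
# Hypothetical strings that indicate a "polite" failure page.
ERROR_MARKERS = ("database error", "try again later")


def page_is_healthy(status_code, body, required_text):
    """Return True only if the response looks like a real product page."""
    if status_code != 200:
        return False
    lowered = body.lower()
    # A 200 that carries an error message is still a failure.
    if any(marker in lowered for marker in ERROR_MARKERS):
        return False
    # Require something only a working page would contain.
    return required_text.lower() in lowered


print(page_is_healthy(200, "Database error, try again later", "Add to cart"))
```

A load balancer health check would typically stop at the status code; the extra two steps are what catch the "webserver is fine, product is broken" case.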

Yes, the best active checks you have of a webserver or a database are the clients of that service. If they wrap all their calls in timers and report success, failure, and perc99 (99th percentile) times, and you are not getting any failures and the perc99 times are within your SLAs, then the webserver/database is probably up -- in fact that is probably the definition of up or down. Those are the alerts that should be wired up to page people into action at 3am.
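A minimal Python sketch of a client wrapping its calls in timers. The class and method names are invented for this example, and the percentile uses the simple nearest-rank method; a real client would more likely ship these numbers to a metrics system than keep them in a list.

```python
import time


class CallStats:
    """Collects outcomes and latencies for calls to one downstream service."""

    def __init__(self):
        self.latencies = []   # seconds, one entry per call
        self.failures = 0

    def timed(self, func, *args, **kwargs):
        """Run func, recording its duration and whether it raised."""
        start = time.monotonic()
        try:
            return func(*args, **kwargs)
        except Exception:
            self.failures += 1
            raise
        finally:
            self.latencies.append(time.monotonic() - start)

    def p99(self):
        """99th percentile latency by the nearest-rank method."""
        if not self.latencies:
            return 0.0
        ordered = sorted(self.latencies)
        rank = (99 * len(ordered) + 99) // 100   # ceil(0.99 * n) in integers
        return ordered[rank - 1]

    def is_up(self, sla_seconds, max_failures=0):
        """'Up' here means: failures within budget and p99 inside the SLA."""
        return self.failures <= max_failures and self.p99() <= sla_seconds
```

Usage would be something like `stats.timed(db.query, sql)` around each call (with `db.query` standing in for whatever the client really calls), and `stats.is_up(0.5)` as the thing that drives the pager.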

Then there's trending of resources like disk space and other things that will become problems if they aren't addressed, but those should be yellow alerts or should flap yellow/green long enough that they can be caught and addressed during normal business hours before they cause an impact.
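For the trending case, one simple approach (an assumption on my part, not the only way to do it) is a straight-line projection of when the disk fills, raising yellow when that is within a few days. The function names and the 72-hour default are hypothetical.

```python
def hours_until_full(samples, capacity_bytes):
    """Estimate hours until the disk fills from (hour, used_bytes) samples.

    Draws a straight line between the first and last sample. Returns None
    when usage is flat or shrinking.
    """
    (t0, used0), (t1, used1) = samples[0], samples[-1]
    if t1 <= t0 or used1 <= used0:
        return None
    rate = (used1 - used0) / (t1 - t0)      # bytes per hour
    return (capacity_bytes - used1) / rate


def disk_status(samples, capacity_bytes, warn_hours=72):
    """Return 'yellow' if the disk is projected to fill within warn_hours."""
    remaining = hours_until_full(samples, capacity_bytes)
    if remaining is not None and remaining <= warn_hours:
        return "yellow"
    return "green"


# Growing 1 unit/hour with 40 units left: about 40 hours to go.
print(disk_status([(0, 50), (10, 60)], capacity_bytes=100))
```

This alerts on the trend rather than on a fixed "90% full" line, which leaves time to fix it during business hours.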

In a large enough site, monitoring stuff like CPU utilization and wiring it up to pagers becomes a tedious job of dealing with false alerts. Often those go off for services that are designed to grind CPU, and there's no impact to SLAs. I've generally wound up only displaying CPU-grinding hosts (useful information when trying to find the root cause of an outage) but not alerting on CPU, and only alerting on actual app performance/availability metrics.

It's important to point out that you should alert on abnormal behavior, not just high utilization.

The example I like to give is that there is some load level that you would want to alert on at 3am Sunday morning because it is so high, but that same level of load should also cause an alert at 10am Monday morning. At that point you want the alert because traffic is so _low_ that seeing that load level means something is very broken.

Any threshold that's just "above a given level" is of very limited use. It's a good first step, but it's important to realize that it is only a first step.
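To make the 3am Sunday / 10am Monday example concrete, here is a small Python sketch that compares load against what is normal for that hour of the week. The baseline numbers, the 50% tolerance, and the function name are all invented for illustration; in practice the baseline would come from your own history.

```python
def deviation(load, hour_of_week, baseline, tolerance=0.5):
    """Return 'high', 'low', or None for load relative to the usual level.

    baseline maps hour_of_week (0-167, hour 0 = Sunday 00:00) to the
    expected load at that hour. tolerance is the allowed fractional
    deviation in either direction.
    """
    expected = baseline[hour_of_week]
    if load > expected * (1 + tolerance):
        return "high"
    if load < expected * (1 - tolerance):
        return "low"
    return None


# Hypothetical baseline: quiet at 3am Sunday, busy at 10am Monday.
baseline = {3: 100, 34: 1000}

print(deviation(400, 3, baseline))    # same load, Sunday 3am
print(deviation(400, 34, baseline))   # same load, Monday 10am
```

The same load of 400 is flagged at both times, once for being too high and once for being too low, which a single "above a given level" threshold cannot express.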

David Lang
_______________________________________________
Discuss mailing list
[email protected]
https://lists.lopsa.org/cgi-bin/mailman/listinfo/discuss
This list provided by the League of Professional System Administrators
http://lopsa.org/
