On Mon, 13 Jan 2014, Kelvin Ku wrote:
You can eliminate a lot of active checks if you watch the logs for normal activity. (You can even set up your alerts so that, instead of just calling a person, the alert first runs a monitoring probe in case the traffic has just dropped off.)
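
A rough Python sketch of that "watch the logs, probe before calling anyone" flow. The function name, the three result strings, and the `probe` callable are all made up for illustration; a real setup would plug in whatever log tailer and probe it already has.

```python
def check_traffic(last_seen, now, quiet_after, probe):
    """Decide what to do based on log activity, probing only when logs go quiet.

    last_seen   -- timestamp (seconds) of the most recent normal log entry
    now         -- current timestamp (seconds)
    quiet_after -- seconds of silence tolerated before we get suspicious
    probe       -- hypothetical zero-argument callable; True if the service answers
    """
    if now - last_seen <= quiet_after:
        return "ok"      # logs show normal activity, no active check needed
    if probe():
        return "quiet"   # service answers; traffic has simply dropped off
    return "page"        # no log activity and the probe failed: call a person
```

The active probe only runs on the rare path where the logs have gone silent, which is where the savings in active checks would come from.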

One thing to remember: your load balancer's test is not testing whether the product works, just that the webserver works. You need other tests to make sure that all the web hits you are getting aren't just generating a 'database error, try again later' response ;-)
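
As a minimal sketch of such a test in Python: look at the response body, not just the status code. The function name and the marker list are assumptions for illustration; the only marker used here is the 'database error' text from the example above, and a real check would match whatever your application's error pages actually say.

```python
def looks_healthy(status, body, error_markers=("database error",)):
    """Return True only if the hit succeeded AND is not an error page in disguise.

    status        -- HTTP status code of the response
    body          -- response body as text
    error_markers -- hypothetical substrings that identify an application error page
    """
    text = body.lower()
    return status == 200 and not any(marker in text for marker in error_markers)
```

A page that returns 200 with 'Database error, try again later' in the body would pass a typical load balancer check but fail this one.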

Yes, the best active checks you have of a webserver or a database are the clients of that service. If they wrap all their calls in timers and report successes, failures, and perc99 (99th-percentile) times, and you are seeing no failures and perc99 times within your SLAs, then the webserver/database is probably up. In fact, that is probably the definition of up or down. Those are the alerts that should be wired up to page people into action at 3am.
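
Something like the following Python sketch is what I mean by wrapping calls in timers. The `CallStats` class, its method names, and the rolling-window size are all hypothetical, and the percentile uses a simple nearest-rank calculation; a real client would more likely report these numbers to whatever metrics system is already in place.

```python
import time
from collections import deque


class CallStats:
    """Rolling record of recent client calls: outcome and latency."""

    def __init__(self, window=1000):
        self.samples = deque(maxlen=window)  # (ok, seconds) pairs

    def record(self, ok, seconds):
        self.samples.append((ok, seconds))

    def timed(self, func, *args, **kwargs):
        """Call func, recording how long it took and whether it raised."""
        start = time.monotonic()
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.record(False, time.monotonic() - start)
            raise
        self.record(True, time.monotonic() - start)
        return result

    def failures(self):
        return sum(1 for ok, _ in self.samples if not ok)

    def perc99(self):
        """Nearest-rank 99th percentile of recorded latencies, or None if empty."""
        latencies = sorted(seconds for _, seconds in self.samples)
        if not latencies:
            return None
        rank = (99 * len(latencies) + 99) // 100  # integer ceil(0.99 * n)
        return latencies[rank - 1]

    def should_page(self, sla_seconds, max_failures=0):
        """Page only on failures or on perc99 latency outside the SLA."""
        p99 = self.perc99()
        if p99 is None:
            return False  # no data; detecting silence is a separate check
        return self.failures() > max_failures or p99 > sla_seconds
```

A client would wrap each call as `stats.timed(run_query, sql)` (where `run_query` is whatever function it already uses) and the paging rule would be driven by `should_page()`.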

Then there's trending of resources like disk space, and other conditions that will become problems if they aren't addressed. Those should be yellow alerts, or should flap yellow/green long enough that they can be caught and addressed during normal business hours before they cause an impact.
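
A simple way to get that kind of yellow alert is to project the trend forward. This Python sketch assumes straight-line growth between the oldest and newest samples, which is a simplification; the function names and the 14-day warning threshold are made up for illustration.

```python
def days_until_full(samples, capacity):
    """Estimate days until the disk fills, assuming linear growth.

    samples  -- list of (day, used) pairs, oldest first, at least two entries
    capacity -- total size, in the same units as 'used'
    Returns None when usage is flat or shrinking.
    """
    (day0, used0), (day1, used1) = samples[0], samples[-1]
    if day1 <= day0 or used1 <= used0:
        return None
    rate = (used1 - used0) / (day1 - day0)  # units per day
    return (capacity - used1) / rate


def disk_alert(samples, capacity, warn_days=14):
    """Yellow if the disk is projected to fill within warn_days, else green."""
    remaining = days_until_full(samples, capacity)
    if remaining is not None and remaining <= warn_days:
        return "yellow"
    return "green"
```

Nothing here pages anyone: the result is a color for the dashboard, to be picked up during business hours.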

In a large enough site, monitoring things like CPU utilization and wiring them up to pagers becomes a tedious job of dealing with false alerts. Often those alerts go off for services that are designed to grind CPU, with no impact on SLAs. I've generally wound up only displaying CPU-grinding hosts (useful information when trying to find the root cause of an outage), not alerting on CPU, and alerting only on actual app performance/availability metrics.
_______________________________________________
Discuss mailing list
[email protected]
https://lists.lopsa.org/cgi-bin/mailman/listinfo/discuss
This list provided by the League of Professional System Administrators
http://lopsa.org/
