Re: [Nagios-users] Transient errors
Yes, we have to make the same kinds of tweaks in our environment. Sometimes I've had to develop a new plugin or monitor different elements that will alert me to the situation more quickly.

Frank

-----Original Message-----
From: Andreas Ericsson [mailto:a...@op5.se]
Sent: Friday, March 02, 2012 4:15 AM
To: Nagios Users List
Subject: Re: [Nagios-users] Transient errors

On 03/01/2012 10:38 PM, David Dyer-Bennet wrote:
>
> I see a lot of transient errors on services and hosts I'm monitoring.
> Hence finding ways to keep notifications from going out on situations that
> will resolve themselves is kind of an issue.
>
> I've played with how many failures in a row are needed to cause a
> notification, and have that set differently for things I'm monitoring
> across long links (Beijing, say) compared to things I'm monitoring locally
> or in New York. Of course, one problem with that is that it makes it take
> longer before a real problem causes a notification. Right now it takes
> over 15 minutes for the total failure of our link to Beijing to cause a
> notification.
>
> For things that are numeric values, I can play with the critical and
> warning ranges to potentially reduce false positives. That, at least,
> doesn't slow down recognition of total failures. Some things just don't
> seem to fit the Nagios model -- for example it's quite normal for the SQL
> server to pull 100% of the cpu for periods now and then, but if it goes on
> too long, *that's* unusual. Hmm; I suppose I could override the number of
> failures needed to cause a notification in the service definition for
> those, couldn't I? There may be some things I should just stop monitoring
> (there aren't clear-cut "okay" and "bad" behaviors that I can quantify).
>
> I guess I'm wondering if there are useful basic approaches to handling
> this problem that I'm missing, or if I just need to work through the
> details more carefully. I'm startled at how often I get isolated
> failures for no apparent reason. Is that normal for most people
> monitoring services? I think I'm finding my connections time out now and
> then due simply to load, without the load actually being at all high.

Apart from the great writeup Mark wrote, I'd like to add that you can also set "first_notification_delay" for both hosts and services. That will make the services and hosts appear red and critical in the UI, but it will delay notifications for AT LEAST the specified amount of time (multiplied by interval_length, so usually it means minutes).

I've stressed AT LEAST, since first_notification_delay requires that a check is run in order to trigger the notification, so the delay could sometimes be greater than what you specify. Some people are a bit freaked out by that, so you'd best know it before you start using it.

--
Andreas Ericsson andreas.erics...@op5.se
OP5 AB www.op5.se
Tel: +46 8-230225 Fax: +46 8-230231

Considering the successes of the wars on alcohol, poverty, drugs and terror, I think we should give some serious thought to declaring war on peace.

___
Nagios-users mailing list
Nagios-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting any issue.
::: Messages without supporting info will risk being sent to /dev/null
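
The trade-off raised in the quoted question (requiring more consecutive failures also delays notification of a real outage) can be estimated with a little arithmetic. The sketch below is a simplified model, not something posted in the thread: it assumes a service is checked every check_interval while healthy, rechecked every retry_interval once a failure is seen, and notified when max_check_attempts consecutive failures have been recorded. It ignores check latency and execution time. The function name and the example numbers are hypothetical.

```python
def notification_delay_bounds(check_interval, retry_interval,
                              max_check_attempts, interval_length=60):
    """Return (best_case, worst_case) seconds from outage start to notification.

    Simplified model: intervals are given in units of interval_length
    seconds (assumed 60 here), and check latency/execution time is ignored.
    """
    # After the first failed check, the remaining attempts run at retry_interval.
    retries = (max_check_attempts - 1) * retry_interval
    # Best case: the outage begins just before a regularly scheduled check.
    best_case = retries * interval_length
    # Worst case: the outage begins just after a check, so a full
    # check_interval passes before the first failure is even observed.
    worst_case = (check_interval + retries) * interval_length
    return best_case, worst_case


# Hypothetical settings for a nearby host and for a host across a long link.
local = notification_delay_bounds(check_interval=5, retry_interval=1,
                                  max_check_attempts=3)
remote = notification_delay_bounds(check_interval=5, retry_interval=3,
                                   max_check_attempts=5)
print("local  (best, worst) seconds:", local)
print("remote (best, worst) seconds:", remote)
```

Under these assumed values, the long-link settings push the worst case well past 15 minutes, which is the kind of delay described in the question. Shortening retry_interval rather than lowering max_check_attempts is one way to keep several confirmations while reducing the total wait.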
Re: [Nagios-users] Transient errors
On 03/01/2012 10:38 PM, David Dyer-Bennet wrote:
>
> I see a lot of transient errors on services and hosts I'm monitoring.
> Hence finding ways to keep notifications from going out on situations that
> will resolve themselves is kind of an issue.
>
> I've played with how many failures in a row are needed to cause a
> notification, and have that set differently for things I'm monitoring
> across long links (Beijing, say) compared to things I'm monitoring locally
> or in New York. Of course, one problem with that is that it makes it take
> longer before a real problem causes a notification. Right now it takes
> over 15 minutes for the total failure of our link to Beijing to cause a
> notification.
>
> For things that are numeric values, I can play with the critical and
> warning ranges to potentially reduce false positives. That, at least,
> doesn't slow down recognition of total failures. Some things just don't
> seem to fit the Nagios model -- for example it's quite normal for the SQL
> server to pull 100% of the cpu for periods now and then, but if it goes on
> too long, *that's* unusual. Hmm; I suppose I could override the number of
> failures needed to cause a notification in the service definition for
> those, couldn't I? There may be some things I should just stop monitoring
> (there aren't clear-cut "okay" and "bad" behaviors that I can quantify).
>
> I guess I'm wondering if there are useful basic approaches to handling
> this problem that I'm missing, or if I just need to work through the
> details more carefully. I'm startled at how often I get isolated
> failures for no apparent reason. Is that normal for most people
> monitoring services? I think I'm finding my connections time out now and
> then due simply to load, without the load actually being at all high.

Apart from the great writeup Mark wrote, I'd like to add that you can also set "first_notification_delay" for both hosts and services. That will make the services and hosts appear red and critical in the UI, but it will delay notifications for AT LEAST the specified amount of time (multiplied by interval_length, so usually it means minutes).

I've stressed AT LEAST, since first_notification_delay requires that a check is run in order to trigger the notification, so the delay could sometimes be greater than what you specify. Some people are a bit freaked out by that, so you'd best know it before you start using it.

--
Andreas Ericsson andreas.erics...@op5.se
OP5 AB www.op5.se
Tel: +46 8-230225 Fax: +46 8-230231

Considering the successes of the wars on alcohol, poverty, drugs and terror, I think we should give some serious thought to declaring war on peace.
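
The "AT LEAST" caveat above can be illustrated with a small model. This is only a sketch of the behaviour as described in the message, not of Nagios internals: it assumes the notification goes out at the first check that runs once first_notification_delay (in units of interval_length seconds) has elapsed since the problem started. The function name, the check schedule, and the example values are hypothetical.

```python
def first_notification_time(check_times, first_notification_delay,
                            interval_length=60):
    """Return the time (seconds after the problem began) of the first check
    that can trigger a notification, or None if no check is late enough.

    check_times: seconds after problem start at which checks run.
    Simplified model; the real scheduler uses its own timestamps.
    """
    threshold = first_notification_delay * interval_length
    for t in sorted(check_times):
        # A notification needs a check result at or after the delay threshold.
        if t >= threshold:
            return t
    return None


# Hypothetical schedule: checks every 5 minutes, delay configured as 7 "minutes".
checks = [0, 300, 600, 900]
actual = first_notification_time(checks, first_notification_delay=7)
print("configured delay (s):", 7 * 60)
print("first notifying check (s):", actual)
```

With these assumed numbers the configured delay expires between two checks, so the notification waits for the next check and the effective delay is longer than the configured one.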
Re: [Nagios-users] Transient errors
David,

I'm afraid I don't have a simple answer for you there. It sounds like you're monitoring some things that are far away network-wise. If this were my environment I would try to set up a distributed Nagios installation with locally situated Nagios servers to monitor services that were local. You could either use Merlin and a poller/NOC setup or possibly something like "Multi-Site" to allow you to see all the different locations from one central location.

If you're talking about things like ping for host checks (or better yet 'fping'), then you should be able to adjust the threshold upwards to allow for longer and longer round-trip times.

Otherwise, I would say that the approaches you mention are what you generally have to wind up doing. For instance, we try to have standard thresholds for CPU alerting on Windows servers, but we have some reporting servers that can peg the CPU for 30 minutes. So the teams who own those servers have asked us to raise the threshold for hard criticals to 45 consecutive failures (roughly 45 minutes with the way we schedule checks to run).

So you kind of have to take each check on a case-by-case basis, usually because you saw the failures, dug into what the exact issue was, and determined what, if anything, the resolution was.

One other example. We do some checks of Oracle databases, and the way the Oracle client libraries work, if a database is down, the library could make the code wait for 5 minutes before it returns anything. Obviously that's sort of a problem for Nagios in terms of scheduling, latency, and check execution times. So the solution was to modify the code itself to have a timeout that kills the Oracle connection attempt and aborts the check script.

As a related thing from that last example, if you're using check_nrpe and you're getting timeouts, you *could* increase the timeout value, but again, that has implications for your Nagios server if those checks run too long. You usually want checks executing quickly.

Mark

-----Original Message-----
From: David Dyer-Bennet [mailto:d...@dd-b.net]
Sent: Thursday, March 01, 2012 4:38 PM
To: nagios-users@lists.sourceforge.net
Subject: [Nagios-users] Transient errors

I see a lot of transient errors on services and hosts I'm monitoring. Hence finding ways to keep notifications from going out on situations that will resolve themselves is kind of an issue.

I've played with how many failures in a row are needed to cause a notification, and have that set differently for things I'm monitoring across long links (Beijing, say) compared to things I'm monitoring locally or in New York. Of course, one problem with that is that it makes it take longer before a real problem causes a notification. Right now it takes over 15 minutes for the total failure of our link to Beijing to cause a notification.

For things that are numeric values, I can play with the critical and warning ranges to potentially reduce false positives. That, at least, doesn't slow down recognition of total failures. Some things just don't seem to fit the Nagios model -- for example it's quite normal for the SQL server to pull 100% of the cpu for periods now and then, but if it goes on too long, *that's* unusual. Hmm; I suppose I could override the number of failures needed to cause a notification in the service definition for those, couldn't I? There may be some things I should just stop monitoring (there aren't clear-cut "okay" and "bad" behaviors that I can quantify).

I guess I'm wondering if there are useful basic approaches to handling this problem that I'm missing, or if I just need to work through the details more carefully. I'm startled at how often I get isolated failures for no apparent reason. Is that normal for most people monitoring services? I think I'm finding my connections time out now and then due simply to load, without the load actually being at all high.

--
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info
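
Mark's Oracle example in this thread describes putting a timeout inside the check code so that a hung connection attempt is killed and the check aborts. A related but different technique is sketched below: wrapping an existing check command in a hard time limit from the outside. This is an illustration only, not the code Mark's team wrote. The function name and the 10-second default are hypothetical; the exit codes follow the usual Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
import subprocess
import sys

# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def run_with_timeout(command, timeout_seconds):
    """Run a check command, killing it if it exceeds timeout_seconds.

    Returns (exit_code, message). A timeout is reported as CRITICAL here;
    reporting UNKNOWN instead is an equally reasonable choice.
    """
    try:
        completed = subprocess.run(command, capture_output=True, text=True,
                                   timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run has already killed the child process at this point.
        return CRITICAL, f"CRITICAL - check timed out after {timeout_seconds}s"
    message = completed.stdout.strip() or completed.stderr.strip()
    code = completed.returncode
    if code not in (OK, WARNING, CRITICAL, UNKNOWN):
        # Anything outside the plugin convention (e.g. killed by a signal).
        code = UNKNOWN
    return code, message


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python wrapper.py /path/to/check_command [args...]
    exit_code, text = run_with_timeout(sys.argv[1:], timeout_seconds=10)
    print(text)
    sys.exit(exit_code)
```

The design point is the same as in Mark's message: the check should fail fast and return a definite state rather than block the scheduler for minutes. The time limit should stay below whatever timeout the Nagios server or check_nrpe applies, so the wrapper reports the failure itself.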