[Nagios-users] Timeouts often false alarms and how to fix
Hi,

I have a few fc16 boxes, and alerts are often generated for timeouts while connecting to a remote service on a client, such as clamd or smtpd:

PROCS CRITICAL: 0 processes with command name 'smtpd', UID = 89 (postfix)

Why does this happen? It doesn't happen all the time. I know there are processes running with this command name, so I'm not sure why this alert would be generated. It most often happens for me with smtpd and clamd.

Is there a way to extend the timeout, or enable some further debugging to troubleshoot this?

Thanks,
Alex

___
Nagios-users mailing list
Nagios-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting any issue.
::: Messages without supporting info will risk being sent to /dev/null
[Nagios-users] Timeouts for send_nsca program
In /var/log/nagios/nagios.log on (at least) one of my slave servers, I am seeing messages like:

[1191905959] Warning: OCSP command '/usr/lib/nagios/plugins/tier1/submit_check_result.sh HOST SERVICE_CHECK OK MESSAGE' for service SERVICE NAME on host HOST timed out after 5 seconds

There have been 712 occurrences today (so far). Can anyone offer an explanation?

As far as I can tell there is no configuration to increase the timeout limit (can it be increased by installing from source?), but perhaps the message indicates another problem (network?)

Jonathan Wheeler
e-Science Centre
Rutherford Appleton Laboratory
Re: [Nagios-users] Timeouts for send_nsca program
-----Original Message-----
From: [EMAIL PROTECTED] [mailto:nagios-users-[EMAIL PROTECTED]] On Behalf Of Wheeler, JF (Jonathan)
Sent: Tuesday, October 09, 2007 10:24 AM
To: nagios-users@lists.sourceforge.net
Subject: [Nagios-users] Timeouts for send_nsca program

> In /var/log/nagios/nagios.log on (at least) one of my slave servers, I am seeing messages like:
>
> [1191905959] Warning: OCSP command '/usr/lib/nagios/plugins/tier1/submit_check_result.sh HOST SERVICE_CHECK OK MESSAGE' for service SERVICE NAME on host HOST timed out after 5 seconds
>
> There have been 712 occurrences today (so far). Can anyone offer an explanation?

Something is causing a delay when your submit_check_result.sh script presumably attempts to use send_nsca to send a small bit of data to another host. Can you replicate it from the command line? You should be able to.

> As far as I can tell there is no configuration to increase the timeout limit (can it be increased by installing from source?),

ocsp_timeout in nagios.cfg.

> but perhaps the message indicates another problem (network?)

Probably. If it's close to the documented submit_check_result, it's sending only a small bit of data and should only take a fraction of a second. I can use it to send data to two hosts in 0.011s. Maybe network problems or load on either end are causing a delay in the data transmission.

--
Marc
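One way to act on the suggestion of replicating the delay from the command line is to time repeated runs of the submit script and compare each run against the ocsp_timeout value. The sketch below is a minimal, hypothetical helper and not part of Nagios or NSCA: the function names are invented for illustration, and the arguments passed to the script are whatever your own submit_check_result.sh expects.

```python
import subprocess
import time


def time_command(argv, timeout_s):
    """Run argv once and return (elapsed_seconds, finished_in_time)."""
    start = time.monotonic()
    try:
        subprocess.run(
            argv,
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
        finished = True
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child before raising this exception.
        finished = False
    return time.monotonic() - start, finished


def count_timeouts(argv, timeout_s, runs):
    """Repeat the command and count how many runs exceeded timeout_s."""
    slow = 0
    for _ in range(runs):
        _, finished = time_command(argv, timeout_s)
        if not finished:
            slow += 1
    return slow
```

For example, count_timeouts(["/usr/lib/nagios/plugins/tier1/submit_check_result.sh", "HOST", "SERVICE_CHECK", "OK", "MESSAGE"], 5, 20) would run the script 20 times against the 5-second limit. The argument list simply mirrors the placeholders in the log line above. If runs from the command line never come close to the limit, load on the Nagios host at check time becomes a more likely suspect than the network.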
Re: [Nagios-users] Timeouts for send_nsca program
Make sure NSCA is still accepting the results from send_nsca. To do this, send over a result from the command line. Reviewing your submit_check_result.sh should give you the connection details of the destination NSCA, and send_nsca --help should assist with the command line.

This assumes that it used to work, and would only detect that it is currently broken, not why it broke or how to fix it.

If it is intermittent, you can use trial and error to find a good timeout value using the command line.

Other people would be better at diagnosing the cause of the issue.

Tony (author of NC_Net)

On 10/9/07, Wheeler, JF (Jonathan) [EMAIL PROTECTED] wrote:
> In /var/log/nagios/nagios.log on (at least) one of my slave servers, I am seeing messages like:
>
> [1191905959] Warning: OCSP command '/usr/lib/nagios/plugins/tier1/submit_check_result.sh HOST SERVICE_CHECK OK MESSAGE' for service SERVICE NAME on host HOST timed out after 5 seconds
>
> There have been 712 occurrences today (so far). Can anyone offer an explanation?
>
> As far as I can tell there is no configuration to increase the timeout limit (can it be increased by installing from source?), but perhaps the message indicates another problem (network?)
>
> Jonathan Wheeler
> e-Science Centre
> Rutherford Appleton Laboratory
Re: [Nagios-users] timeouts when using secondary dns
Hi,

I normally configure host addresses by their IP addresses. If that's not possible, I always use nscd as a very simple name server cache.

Cheers,
Gerd

On Friday, 10.11.2006 at 11:25 +1300, Steve Shipway wrote:
> We dealt with this by installing a local caching-only nameserver on the Nagios host itself. This also took a lot of the load off of the main nameservers. So, resolv.conf was set to use 127.0.0.1 by default and have our normal name servers as secondaries. A nice side effect was that it vastly sped up the name resolution.
>
> Steve
> --
> Steve Shipway
> ITSS, University of Auckland
> (09) 3737 599 x 86487
> [EMAIL PROTECTED]
>
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of stucky
> Sent: Friday, 10 November 2006 6:57 a.m.
> To: Az
> Cc: nagios
> Subject: Re: [Nagios-users] timeouts when using secondary dns
>
> Yey !! That totally did it. Thx AZ
> I hadn't even considered messing with the resolver cuz I was sure it was a nagios issue so I had to fix nagios. If that wasn't a textbook example of how well mailing lists can work then I don't know what is... thx
>
> On 11/7/06, Az [EMAIL PROTECTED] wrote:
>> stucky wrote:
>>> I use the check_by_ssh plugin for most of my stuff and I noticed that if the primary nameserver is unavailable nagios starts freaking out. All of a sudden all plugins time out. I tested it using the 'host' command and it only takes about 1 second longer to look up hosts using the secondary nameserver. The default timeout for check_by_ssh is 10 seconds. I cranked it up to 30 and still I get timeouts. I'm not sure I understand that one. Has anyone else seen this?
>>
>> We had a similar issue in that our primary DNS was doing strange things, and it quite often took 5 or even 10 seconds to perform a DNS lookup. What we were seeing was 70% of service checks (and subsequently host checks) failing by timing out.
>>
>> The key was the multiple of 5 seconds. The resolver timeout on, say, RHEL3 is based on RES_TIMEOUT in resolv.h... which was 5 seconds.
>>
>> We added the following to our resolv.conf, and found the problems went away:
>>
>> options timeout:2 rotate
>>
>> This sets the timeout for waiting for a reply to 2 seconds, and tells the resolver to rotate through your 'nameserver' entries rather than always hitting #1, then #2, etc.
>>
>> Cheers.
>
> --
> stucky
Re: [Nagios-users] timeouts when using secondary dns
We dealt with this by installing a local caching-only nameserver on the Nagios host itself. This also took a lot of the load off of the main nameservers. So, resolv.conf was set to use 127.0.0.1 by default and have our normal name servers as secondaries. A nice side effect was that it vastly sped up the name resolution.

Steve
--
Steve Shipway
ITSS, University of Auckland
(09) 3737 599 x 86487
[EMAIL PROTECTED]

From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of stucky
Sent: Friday, 10 November 2006 6:57 a.m.
To: Az
Cc: nagios
Subject: Re: [Nagios-users] timeouts when using secondary dns

Yey !! That totally did it. Thx AZ
I hadn't even considered messing with the resolver cuz I was sure it was a nagios issue so I had to fix nagios. If that wasn't a textbook example of how well mailing lists can work then I don't know what is... thx

On 11/7/06, Az [EMAIL PROTECTED] wrote:
> stucky wrote:
>> I use the check_by_ssh plugin for most of my stuff and I noticed that if the primary nameserver is unavailable nagios starts freaking out. All of a sudden all plugins time out. I tested it using the 'host' command and it only takes about 1 second longer to look up hosts using the secondary nameserver. The default timeout for check_by_ssh is 10 seconds. I cranked it up to 30 and still I get timeouts. I'm not sure I understand that one. Has anyone else seen this?
>
> We had a similar issue in that our primary DNS was doing strange things, and it quite often took 5 or even 10 seconds to perform a DNS lookup. What we were seeing was 70% of service checks (and subsequently host checks) failing by timing out.
>
> The key was the multiple of 5 seconds. The resolver timeout on, say, RHEL3 is based on RES_TIMEOUT in resolv.h... which was 5 seconds.
>
> We added the following to our resolv.conf, and found the problems went away:
>
> options timeout:2 rotate
>
> This sets the timeout for waiting for a reply to 2 seconds, and tells the resolver to rotate through your 'nameserver' entries rather than always hitting #1, then #2, etc.
>
> Cheers.

--
stucky
Re: [Nagios-users] timeouts when using secondary dns
Hi!

Just to let you know that I've made a change to CVS today, reported by Pawel Malachowski, where it looked like the plugins were making too many calls to resolver/DNS when the plugins were compiled with IPv6 options enabled.

This should reduce the occasions of timeouts. However, I do like the idea of making the Nagios server a caching name server too...

Ton

On 9 Nov 2006, at 22:25, Steve Shipway wrote:
> We dealt with this by installing a local caching-only nameserver on the Nagios host itself. This also took a lot of the load off of the main nameservers. So, resolv.conf was set to use 127.0.0.1 by default and have our normal name servers as secondaries. A nice side effect was that it vastly sped up the name resolution.
>
> Steve
> --
> Steve Shipway
> ITSS, University of Auckland
> (09) 3737 599 x 86487
> [EMAIL PROTECTED]
>
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of stucky
> Sent: Friday, 10 November 2006 6:57 a.m.
> To: Az
> Cc: nagios
> Subject: Re: [Nagios-users] timeouts when using secondary dns
>
> Yey !! That totally did it. Thx AZ
> I hadn't even considered messing with the resolver cuz I was sure it was a nagios issue so I had to fix nagios. If that wasn't a textbook example of how well mailing lists can work then I don't know what is... thx
>
> On 11/7/06, Az [EMAIL PROTECTED] wrote:
>> stucky wrote:
>>> I use the check_by_ssh plugin for most of my stuff and I noticed that if the primary nameserver is unavailable nagios starts freaking out. All of a sudden all plugins time out. I tested it using the 'host' command and it only takes about 1 second longer to look up hosts using the secondary nameserver. The default timeout for check_by_ssh is 10 seconds. I cranked it up to 30 and still I get timeouts. I'm not sure I understand that one. Has anyone else seen this?
>>
>> We had a similar issue in that our primary DNS was doing strange things, and it quite often took 5 or even 10 seconds to perform a DNS lookup. What we were seeing was 70% of service checks (and subsequently host checks) failing by timing out.
>>
>> The key was the multiple of 5 seconds. The resolver timeout on, say, RHEL3 is based on RES_TIMEOUT in resolv.h... which was 5 seconds.
>>
>> We added the following to our resolv.conf, and found the problems went away:
>>
>> options timeout:2 rotate
>>
>> This sets the timeout for waiting for a reply to 2 seconds, and tells the resolver to rotate through your 'nameserver' entries rather than always hitting #1, then #2, etc.
>>
>> Cheers.
>
> --
> stucky

http://www.altinity.com
T: +44 (0)870 787 9243
F: +44 (0)845 280 1725
Skype: tonvoon
Re: [Nagios-users] timeouts when using secondary dns
stucky wrote:
> I use the check_by_ssh plugin for most of my stuff and I noticed that if the primary nameserver is unavailable nagios starts freaking out. All of a sudden all plugins time out. I tested it using the 'host' command and it only takes about 1 second longer to look up hosts using the secondary nameserver. The default timeout for check_by_ssh is 10 seconds. I cranked it up to 30 and still I get timeouts. I'm not sure I understand that one. Has anyone else seen this?

We had a similar issue in that our primary DNS was doing strange things, and it quite often took 5 or even 10 seconds to perform a DNS lookup. What we were seeing was 70% of service checks (and subsequently host checks) failing by timing out.

The key was the multiple of 5 seconds. The resolver timeout on, say, RHEL3 is based on RES_TIMEOUT in resolv.h... which was 5 seconds.

We added the following to our resolv.conf, and found the problems went away:

options timeout:2 rotate

This sets the timeout for waiting for a reply to 2 seconds, and tells the resolver to rotate through your 'nameserver' entries rather than always hitting #1, then #2, etc.

Cheers.
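The "multiple of 5 seconds" reasoning can be made concrete with a small model. The sketch below parses resolv.conf-style text and estimates how long a single lookup waits on dead nameservers before it reaches a live one. It is a deliberately simplified model, assuming each dead server costs exactly one timeout and ignoring retries and caching; the function names and the 192.0.2.x addresses are illustrative only.

```python
def parse_resolv_conf(text):
    """Return (nameservers, timeout_s, rotate) from resolv.conf-style text."""
    nameservers = []
    timeout_s = 5  # the RES_TIMEOUT default cited in the thread
    rotate = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;":
            continue  # skip blank lines and comments
        fields = line.split()
        if fields[0] == "nameserver" and len(fields) > 1:
            nameservers.append(fields[1])
        elif fields[0] == "options":
            for opt in fields[1:]:
                if opt.startswith("timeout:"):
                    timeout_s = int(opt.split(":", 1)[1])
                elif opt == "rotate":
                    rotate = True
    return nameservers, timeout_s, rotate


def delay_before_answer(nameservers, dead, timeout_s, start=0):
    """Seconds spent waiting on dead servers before a live one is tried.

    Simplified assumption: one pass over the list, one timeout per dead
    server. 'start' is the index of the first server tried, which is how
    this sketch models the effect of 'rotate'.
    """
    waited = 0
    count = len(nameservers)
    for i in range(count):
        server = nameservers[(start + i) % count]
        if server not in dead:
            return waited
        waited += timeout_s
    return None  # no live server found in one pass


conf_default = "nameserver 192.0.2.1\nnameserver 192.0.2.2\n"
conf_tuned = conf_default + "options timeout:2 rotate\n"
```

With the primary (192.0.2.1) dead, the default configuration costs 5 seconds per lookup in this model, so a check that performs two lookups already spends 10 seconds on DNS alone. With timeout:2 the same lookup costs 2 seconds, and a lookup that rotate happens to start at the secondary costs nothing.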
[Nagios-users] timeouts when using secondary dns
Guys,

I use the check_by_ssh plugin for most of my stuff and I noticed that if the primary nameserver is unavailable nagios starts freaking out. All of a sudden all plugins time out. I tested it using the 'host' command and it only takes about 1 second longer to look up hosts using the secondary nameserver. The default timeout for check_by_ssh is 10 seconds. I cranked it up to 30 and still I get timeouts. I'm not sure I understand that one.

Has anyone else seen this?

--
stucky
Re: [Nagios-users] timeouts when using secondary dns
On Mon, 6 Nov 2006, stucky wrote:
> Guys
> I use the check_by_ssh plugin for most of my stuff and I noticed that if the primary nameserver is unavailable nagios starts freaking out. All of a sudden all plugins time out. I tested it using the 'host' command and it only takes about 1 second longer to look up hosts using the secondary nameserver. The default timeout for check_by_ssh is 10 seconds. I cranked it up to 30 and still I get timeouts. I'm not sure I understand that one.

You can try to add the nagios host to the hosts file of one of the servers you are monitoring, to see which end suffers the most from a freaked-out DNS.

Hugo.

--
[EMAIL PROTECTED] http://hvdkooij.xs4all.nl/
This message is using 100% recycled electrons.
[Nagios-users] timeouts and performance info
Hi!

I have the following values in my nagios.cfg:

service_check_timeout=60
host_check_timeout=30
event_handler_timeout=30
notification_timeout=30
ocsp_timeout=5
perfdata_timeout=5

As far as I know, those values are in seconds. What I wonder is why I still have Service and Host Checks that take longer than fifteen minutes to complete. This shouldn't be the case the way I understand it. Here's my current perf info:

Active Service Checks:
  <= 1 minute:          81 (4.6%)
  <= 5 minutes:         1719 (97.4%)
  <= 15 minutes:        1727 (97.9%)
  <= 1 hour:            1727 (97.9%)
  Since program start:  1727 (97.9%)

and

                        Min        Max          Average
Check Execution Time:   0.00 sec   12.92 sec    0.275 sec
Check Latency:          0.00 sec   204.30 sec   3.043 sec
Percent State Change:   0.00%      15.46%       0.02%

Active Host Checks:
  <= 1 minute:          0 (0.0%)
  <= 5 minutes:         3 (1.2%)
  <= 15 minutes:        3 (1.2%)
  <= 1 hour:            4 (1.6%)
  Since program start:  27 (10.8%)

and

                        Min        Max          Average
Check Execution Time:   0.02 sec   10.05 sec    0.208 sec
Check Latency:          0.00 sec   17.48 sec    0.204 sec
Percent State Change:   0.00%      0.00%        0.00%

Am I the only one seeing a discrepancy here? The only way I can make sense of this is that the <= 15 minutes means time from being scheduled to actually starting the plugin. In that case I wonder what makes it take so long; the machine should be beefy enough (dual PIV Xeon 2.8 GHz, 2G of RAM).

Any hints/thoughts are appreciated.

Regards,
Tobias
Re: [Nagios-users] timeouts and performance info
-----Original Message-----
From: [EMAIL PROTECTED] [mailto:nagios-users-[EMAIL PROTECTED]] On Behalf Of Tobias Klausmann
Sent: Wednesday, August 30, 2006 2:55 AM
To: nagios-users@lists.sourceforge.net
Subject: [Nagios-users] timeouts and performance info

> Hi! I have the following values in my nagios.cfg:
>
> service_check_timeout=60
> host_check_timeout=30
> event_handler_timeout=30
> notification_timeout=30
> ocsp_timeout=5
> perfdata_timeout=5
>
> As far as I know, those values are in seconds. What I wonder is why I still have Service and Host Checks that take longer than fifteen minutes to complete. This shouldn't be the case the way I understand it. Here's my current perf info:

The timeouts above apply from when a particular plugin starts to when it completes (check execution time). As noted below, that time is at most 12.92 seconds for you (0.275 seconds on average). They don't affect when a plugin is scheduled to run.

> Active Service Checks:
>   <= 1 minute:          81 (4.6%)
>   <= 5 minutes:         1719 (97.4%)
>   <= 15 minutes:        1727 (97.9%)
>   <= 1 hour:            1727 (97.9%)
>   Since program start:  1727 (97.9%)

This seems mostly normal for a 5 minute check_interval. The small difference between the 5 and 15 minute counts is normal, as checks may be just starting to execute or still in progress at the 5 minute mark. It does appear that you have some number of services that are not scheduled for execution or are executing at really long intervals. Look at Service Detail and sort by last check. Re-examine your configuration for those services that do not appear to be scheduled properly.

> and
>
> Check Execution Time:   0.00 sec   12.92 sec    0.275 sec
> Check Latency:          0.00 sec   204.30 sec   3.043 sec
> Percent State Change:   0.00%      15.46%       0.02%

Looks pretty good to me. The high max check latency number may have been a one-off event. If that number regularly changes and is always very high then you might want to verify that you're not starving nagios for checks by running

/path/to/nagios/bin/nagios -s /path/to/nagios/etc/nagios

and make sure you meet or exceed its recommended values.

> Active Host Checks:
>   <= 1 minute:          0 (0.0%)
>   <= 5 minutes:         3 (1.2%)
>   <= 15 minutes:        3 (1.2%)
>   <= 1 hour:            4 (1.6%)
>   Since program start:  27 (10.8%)
>
> and
>
> Check Execution Time:   0.02 sec   10.05 sec    0.208 sec
> Check Latency:          0.00 sec   17.48 sec    0.204 sec
> Percent State Change:   0.00%      0.00%        0.00%

These look normal and expected. You've had 27 service failures since program start necessitating host checks.

> Am I the only one seeing a discrepancy here?

The only discrepancy I see is likely due to configuration. You probably have check intervals or timeperiods misconfigured for ~30 services.

> The only way I can make sense of this is that the <= 15 minutes means time from being scheduled to actually starting the plugin. In that case I wonder what makes it take so long, the

Check Latency is that number. On average nagios is able to run your checks within 3.043 seconds of when they are scheduled to run. The number you are referring to is just a simple count of the number of plugins that have been run in that time interval.

--
Marc
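The three quantities discussed in this reply are easy to confuse, so here is a minimal sketch that computes each of them from the same data. It is not Nagios code: the CheckRun record and the function names are hypothetical, and times are plain seconds.

```python
from dataclasses import dataclass


@dataclass
class CheckRun:
    scheduled: float  # when the check was due to run
    started: float    # when the plugin was actually launched
    finished: float   # when the plugin returned its result


def latency(run):
    """Delay between the scheduled time and the actual start."""
    return run.started - run.scheduled


def execution_time(run):
    """How long the plugin itself ran; this is what the timeouts limit."""
    return run.finished - run.started


def checked_within(runs, now, window_s):
    """Count checks whose latest run finished in the last window_s seconds."""
    return sum(1 for run in runs if now - run.finished <= window_s)


now = 1000.0
runs = [
    CheckRun(scheduled=990.0, started=993.0, finished=993.5),
    CheckRun(scheduled=700.0, started=700.0, finished=701.0),
    CheckRun(scheduled=100.0, started=100.0, finished=101.0),
]
```

Here checked_within(runs, now, 60) counts one check, the 5-minute window counts two, and the 15-minute window counts all three, yet no plugin ran for longer than one second. The "<= N minutes" rows describe how recently checks were run, while execution time and latency are separate per-check measurements.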
Re: [Nagios-users] timeouts and performance info
Hi!

On Wed, 30 Aug 2006, Marc Powell wrote:
>> Active Service Checks:
>>   <= 1 minute:          81 (4.6%)
>>   <= 5 minutes:         1719 (97.4%)
>>   <= 15 minutes:        1727 (97.9%)
>>   <= 1 hour:            1727 (97.9%)
>>   Since program start:  1727 (97.9%)
>
> This seems mostly normal for a 5 minute check_interval. The small difference between the 5 and 15 minute counts is normal, as checks may be just starting to execute or still in progress at the 5 minute mark. It does appear that you have some number of services that are not scheduled for execution or are executing at really long intervals. Look at Service Detail and sort by last check. Re-examine your configuration for those services that do not appear to be scheduled properly.

I have a few services that are disabled entirely (don't check actively, don't accept passive checks). Would they count in the above statistic? They seem to fit in with the missing 2.1% (100-97.9). Also, I saw a few checks that were last run about ~20 minutes ago. Those are log checks via NRPE that complete within 1s (no noticeable delay) if run directly on the machine (as user nagios of course). It seems acceptable (and I neither know why it would take 20 minutes nor how to find out why), so I'm willing to let it slide ;).

> Looks pretty good to me. The high max check latency number may have been a one-off event. If that number regularly changes and is always very high then you might want to verify that you're not starving nagios for checks by running
>
> /path/to/nagios/bin/nagios -s /path/to/nagios/etc/nagios
>
> and make sure you meet or exceed its recommended values.

I guessed as much for the one-off event. It doesn't change, so I feel somewhat safe. As for the recommended values (-s), Nagios says it's okay the way it is.

>> Active Host Checks:
>>   <= 1 minute:          0 (0.0%)
>>   <= 5 minutes:         3 (1.2%)
>>   <= 15 minutes:        3 (1.2%)
>>   <= 1 hour:            4 (1.6%)
>>   Since program start:  27 (10.8%)
>>
>> and
>>
>> Check Execution Time:   0.02 sec   10.05 sec    0.208 sec
>> Check Latency:          0.00 sec   17.48 sec    0.204 sec
>> Percent State Change:   0.00%      0.00%        0.00%
>
> These look normal and expected. You've had 27 service failures since program start necessitating host checks.

That is in line with what I'd expect.

>> Am I the only one seeing a discrepancy here?
>
> The only discrepancy I see is likely due to configuration. You probably have check intervals or timeperiods misconfigured for ~30 services.

About that number of services are disabled entirely right now, so if they count into the statistic, it explains the figures.

>> The only way I can make sense of this is that the <= 15 minutes means time from being scheduled to actually starting the plugin. In that case I wonder what makes it take so long, the
>
> Check Latency is that number. On average nagios is able to run your checks within 3.043 seconds of when they are scheduled to run. The number you are referring to is just a simple count of the number of plugins that have been run in that time interval.

So it means "in the last N minutes, this many services completed" and *not* "this many services needed N minutes to complete" (from being started to delivering the retval)? That would be an eye opener for me :)

Regards and thanks,
Tobias

--
You don't need eyes to see, you need vision.
Re: [Nagios-users] timeouts and performance info
-----Original Message-----
From: [EMAIL PROTECTED] [mailto:nagios-users-[EMAIL PROTECTED]] On Behalf Of Tobias Klausmann
Sent: Wednesday, August 30, 2006 8:44 AM
To: nagios-users@lists.sourceforge.net
Subject: Re: [Nagios-users] timeouts and performance info

> Hi!
>
> On Wed, 30 Aug 2006, Marc Powell wrote:
>>> Active Service Checks:
>>>   <= 1 minute:          81 (4.6%)
>>>   <= 5 minutes:         1719 (97.4%)
>>>   <= 15 minutes:        1727 (97.9%)
>>>   <= 1 hour:            1727 (97.9%)
>>>   Since program start:  1727 (97.9%)
>
> I have a few services that are disabled entirely (don't check actively, don't accept passive checks). Would they count in the above statistic? They seem to fit in with the missing 2.1% (100-97.9). Also, I saw a few checks that were last run about ~20

Without reviewing the code, that is what I expect to be the case.

>> Check Latency is that number. On average nagios is able to run your checks within 3.043 seconds of when they are scheduled to run. The number you are referring to is just a simple count of the number of plugins that have been run in that time interval.
>
> So it means "in the last N minutes, this many services completed" and *not* "this many services needed N minutes to complete" (from being started to delivering the retval)? That would be an eye opener for me :)

That is a correct interpretation.

--
Marc
[Nagios-users] Timeouts
Hi folks,

I have a problem concerning timeouts. First the basics: I run Nagios 2.3.1 on Debian Sarge stable.

I have configured service_check_timeout=60, but in certain circumstances (e.g. slow DNS) I get the error "Plugin timed out after 10 seconds" or "Socket timed out after 10 seconds".

Is there another timeout value I have to configure to get rid of this 10-second threshold?

I know that I should work on my DNS first, but I want to understand what decisions Nagios makes there.

Any hint or help is appreciated.

Dirk
Re: [Nagios-users] Timeouts
-----Original Message-----
From: [EMAIL PROTECTED] On Behalf Of Dirk H. Schulz
Sent: Wednesday, August 23, 2006 4:43 AM
To: nagios-users@lists.sourceforge.net
Subject: [Nagios-users] Timeouts

> Hi folks,
>
> I have a problem concerning timeouts. First the basics: I run Nagios 2.3.1 on Debian Sarge stable. I have configured service_check_timeout=60, but in certain circumstances (e.g. slow DNS) I get the error "Plugin timed out after 10 seconds" or "Socket timed out after 10 seconds". Is there another timeout value I have to configure to get rid of this 10-second threshold?

Yes. The service_check_timeout in nagios.cfg is a last-resort timeout. If a plugin hasn't terminated itself within that period, Nagios will kill it. All of the standard plugins (I believe) can be passed a timeout value on their command line, usually via -t. If none is passed, they'll use whatever value is hard-coded (usually 10 seconds). Run the plugins you use with '--help' to see their timeout parameters.

> I know that I should work on my DNS first, but I want to understand what decisions Nagios makes there.

Or use IPs instead of names so you don't rely on an external service that can possibly fail. ;)

--
marc
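The two timeout layers described in this reply can be sketched as follows. This is an illustrative Python model, not Nagios code: first_timeout and DEFAULT_PLUGIN_TIMEOUT are hypothetical names, and the 10-second default is only the "usually 10 seconds" figure mentioned above; individual plugins may differ.

```python
# Assumption: plugins fall back to 10 seconds when no -t value is given.
DEFAULT_PLUGIN_TIMEOUT = 10


def first_timeout(service_check_timeout, plugin_timeout=None):
    """Return (seconds, who) for whichever timeout would fire first.

    plugin_timeout models the value passed to the plugin via -t;
    service_check_timeout models the last-resort limit in nagios.cfg.
    On a tie this sketch arbitrarily reports the plugin.
    """
    if plugin_timeout is None:
        plugin_timeout = DEFAULT_PLUGIN_TIMEOUT
    if plugin_timeout <= service_check_timeout:
        return plugin_timeout, "plugin"
    return service_check_timeout, "nagios"


# Dirk's situation: service_check_timeout=60 and no -t on the command line.
print(first_timeout(60))
# Passing a longer -t value raises the plugin's own limit.
print(first_timeout(60, plugin_timeout=30))
# A -t value above service_check_timeout means Nagios kills the plugin first.
print(first_timeout(60, plugin_timeout=90))
```

Under these assumptions, raising service_check_timeout alone does not help in Dirk's case, because the plugin's own limit is still the smaller of the two.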
[Nagios-users] Timeouts
Hi,

I have a question regarding the various timeout variables. The service_check_timeout variable in nagios.cfg has a value of 99, the command that invokes the remote plugin has the -t flag set to 170, and the remote plugin has a timeout value of 130 s.

What's happening is that Nagios returns a "CRITICAL: Service Check Timed Out" error when the remote plugin sometimes exceeds 120 s.

My question is: which of the timeout values is being enforced? I assume it's the 99 s defined in nagios.cfg, BUT if that is so, how does the plugin log indicate that it executed fine until 120 s?

I hope the above was clear!

Thanks in adv,
sg