Re: [Nagios-users] Passive monitoring is running slow?
-----Original Message-----
From: Thomas Guyot-Sionnest [mailto:[EMAIL PROTECTED]
Sent: Tuesday, May 01, 2007 4:29 PM
To: Jonathan Call
Cc: nagios-users@lists.sourceforge.net
Subject: Re: [Nagios-users] Passive monitoring is running slow?

> On 01/05/07 05:15 PM, Jonathan Call wrote:
>> I have set up a distributed monitoring system per the Nagios documentation. I initially tested it out by having the distributed server monitor only 24 or so services on about 8 hosts. There didn't seem to be any problems. I then cranked it up to 427 services on 81 hosts. I'm watching the distributed server right now and there is hardly any system load, but the Service Check Latency seems extremely high:
>>
>> Metric                 Min.       Max.        Average
>> Check Execution Time:  0.05 sec   1.67 sec    0.701 sec
>> Check Latency:         60.40 sec  287.36 sec  184.514 sec
>> Percent State Change:  0.00%      0.00%       0.00%
>>
>> This is resulting in 50% or less of the service checks completing within the 5-minute timeframe.
>>
>> The central server has had no significant change in performance at all and seems to be receiving and processing everything without difficulty. The nsca server on the central server is running with the following arguments:
>>
>> /usr/local/sbin/nsca --daemon -c /usr/local/etc/nsca.cfg
>>
>> The submit_check_result script on the distributed server is right out of the documentation.
>
> There are many ways to do that; my favorite (obviously, since I wrote it :) ) is to use the host and service performance data files as named pipes and have a daemon reap them and batch-send the data to send_nsca. The howto is here (and I'll be more than happy to answer your questions or get your feedback):
>
> http://www.nagioscommunity.org/wiki/index.php/OCP_Daemon
>
> It will require Libevent and the Perl module Event::Lib.
>
> Thomas

So this is a known design failure in Nagios, then? I'm fairly new to Nagios and I am completely dumbfounded at this.
If you can't service even a quarter (and probably even a tenth) as many hosts and services on a distributed server as you can on a regular active server, then what is the point of having a distributed model at all? I will take a look at your batch-sending method.

-
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
___
Nagios-users mailing list
Nagios-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting any issue.
::: Messages without supporting info will risk being sent to /dev/null
Re: [Nagios-users] Passive monitoring is running slow?
-----Original Message-----
From: [EMAIL PROTECTED] [mailto:nagios-users-[EMAIL PROTECTED] On Behalf Of Jonathan Call
Sent: Wednesday, May 02, 2007 10:07 AM
To: nagios-users@lists.sourceforge.net
Subject: Re: [Nagios-users] Passive monitoring is running slow?

> -----Original Message-----
> From: Thomas Guyot-Sionnest [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, May 01, 2007 4:29 PM
> To: Jonathan Call
> Cc: nagios-users@lists.sourceforge.net
> Subject: Re: [Nagios-users] Passive monitoring is running slow?
>
>> On 01/05/07 05:15 PM, Jonathan Call wrote:
>>> I have set up a distributed monitoring system per the Nagios documentation. I initially tested it out by having the distributed server monitor only 24 or so services on about 8 hosts. There didn't seem to be any problems. I then cranked it up to 427 services on 81 hosts. I'm watching the distributed server right now and there is hardly any system load, but the Service Check Latency seems extremely high:
>>>
>>> Metric                 Min.       Max.        Average
>>> Check Execution Time:  0.05 sec   1.67 sec    0.701 sec
>>> Check Latency:         60.40 sec  287.36 sec  184.514 sec
>>> Percent State Change:  0.00%      0.00%       0.00%
>>>
>>> This is resulting in 50% or less of the service checks completing within the 5-minute timeframe.
>
> So this is a known design failure in Nagios, then?

Absolutely not.

> I'm fairly new to Nagios and I am completely dumbfounded at this. If you can't service even a quarter (and probably even a tenth) as many hosts and services on a distributed server as you can on a regular active server, then what is the point of having a distributed model at all?

I have 5 data collector machines running Nagios -and- Cricket for thousands of services each, with Nagios reporting all results back to two central hosts as documented. Average latency is 0.689 seconds and maximum latency is 3.65 seconds right now. The distributed server should perform exactly like a regular active server as far as latency stats are concerned.
You're either starving Nagios of the resources it needs to run its active checks (run ~nagios/bin/nagios -s ~nagios/etc/nagios.cfg to see recommended settings) or, less likely, something is wrong with your submit_check_result script. If you submit a result from the command line, does it complete in a timely manner? If you disable OCSP, does the latency go away?

Basic troubleshooting dictates that you enable features on your distributed machine methodically, turning it step by step from an active-only server into an active server that submits check results via OCSP:

1. Disable OCSP program-wide (nagios.cfg). Test.
2. Enable OCSP, but have your OCSP script do everything except call send_nsca. Test.
3. Enable send_nsca in your OCSP script. Test.
...

Do you have regular host checks enabled? Post the output of nagios -v and nagios -s.

--
Marc
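Marc's question about whether a single submission completes in a timely manner can be made concrete with the figures from the original post. A minimal sketch, assuming that every one of the 427 services is checked on the 5-minute interval mentioned in the thread and that Nagios waits for each OCSP command to finish before handling the next result (an assumption about this setup, not something stated in the thread): the per-result time budget is the interval divided by the number of results, roughly 0.70 seconds here. If one run of the OCSP script takes longer than that, results queue up and latency keeps growing. `per_result_budget` and `time_command` are hypothetical helper names; `time_command` simply times one invocation of a command.

```python
import subprocess
import sys
import time


def per_result_budget(interval_s: float, results_per_interval: int) -> float:
    """Seconds each OCSP call may take if calls run strictly one after another.

    Assumes every result is produced once per interval and that OCSP calls
    are serialized; both are assumptions, not measured facts.
    """
    if results_per_interval <= 0:
        raise ValueError("results_per_interval must be positive")
    return interval_s / results_per_interval


def time_command(argv: list[str]) -> float:
    """Run argv once and return the wall-clock time it took, in seconds."""
    start = time.monotonic()
    subprocess.run(argv, check=False)
    return time.monotonic() - start


if __name__ == "__main__":
    # 427 services on a 5-minute (300 s) interval, from the original post.
    budget = per_result_budget(300.0, 427)
    print(f"per-result budget: {budget:.2f} s")
    # Stand-in command: time a trivial Python process. Replace argv with
    # your own OCSP / submit_check_result invocation to measure it.
    elapsed = time_command([sys.executable, "-c", "pass"])
    print(f"elapsed: {elapsed:.3f} s, within budget: {elapsed <= budget}")
```

To time the real script, pass its full command line, for example `time_command(["/path/to/submit_check_result", "somehost", "PING", "OK", "test output"])`, where the path and arguments are placeholders for your own installation.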
Re: [Nagios-users] Passive monitoring is running slow?
On 01/05/07 05:15 PM, Jonathan Call wrote:
> I have set up a distributed monitoring system per the Nagios documentation. I initially tested it out by having the distributed server monitor only 24 or so services on about 8 hosts. There didn't seem to be any problems. I then cranked it up to 427 services on 81 hosts. I'm watching the distributed server right now and there is hardly any system load, but the Service Check Latency seems extremely high:
>
> Metric                 Min.       Max.        Average
> Check Execution Time:  0.05 sec   1.67 sec    0.701 sec
> Check Latency:         60.40 sec  287.36 sec  184.514 sec
> Percent State Change:  0.00%      0.00%       0.00%
>
> This is resulting in 50% or less of the service checks completing within the 5-minute timeframe.
>
> The central server has had no significant change in performance at all and seems to be receiving and processing everything without difficulty. The nsca server on the central server is running with the following arguments:
>
> /usr/local/sbin/nsca --daemon -c /usr/local/etc/nsca.cfg
>
> The submit_check_result script on the distributed server is right out of the documentation.

There are many ways to do that; my favorite (obviously, since I wrote it :) ) is to use the host and service performance data files as named pipes and have a daemon reap them and batch-send the data to send_nsca. The howto is here (and I'll be more than happy to answer your questions or get your feedback):

http://www.nagioscommunity.org/wiki/index.php/OCP_Daemon

It will require Libevent and the Perl module Event::Lib.

Thomas
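The batching idea can be illustrated without the OCP_Daemon itself, which is Perl code built on Libevent and Event::Lib that reads the performance data files as named pipes. The Python sketch below only shows the core step, under the assumption that send_nsca reads tab-delimited service results from standard input, one result per line (host name, service description, return code, plugin output): many results are formatted into one payload and handed to a single send_nsca process instead of starting one process per result. It converts the state string to a numeric code, as the documented submit_check_result script also does, using the standard plugin codes 0 to 3. `ServiceResult`, `format_batch` and `send_batch` are hypothetical names; only send_nsca's `-H` and `-c` options are used.

```python
import subprocess
from typing import Iterable, NamedTuple

# Standard Nagios plugin return codes for each service state string.
STATE_CODES = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}


class ServiceResult(NamedTuple):
    host: str
    service: str
    state: str   # "OK", "WARNING", "CRITICAL" or "UNKNOWN"
    output: str  # plugin output text


def _clean(field: str) -> str:
    # Tabs and newlines inside a field would break the tab/line format.
    return field.replace("\t", " ").replace("\n", " ")


def format_batch(results: Iterable[ServiceResult]) -> str:
    """Build send_nsca stdin: one tab-delimited service result per line."""
    lines = []
    for r in results:
        code = STATE_CODES.get(r.state, 3)  # unrecognized strings -> UNKNOWN
        lines.append(
            f"{_clean(r.host)}\t{_clean(r.service)}\t{code}\t{_clean(r.output)}\n"
        )
    return "".join(lines)


def send_batch(results: Iterable[ServiceResult], central_host: str,
               config_path: str, send_nsca: str = "send_nsca") -> int:
    """Send all results through one send_nsca process; return its exit code."""
    payload = format_batch(results)
    if not payload:
        return 0  # nothing to send, so no process is started
    proc = subprocess.run([send_nsca, "-H", central_host, "-c", config_path],
                          input=payload, text=True, capture_output=True)
    return proc.returncode
```

In use, a collector would gather results for a short period and then call, for example, `send_batch(results, "central.example.com", "/usr/local/nagios/etc/send_nsca.cfg")`; the host name and config path are placeholders. The trade-off is a small delay while results accumulate in exchange for far fewer process starts and network connections.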