[Nagios-users] Severe performance issue during major network outage
Hi,

I have recently set up Nagios 2.8 and am monitoring 1623 hosts and 1946 services. Performance under normal circumstances is fine. Typical check and latency times are as follows:

Monitoring Performance
Service Check Execution Time: 0.03 / 11.04 / 3.418 sec
Service Check Latency: 0.00 / 1.87 / 0.479 sec
Host Check Execution Time: 0.03 / 10.04 / 0.843 sec
Host Check Latency: 0.00 / 0.00 / 0.000 sec
# Active Host / Service Checks: 1623 / 1946
# Passive Host / Service Checks: 0 / 0

The vast majority of these hosts are spread over 320 geographic locations throughout the UK. These locations are connected to our data centre via a hardware VPN device, with the majority (about 270) using a private ADSL circuit to facilitate the VPN connection.

Yesterday, we had a major outage caused by the failure of one of the ADSL central routers at our ISP. This took out a third of our ADSL sites (roughly 90) for 16 minutes. Each of these sites has about 4 devices monitored by Nagios, so in effect about 360 devices (hosts) went down in an instant. As you can imagine, we were aware of the problem almost immediately due to the barrage of phone calls from our clients, but unfortunately Nagios didn't even remotely reflect the current situation.

I have used parent/child relationships to the full, so I was expecting a good portion of the VPN devices to show as down, with all other devices behind each VPN device showing as unreachable. This was not the case. It actually took half an hour for Nagios to find only 20 of the 90 VPN devices down, and another half an hour to notice that those 20 were back up again.
During the outage, the service check latency was increasing exponentially, and the performance stats half an hour after the start of the problem were as follows:

Monitoring Performance
Service Check Execution Time: 0.03 / 11.04 / 3.646 sec
Service Check Latency: 947.84 / 2080.05 / 1467.274 sec
Host Check Execution Time: 0.03 / 10.04 / 0.968 sec
Host Check Latency: 0.00 / 0.00 / 0.000 sec
# Active Host / Service Checks: 1623 / 1946
# Passive Host / Service Checks: 0 / 0

As you can see, the average service check latency has jumped to 1467 seconds (24 mins). On all of these hosts there is only one service, which is a ping:

check_ping -H $HOSTADDRESS$ -w $ARG1$ -c $ARG2$ -p 5

The host check is also a ping, but much faster, with only 1 ping being sent out:

check_ping -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 1

The normal_check_interval on services is 5 mins, with 2 max_check_attempts and a retry_interval of 1. The host also has a max_check_attempts of 2.

A lot of people have mentioned using fping to speed things up, but if my average service latency is only 0.479 seconds in normal circumstances, I can't see how tweaking this will help in a major outage situation.

I have also read through the section on tweaking performance, which seems to be geared toward protecting the machine Nagios is running on. I want to do the opposite and give Nagios a lot more work to do. The machine is dedicated to Nagios and is quite high spec. It's an IBM xSeries 336 with 2 dual-core processors and 4GB of RAM, so it should be able to take a much bigger hit. I have been monitoring CPU performance with MRTG, and the CPU never goes lower than 90% idle. Ironically, during the problem the machine's idle time jumped to 95%, when I would have expected it to drop rather than increase. The only performance tweak I could see that would affect this situation is max_concurrent_checks, but this is already set to 0.
I am fairly new to Nagios (2 months), so I apologise if I have missed something obvious, but any pointers to a solution to this problem would be greatly appreciated. I have run nagios -s (output attached below), which seems to indicate that everything is set up OK. Let me know if you require any more information from my config that would help diagnose the problem.

regards,
Aidan

Nagios 2.8
Copyright (c) 1999-2007 Ethan Galstad (http://www.nagios.org)
Last Modified: 03-08-2007
License: GPL

Projected scheduling information for host and service checks is listed below. This information assumes that you are going to start running Nagios with your current config files.

HOST SCHEDULING INFORMATION
---
Total hosts: 1624
Total scheduled hosts: 0
Host inter-check delay method: SMART
Average host check interval: 0.00 sec
Host inter-check delay: 0.00 sec
Max host check spread: 30 min
First scheduled check: N/A
Last scheduled check: N/A

SERVICE SCHEDULING INFORMATION
---
Total services: 1947
Total scheduled services: 1947
Service inter-check delay method: SMART
Average
Re: [Nagios-users] Severe performance issue during major network outage
On 11/05/07, Aidan Anderson [EMAIL PROTECTED] wrote:
> A lot of people have mentioned using fping to speed things up but if my average service latency is only 0.479 seconds in normal circumstances, I can't see how tweaking this will help in a major outage situation.

check_ping won't finish until it's done all the pings, and the pings are (if I recall) always at one-second intervals. This means that if you've configured check_ping to do (let's say) 5 pings, the check_ping plugin will always take at least 5 seconds to complete. If check_ping is being run as a host check rather than a service check, my understanding is that this is the only thing Nagios will be doing; it doesn't do anything else concurrently (correct me if I'm wrong, people).

In normal operation, Nagios will rarely do a host check, as it usually only bothers to if all of the service checks (which can run concurrently) for that host have failed. When lots of hosts go down at once, you suddenly notice how bad it is to have such slow host checks.

check_icmp or check_fping typically complete a whole lot quicker than check_ping. This is because (if I recall correctly) they will finish and return an OK status as soon as they receive the first ping response, rather than bothering to do all 5 of them.

My Nagios system used to crawl even if only half a dozen hosts were down, until I changed check_ping to check_fping (and now I use check_icmp, but I can't remember if it's any better than check_fping or not).

hth,
Jim

-
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
___
Nagios-users mailing list
Nagios-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting any issue.
::: Messages without supporting info will risk being sent to /dev/null
Re: [Nagios-users] Severe performance issue during major network outage
On 11 May 2007, at 19:03, Jim Avery wrote:
> On 11/05/07, Aidan Anderson [EMAIL PROTECTED] wrote:
>> A lot of people have mentioned using fping to speed things up but if my average service latency is only 0.479 seconds in normal circumstances, I can't see how tweaking this will help in a major outage situation.
>
> check_ping won't finish until it's done all the pings, and the pings are (if I recall) always at one second intervals. This means that if you've configured check_ping to do (let's say) 5 pings, the check_ping plugin will always take at least 5 seconds to complete. If the check_ping is being run as a host check rather than a service check, my understanding is that this is the only thing Nagios will be doing; it doesn't do anything else concurrently (correct me if I'm wrong people).

Correct. We noticed this some time ago too:
http://altinity.blogs.com/dotorg/2006/05/immediate_perfo.html

If you do stick to using check_ping, use -p 1, which gives a sub-second response time.

> In normal operation, nagios will rarely do a host check, as it only usually bothers to if all of the service checks (which can run concurrently) for that host have failed. When lots of hosts go down at once, you suddenly notice how bad it is to have such slow host checks.

Nagios 3 will do parallelised host checks, so there will not be a slowdown there. Also, Ethan said in his presentation at the Netways conference last year that some of the host unreachable logic was not quite right:
http://www.netways.de/uploads/media/Ethan.Galstad_Nagios.3.and.Beyond.pdf

This should be fixed in Nagios 3.

Ton

http://www.altinity.com
T: +44 (0)870 787 9243
F: +44 (0)845 280 1725
Skype: tonvoon
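To put rough numbers on the point confirmed above (host checks in Nagios 2 run one at a time, while service checks run in parallel), here is a minimal Python sketch. It is a back-of-the-envelope model, not Nagios code: the function names are made up for illustration, and it assumes every check takes the same time and that nothing else runs while a host check is in progress.

```python
def serial_blocking_seconds(hosts_down: int, seconds_per_check: float) -> float:
    """Total time spent in host checks when they run strictly one after another."""
    return hosts_down * seconds_per_check


def parallel_wall_seconds(checks: int, seconds_per_check: float, workers: int) -> float:
    """Wall-clock time for the same checks spread evenly over `workers` parallel slots."""
    batches = -(-checks // workers)  # ceiling division
    return batches * seconds_per_check


if __name__ == "__main__":
    # Jim's example: 5 pings at roughly one-second intervals, half a dozen hosts down.
    print(serial_blocking_seconds(6, 5.0))    # 30.0
    print(parallel_wall_seconds(6, 5.0, 6))   # 5.0
```

In this simplified model, six slow host checks in series hold everything else up for about half a minute, whereas the same six checks run side by side would cost only as long as one of them.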
Re: [Nagios-users] Severe performance issue during major network outage
Ton Voon wrote:
> On 11 May 2007, at 19:03, Jim Avery wrote:
>> On 11/05/07, Aidan Anderson [EMAIL PROTECTED] wrote:
>>> A lot of people have mentioned using fping to speed things up but if my average service latency is only 0.479 seconds in normal circumstances, I can't see how tweaking this will help in a major outage situation.
>>
>> check_ping won't finish until it's done all the pings, and the pings are (if I recall) always at one second intervals. This means that if you've configured check_ping to do (let's say) 5 pings, the check_ping plugin will always take at least 5 seconds to complete. If the check_ping is being run as a host check rather than a service check, my understanding is that this is the only thing Nagios will be doing; it doesn't do anything else concurrently (correct me if I'm wrong people).
>
> Correct. We noticed this some time ago too:
> http://altinity.blogs.com/dotorg/2006/05/immediate_perfo.html
>
> If you do stick to using check_ping, use -p 1 which is sub second response time.

First of all, thank you for the replies!

The majority of devices that I monitor are routers/VPN devices, and I have (on the documentation's advice) not set active checks on the hosts. Instead, I've added check_ping as a service on each of these hosts to do 5 pings, as follows:

check_ping -H $HOSTADDRESS$ -w $ARG1$ -c $ARG2$ -p 5

For the host check I already use, as you suggested, a check_ping that only does one ping, as follows:

check_ping -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 1

My understanding was that if the service check failed, Nagios would then abandon the service check altogether and move on to the host check, which is only 1 ping. Since the service checks are parallelised, it shouldn't matter that they send 5 pings, and keeping the host check to 1 ping should relieve the bottleneck of serialised host checks. I'm at a loss as to why performance has been impacted so severely.

Maybe I need to abandon the service checks altogether and just have a host check. I'm reluctant to do this because I get very useful information from 5 pings, i.e. packet loss and high RTA, which is particularly handy for checking volatile links such as ADSL. Maybe that is the trade-off: fast host checking with no useful stats, or slow host checking with useful stats.

regards,
Aidan
Re: [Nagios-users] Severe performance issue during major network outage
On 11 May 2007, at 20:25, Aidan Anderson wrote:
> First of all, thank-you for the replies! The majority of devices that I monitor are routers/vpn devices and I have (on the documentation's advice) not set active checks on the hosts and instead I've added check_ping as a service on each of these hosts to do 5 pings as follows:
>
> check_ping -H $HOSTADDRESS$ -w $ARG1$ -c $ARG2$ -p 5
>
> For the host check I already use as you suggested a check_ping that only does one ping as follows:
>
> check_ping -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 1
>
> My understanding was that if the service check failed it would then abandon the service check altogether and move onto the host check which is only 1 ping. The fact that the service checks are parallelised should mean that it shouldn't matter that there are 5 pings and the host check is only 1 ping which should resolve the bottleneck of serialised host checks. I'm at a loss as to why performance has been impacted so severely.
>
> Maybe I need to abandon the service checks altogether and just have a host check. I'm reluctant to do this because I get very useful information from 5 pings, ie packet loss and high rta which is particularly handy for checking volatile links such as ADSL. Maybe that is the trade-off, fast host checking with no useful stats or slow host checking with useful stats.

Just noticed this in your original email:

Host Check Execution Time: 0.03 / 10.04 / 0.843 sec

This means that some of your host checks are taking 10 seconds, which is, funnily enough, the timeout period for check_ping. So the -p 1 check will still take 10 seconds if the routers are not responding.

You can use a timeout flag for check_ping (but it is only supported on some OSes). I guess check_icmp is a better bet here.

Ton

http://www.altinity.com
T: +44 (0)870 787 9243
F: +44 (0)845 280 1725
Skype: tonvoon
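A rough worked example of why the 10-second timeout matters, using figures from this thread: about 360 hosts down, a 10-second check_ping timeout, and a host max_check_attempts of 2. This Python sketch is only an estimate under assumptions that do not come from Nagios itself: host checks run serially, a down host always costs the full timeout, and a responding host costs its round-trip time. The function names and the 0.05 s round-trip figure are hypothetical.

```python
def host_check_seconds(responding: bool, rtt: float, timeout: float) -> float:
    """One host check: quick if the host answers, the full timeout if it does not."""
    return rtt if responding else timeout


def serial_outage_seconds(hosts_down: int, timeout: float, attempts: int) -> float:
    """Time to work through every down host serially, `attempts` checks each."""
    return hosts_down * attempts * host_check_seconds(False, 0.0, timeout)


if __name__ == "__main__":
    print(host_check_seconds(True, 0.05, 10.0))   # 0.05
    one_pass = serial_outage_seconds(360, 10.0, 1)
    two_pass = serial_outage_seconds(360, 10.0, 2)
    print(one_pass / 60, two_pass / 60)            # 60.0 120.0
```

Under these assumptions, even a single pass over the down hosts takes about an hour at 10 seconds each, which is in the same range as the delay described in the original post. A check that gave up after, say, one second would cut the same pass to about six minutes.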
Re: [Nagios-users] Severe performance issue during major network outage
Ton Voon wrote:
> On 11 May 2007, at 20:25, Aidan Anderson wrote:
>> First of all, thank-you for the replies! The majority of devices that I monitor are routers/vpn devices and I have (on the documentation's advice) not set active checks on the hosts and instead I've added check_ping as a service on each of these hosts to do 5 pings as follows:
>>
>> check_ping -H $HOSTADDRESS$ -w $ARG1$ -c $ARG2$ -p 5
>>
>> For the host check I already use as you suggested a check_ping that only does one ping as follows:
>>
>> check_ping -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 1
>>
>> My understanding was that if the service check failed it would then abandon the service check altogether and move onto the host check which is only 1 ping. The fact that the service checks are parallelised should mean that it shouldn't matter that there are 5 pings and the host check is only 1 ping which should resolve the bottleneck of serialised host checks. I'm at a loss as to why performance has been impacted so severely.
>>
>> Maybe I need to abandon the service checks altogether and just have a host check. I'm reluctant to do this because I get very useful information from 5 pings, ie packet loss and high rta which is particularly handy for checking volatile links such as ADSL. Maybe that is the trade-off, fast host checking with no useful stats or slow host checking with useful stats.
>
> Just noticed this in your original email:
>
> Host Check Execution Time: 0.03 / 10.04 / 0.843 sec
>
> This means that some of your host checks are taking 10 seconds, which is, funnily enough, the timeout period for check_ping. So the -p 1 will still take 10 seconds if the routers are not responding.
>
> You can use a timeout flag for check_ping (but is only supported on some OSes). I guess check_icmp is a better bet here.
>
> Ton

Hi Ton,

Well spotted, thank you. check_icmp here we come :)

thanks
Aidan