Re: [Nagios-users] Performance issues, too
Hi!

On Tue, 02 Jan 2007, Daniel Meyer wrote:
> Program Running Time: 10d 21h 22m 42s
>
> So, for almost eleven days nagios runs smoothly now, no more
> latency problems. I'll try it again with EPN (but still without
> perlcache) now.

I've finally gotten around to recompiling Nagios without EPN and
without the Perl cache. As you can see on these graphs:

http://eric.schwarzvogel.de/~klausman/nagios-perf-3/

(especially
http://eric.schwarzvogel.de/~klausman/nagios-perf-3/latencies.png )

it didn't help much. While the curve now has a flatter slope and even
goes down in spots, it still seems to increase on the whole. Even if
it stayed at the level we saw last night (~100s check latency), I
wouldn't be too happy. With a 300s check interval, 100s of latency is
just too much (IMHO).

What's left is enabling the Perl cache again (yet keeping EPN off).
I'm not terribly hopeful that that will help, but I'm running out of
ideas quickly.

Also note that switching *off* EPN/PC led to *less* CPU usage.
Strange, isn't it?

Regards,
Tobias

--
Never touch a burning system.

-
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys - and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
___
Nagios-users mailing list
Nagios-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting any issue.
::: Messages without supporting info will risk being sent to /dev/null
Re: [Nagios-users] Performance issues, too
Robert Hajime Lanning wrote:
>> Just rechecked. After 72 hours nagios still runs perfectly
>> with an average service check latency of 0.3 seconds, max.
>> 0.9 seconds.
>>
>> Memory usage is perfectly "flat" now, with epn and perlcache
>> it went from 140 mb (whole system) to about 900 mb within 24h.
>>
>> The average system load is a bit _lower_ than before, but some
>> peaks higher than with epn/perlcache.
>>
>> I'll try pure epn without perlcache first thing in january.
>
> The main reason for me to use ePN with perlcache, is to get
> around the huge load of loading all the MIBs for each SNMP
> query. (Since 90% of my services are SNMP queries.) I was
> looking for a way to load the MIB tree once, and found I could
> do it in p1.pl.

If you use numeric SNMP OIDs rather than their "human-readable" MIB
names, you don't need to load a single MIB. It is indeed a much
simpler solution.

> For traps, I run snmptrapd (from net-snmp) and have just recently
> found it has a memory leak. Over the course of 20 days, it grew
> from 5MB to 140MB. It runs snmptthandler, which is actually a C
> program (I ported the Perl version to reduce the load during trap
> floods).
>
> snmptt has a big memory leak. I restart it every 6 hours.
>
> This seems to be pointing to the net-snmp libraries.
>
> Though, I don't get why it would really affect the nagios master
> process. Since all the calls to the SNMP module are run in a
> subprocess, other than the initialization that I put into p1.pl.
> Unless p1.pl is executed more than once.

    strace -e open ./nagios 2>&1 | grep p1.pl

will tell you, although strace might not be included in the toolbox
shipped with your system.

> Back when I had about 200 service checks, my load was about 1.5.
> Then I enabled ePN with perlcache and stuck in the "use SNMP"
> with the preload of the MIBs. Load went down to 0.3. But, as
> I added services, most SNMP, this issue showed up.

Try without perlcache, and try with OIDs and without your p1.pl hack.
--
Andreas Ericsson                   [EMAIL PROTECTED]
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231
Re: [Nagios-users] Performance issues, too
Robert Hajime Lanning wrote:
> I have also been having performance issues with Nagios 2.5 on
> a Sun E220R with two 400MHz procs and 1GB ram.
>
> Sys stats are at http://lanning.cc/kipper.html
>
> The large dips in load and system CPU time are when I restart
> Nagios. (cron'd twice a week, but I have also been making
> a lot of service updates lately, hence the almost once-a-day
> restarts.) For the restarts to fix the latency, I have
> "use_retained_scheduling_info=0".
>
> After about three days the Service Check latency will grow
> to over 300 seconds. It is usually steady at around 0-5
> seconds, for a couple of days, then it will rise over the
> course of a few hours to over the 300-second mark.

This is a bit bizarre and simply must be related to something else.
Does Nagios run out of command buffer slots? Aren't they freed
properly?

> I have noticed that Nagios seems to have a memory leak, as I
> have watched the process grow from 124M to 126M over the last
> hour.

This can probably be attributed to the fact that Nagios fork()'s,
then frees and allocates memory before running execve() in a thread.
This isn't per se prohibited, but strongly discouraged. I wouldn't be
surprised to find that other applications that do the same thing will
leak memory on Sun.

On Linux, threads are created in a 1-1 fashion (meaning each thread
is actually its own process). This holds true for some other systems
as well, and afaik there are 1-1 thread implementations for Sun too.
In any case, the 1-1 thing means that the kernel cleans up any
left-over memory for the processes when they exit, which isn't
necessarily the case in a 1-many thread implementation. Possibly
worth investigating.

> I use ePN with caching.
> Most of my checks are SNMP requests
> via ePN scripts (http://lanning.cc/custom_plugins/), with
> p1.pl modified with:
>
>     use SNMP 5.0;
>     SNMP::loadModules("ALL");

Forgive a novice, but doesn't this make it load all SNMP submodules
each time it runs a perl module? That would certainly be a major
impact on load and could well lead to memory leaks (assuming the
submodules aren't always freed after having been loaded).

> We have put into our budget to move Nagios to a Linux/Intel
> server. But, what bugs me is the high CPU time in kernel
> space, because of Nagios.

Again, this is behaviour not regularly experienced on Linux (which is
the base for most Nagios installations). Linux is simply very, very
good at fork(). It doesn't even bother trying to do other things
(like 1-many threading) properly, simply because it's so damn good at
forking.

It would be interesting to see if your problems go away when you move
to Linux. I'm not saying it's superior to Solaris, but afaiu, Ethan
runs all his tests on Linux and would certainly have found bugs of
this kind if they had bitten him.

--
Andreas Ericsson                   [EMAIL PROTECTED]
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231
Re: [Nagios-users] Performance issues, too
Hi there, and happy new year :-)

Program Running Time: 10d 21h 22m 42s

So, for almost eleven days nagios runs smoothly now, no more latency
problems. I'll try it again with EPN (but still without perlcache)
now.

Danny

--
Q: Gentoo is too hard to install   = http://www.cyberdelia.de
   and I feel like whining.        = [EMAIL PROTECTED]
A: Please see /dev/null.           =
   (from the gentoo installer FAQ) = \o/
Re: [Nagios-users] Performance issues, too
> On Mon, 25 Dec 2006, Robert Hajime Lanning wrote:
>> I have a few that use the output of the last check to see
>> differences in accumulators and the like. And I see that
>> the caching code caches a parsed version of the arguments.
>> This caching has no expirations, just appending the new
>> argument list.
>
> That might explain memory consumption, though one has to wonder
> if a linear increase is fast enough to explain it. If the
> arguments get *doubled* everytime, though...

Ok, I have done two things:

1) Removed caching of arguments (it now parses arguments every time,
   in the child process).

2) Modified all my perl-based checks to use "my" instead of
   "use vars". This scopes the variables to the package that the
   service check is encapsulated in.

So, now my load seems to have dropped to about 1.5 from 2.2. The CPU
time in kernel space no longer grows in that curve fashion seen in
the graphs posted earlier. It still grows, but now more linearly and
at a slower pace.

I started a cron job, every five minutes, that logs the size of the
master Nagios process in kilobytes (pagesize is 8k). The drop from
14920 to 11920 was the cron'd restart of Nagios.

[Thu Dec 28 00:00:01 UTC 2006] 14920
[Thu Dec 28 00:05:00 UTC 2006] 11920
[Thu Dec 28 00:10:00 UTC 2006] 11960
[Thu Dec 28 00:15:00 UTC 2006] 11992
[Thu Dec 28 00:20:00 UTC 2006] 12000
[Thu Dec 28 00:25:00 UTC 2006] 12000
[Thu Dec 28 00:30:00 UTC 2006] 12000
[Thu Dec 28 00:35:00 UTC 2006] 12024
[Thu Dec 28 00:40:00 UTC 2006] 12032
[Thu Dec 28 00:45:00 UTC 2006] 12048
[Thu Dec 28 00:50:00 UTC 2006] 12056
[Thu Dec 28 00:55:00 UTC 2006] 12072

The service check cache code is implemented in p1.pl, so I have been
looking at it closely. Once the check is compiled, there is a really
short code path in p1.pl for the Nagios master process. So, I think
the leak is more in ePN than in perlcache.

--
And, did Galoka think the Ulus were too ugly to save?
                                                     -Centauri
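The five-minute size logger described above could look roughly like
this (a sketch only; the original cron script wasn't posted, and the
`log_proc_size` name and the use of RSS from ps are assumptions):

```shell
#!/bin/sh
# Print "[timestamp] RSS-in-KB" for a given pid, matching the log
# format shown above.
log_proc_size() {
    echo "[$(date -u)] $(ps -o rss= -p "$1" | tr -d ' ')"
}

# Demo: log the current shell's own size. From cron you'd pass the
# Nagios master pid instead (e.g. read from its lock file) and
# append the output to a log file.
log_proc_size $$
```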
Re: [Nagios-users] Performance issues, too
Hi!

On Mon, 25 Dec 2006, Robert Hajime Lanning wrote:
>> I think the two issues are independent (or at most correlated).
>> If switching off EPN/perlcache fixes the issues for me, too, I'd
>> guess it's either the embedded Perl or the cache. Finding out
>> which is a matter of simple experimentation. I hope :)
>
> Do any of your checks have arguments that change?

No, I don't think so. If there's no implicit carry-over in a plugin,
we don't do that at all.

> I have a few that use the output of the last check to see
> differences in accumulators and the like. And I see that
> the caching code caches a parsed version of the arguments.
> This caching has no expirations, just appending the new
> argument list.

That might explain memory consumption, though one has to wonder if a
linear increase is fast enough to explain it. If the arguments get
*doubled* everytime, though...

> I am trying to comment out the caching of arguments and have
> the arguments parsed each time.

Good luck.

>> Merry christmas to the lot of you, btw.
>>
>> Regards,
>> Tobias
>> (away from work and Nagios 'til January 8th)
>
> Merry Christmas, and I am too much a geek to leave this be,
> until January. :) (Have to tinker...)

Oh, I do have my own private projects I can tinker with :)

Regards,
Tobias

--
Never touch a burning system.
Re: [Nagios-users] Performance issues, too
> I'm not using a single SNMP check, and I have the very same
> problem: so I'd say no.

Ok, separate issues... :)

> I think the two issues are independent (or at most correlated).
> If switching off EPN/perlcache fixes the issues for me, too, I'd
> guess it's either the embedded Perl or the cache. Finding out
> which is a matter of simple experimentation. I hope :)

Do any of your checks have arguments that change? I have a few that
use the output of the last check to see differences in accumulators
and the like. And I see that the caching code caches a parsed version
of the arguments. This caching has no expirations, just appending the
new argument list.

I am trying to comment out the caching of arguments and have the
arguments parsed each time.

> Merry christmas to the lot of you, btw.
>
> Regards,
> Tobias
> (away from work and Nagios 'til January 8th)

Merry Christmas, and I am too much a geek to leave this be, until
January. :) (Have to tinker...)

--
And, did Galoka think the Ulus were too ugly to save?
                                                     -Centauri
Re: [Nagios-users] Performance issues, too
Hi!

On Mon, 25 Dec 2006, Robert Hajime Lanning wrote:
>> Just rechecked. After 72 hours nagios still runs perfectly
>> with an average service check latency of 0.3 seconds, max.
>> 0.9 seconds.
>>
>> Memory usage is perfectly "flat" now, with epn and perlcache
>> it went from 140 mb (whole system) to about 900 mb within 24h.
>>
>> The average system load is a bit _lower_ than before, but some
>> peaks higher than with epn/perlcache.
>>
>> I'll try pure epn without perlcache first thing in january.

(pardon my butting in here) I'll do that, too.

> The main reason for me to use ePN with perlcache, is to get
> around the huge load of loading all the MIBs for each SNMP
> query. (Since 90% of my services are SNMP queries.) I was
> looking for a way to load the MIB tree once, and found I could
> do it in p1.pl.
>
> For traps, I run snmptrapd (from net-snmp) and have just recently
> found it has a memory leak. Over the course of 20 days, it grew
> from 5MB to 140MB. It runs snmptthandler, which is actually a C
> program (I ported the Perl version to reduce the load during trap
> floods).
>
> snmptt has a big memory leak. I restart it every 6 hours.
>
> This seems to be pointing to the net-snmp libraries.

I'm not using a single SNMP check, and I have the very same problem:
so I'd say no.

> Though, I don't get why it would really affect the nagios master
> process. Since all the calls to the SNMP module are run in a
> subprocess, other than the initialization that I put into p1.pl.
> Unless p1.pl is executed more than once.
>
> Back when I had about 200 service checks, my load was about 1.5.
> Then I enabled ePN with perlcache and stuck in the "use SNMP"
> with the preload of the MIBs. Load went down to 0.3. But, as
> I added services, most SNMP, this issue showed up.

I think the two issues are independent (or at most correlated). If
switching off EPN/perlcache fixes the issues for me, too, I'd guess
it's either the embedded Perl or the cache.
Finding out which is a matter of simple experimentation. I hope :)

Merry christmas to the lot of you, btw.

Regards,
Tobias
(away from work and Nagios 'til January 8th)

--
Never touch a burning system.
Re: [Nagios-users] Performance issues, too
> Just rechecked. After 72 hours nagios still runs perfectly
> with an average service check latency of 0.3 seconds, max.
> 0.9 seconds.
>
> Memory usage is perfectly "flat" now, with epn and perlcache
> it went from 140 mb (whole system) to about 900 mb within 24h.
>
> The average system load is a bit _lower_ than before, but some
> peaks higher than with epn/perlcache.
>
> I'll try pure epn without perlcache first thing in january.

The main reason for me to use ePN with perlcache is to get around the
huge load of loading all the MIBs for each SNMP query. (Since 90% of
my services are SNMP queries.) I was looking for a way to load the
MIB tree once, and found I could do it in p1.pl.

For traps, I run snmptrapd (from net-snmp) and have just recently
found it has a memory leak. Over the course of 20 days, it grew from
5MB to 140MB. It runs snmptthandler, which is actually a C program (I
ported the Perl version to reduce the load during trap floods).

snmptt has a big memory leak. I restart it every 6 hours.

This seems to be pointing to the net-snmp libraries.

Though, I don't get why it would really affect the nagios master
process, since all the calls to the SNMP module are run in a
subprocess, other than the initialization that I put into p1.pl.
Unless p1.pl is executed more than once.

Back when I had about 200 service checks, my load was about 1.5. Then
I enabled ePN with perlcache and stuck in the "use SNMP" with the
preload of the MIBs. Load went down to 0.3. But, as I added services,
most SNMP, this issue showed up.

--
And, did Galoka think the Ulus were too ugly to save?
                                                     -Centauri
Re: [Nagios-users] Performance issues, too
On Sun, 24 Dec 2006, Joerg Linge wrote:
>> I have watched over the last hour the process grow from 124M
>> to 126M.
>>
>> I use ePN with caching. Most of my checks are SNMP requests
>> via ePN scripts (http://lanning.cc/custom_plugins/), with
>> p1.pl modified with:
>>
>>     use SNMP 5.0;
>>     SNMP::loadModules("ALL");
>
> This sounds like Daniel's problem.

Indeed.

> Two days ago we compiled nagios without epn and perl cache.
> For now, Nagios runs with a latency of 0.3 secs.

Just rechecked. After 72 hours nagios still runs perfectly with an
average service check latency of 0.3 seconds, max. 0.9 seconds.

Memory usage is perfectly "flat" now; with epn and perlcache it went
from 140 mb (whole system) to about 900 mb within 24h.

The average system load is a bit _lower_ than before, but some peaks
higher than with epn/perlcache.

I'll try pure epn without perlcache first thing in january.

Danny

--
Q: Gentoo is too hard to install   = http://www.cyberdelia.de
   and I feel like whining.        = [EMAIL PROTECTED]
A: Please see /dev/null.           =
   (from the gentoo installer FAQ) = \o/
Re: [Nagios-users] Performance issues, too
On Sunday, 24 December 2006 at 11:35, Robert Hajime Lanning wrote:
> I have also been having performance issues with Nagios 2.5 on
> a Sun E220R with two 400MHz procs and 1GB ram.
[...]
> I have noticed that Nagios seems to have a memory leak, as I
> have watched the process grow from 124M to 126M over the last
> hour.
>
> I use ePN with caching. Most of my checks are SNMP requests
> via ePN scripts (http://lanning.cc/custom_plugins/), with
> p1.pl modified with:
>
>     use SNMP 5.0;
>     SNMP::loadModules("ALL");

This sounds like Daniel's problem.

Two days ago we compiled nagios without epn and perl cache. For now,
Nagios runs with a latency of 0.3 secs.

Jörg
Re: [Nagios-users] Performance issues, too
I have also been having performance issues with Nagios 2.5 on a Sun
E220R with two 400MHz procs and 1GB ram.

Sys stats are at http://lanning.cc/kipper.html

The large dips in load and system CPU time are when I restart Nagios.
(cron'd twice a week, but I have also been making a lot of service
updates lately, hence the almost once-a-day restarts.) For the
restarts to fix the latency, I have "use_retained_scheduling_info=0".

After about three days the Service Check latency will grow to over
300 seconds. It is usually steady at around 0-5 seconds for a couple
of days, then it will rise over the course of a few hours to over the
300-second mark.

My biggest issue with this is the fact that RRDTool does not like the
data points to be that far out of the expected time intervals and
will toss the data point.

I have noticed that Nagios seems to have a memory leak, as I have
watched the process grow from 124M to 126M over the last hour.

I use ePN with caching. Most of my checks are SNMP requests via ePN
scripts (http://lanning.cc/custom_plugins/), with p1.pl modified
with:

    use SNMP 5.0;
    SNMP::loadModules("ALL");

We have put into our budget to move Nagios to a Linux/Intel server.
But what bugs me is the high CPU time in kernel space, because of
Nagios.

---
$ nagios -s etc/nagios.cfg

Nagios 2.5
Copyright (c) 1999-2006 Ethan Galstad (http://www.nagios.org)
Last Modified: 07-13-2006
License: GPL

Projected scheduling information for host and service checks is
listed below. This information assumes that you are going to start
running Nagios with your current config files.
HOST SCHEDULING INFORMATION
---
Total hosts:                       83
Total scheduled hosts:             0
Host inter-check delay method:     SMART
Average host check interval:       0.00 sec
Host inter-check delay:            0.00 sec
Max host check spread:             3 min
First scheduled check:             N/A
Last scheduled check:              N/A

SERVICE SCHEDULING INFORMATION
---
Total services:                    693
Total scheduled services:          693
Service inter-check delay method:  SMART
Average service check interval:    192.12 sec
Inter-check delay:                 0.26 sec
Interleave factor method:          SMART
Average services per host:         8.35
Service interleave factor:         9
Max service check spread:          3 min
First scheduled check:             Sun Dec 24 10:02:16 2006
Last scheduled check:              Sun Dec 24 10:05:15 2006

CHECK PROCESSING INFORMATION
Service check reaper interval:     5 sec
Max concurrent service checks:     Unlimited

PERFORMANCE SUGGESTIONS
---
I have no suggestions - things look okay.

--
And, did Galoka think the Ulus were too ugly to save?
                                                     -Centauri
Re: [Nagios-users] Performance issues, too
Hi!

On Thu, 21 Dec 2006, Daniel Meyer wrote:
>> I have the suspicion that our check latency might converge on 419
>> seconds - but I'd rather not test it, we'd be well beyond the
>> 300s-interval most of our checks are designed for.
>
> Why do you think of exactly 419 seconds?
>
> And btw, if our problems are related the latency won't stop at
> that number :)

Because that's the new average service check interval as reported by
-s. Yes, I'm out on a limb there.

Regards,
Tobias

--
Never touch a burning system.
Re: [Nagios-users] Performance issues, too
On Thu, 21 Dec 2006, Tobias Klausmann wrote:
> I have the suspicion that our check latency might converge on 419
> seconds - but I'd rather not test it, we'd be well beyond the
> 300s-interval most of our checks are designed for.

Why do you think of exactly 419 seconds?

And btw, if our problems are related the latency won't stop at that
number :)

Danny

--
Q: Gentoo is too hard to install   = http://www.cyberdelia.de
   and I feel like whining.        = [EMAIL PROTECTED]
A: Please see /dev/null.           =
   (from the gentoo installer FAQ) = \o/
Re: [Nagios-users] Performance issues, too
Hi!

On Tue, 19 Dec 2006, Andreas Ericsson wrote:
> >>> SERVICE SCHEDULING INFORMATION
> >>> ---
> >>> Total services: 2836
> >>> Total scheduled services: 2836
> >>> Service inter-check delay method: SMART
> >>> Average service check interval: 2225.56 sec
> >>
> >> This is, as you point out below, quite odd. What's your _longest_
> >> normal_check_interval for services?
> >
> > The longest check_interval is 86400 seconds. It's an SSL cert
> > freshness check. I figured it wasn't necessary to check that
> > more often than once a day. I also have check_intervals of 3, 5,
> > 15, 20, 30 and 1440 seconds. The latter is also a cert freshness
> > check which is lower because the customer wanted it to be that
> > short.
>
> Try changing the really long intervals to something shorter or
> commenting them out completely and see what happens. Checking a
> certificate is not a particularly heavy operation so it doesn't
> matter much if you run it every 5 minutes. On the server side it
> just gets handed out from cache, so it's not heavy there either.

Actually, I was horribly wrong with that statement up there. As it
turned out, the check_interval was set to 86400. From that I jumped
to the conclusion "ah, one day" - familiar numbers do that to you.
But the base unit of check_interval isn't one second, it's one
minute. So the check_interval was 60 days.

Fortunately, it was only one such check, which we quickly eliminated
before producing the second set of graphs I mentioned elsewhere in
the thread. Now the longest check_interval truly is one day, 1440
minutes. The average service check interval reported by -s is now 419
seconds. Still not terribly short, but it proves that the
86400-minute monster was to blame for the 2200+ seconds.

Changing those once-a-day checks to 5 minutes is an option, but I'd
rather wait a little to give everybody on the list some time to look
at the graphs and come up with nifty ideas.
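The unit slip above is easy to make; the arithmetic (a sketch,
nothing Nagios-specific beyond the default interval_length of 60
seconds) is simply:

```shell
# check_interval is expressed in "time units" (interval_length,
# default 60 seconds), not in seconds. So check_interval=86400 means:
interval=86400          # configured value, in units
interval_length=60      # seconds per unit (the Nagios default)
days=$(( interval * interval_length / 86400 ))
echo "$days"            # days between checks: 60, not 1
```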
I have the suspicion that our check latency might converge on 419
seconds - but I'd rather not test it; we'd be well beyond the
300s interval most of our checks are designed for.

> > Oops, forgot to mention that. Yes, a server farm is being rebuilt
> > currently. As I didn't want all the host check timeouts to make
> > matters much, much worse, I disabled them entirely.
>
> Ah, that explains it then. It shouldn't matter, but unless the
> experiment I suggested above turns up anything useful, would you
> mind commenting them out and testing that?

I'll do that if removing the day-spaced checks doesn't help.

Regards & Thanks,
Tobias

--
Never touch a burning system.
Re: [Nagios-users] Performance issues, too
Hi!

On Thu, 21 Dec 2006, Daniel Meyer wrote:
> - it is not triggered by any other software on the server
>   (nagios and apache are the only things running there)

ACK.

> - its not triggered by hourly, daily or weekly cronjobs

With a lot of guessing and estimating, I can make a case for a
slight "plateau" right after the hour, with an increase in the second
half of the hour. Might be completely bogus, though.

> - the big service check latency goes away instantly after a restart
>   of nagios

ACK.

> - the latency skyrockets after "some time", its not like "six hours
>   after the restart" or something like that

Well, not so much skyrocketing as steadily creeping up. See the
images I reference below.

> - service check execution time does NOT change at all, it stays on
>   the same level all the time

NACK. For me, it starts out at some low-two-digit-ms time, then
creeps up to 165.000ms (yes, exactly that value). As far as I can
tell, it stays there forever.

> - changing from a dummy host check to "adaptive" host checks back
>   and forth doesn't make a difference

We didn't try that.

> - i see memory usage rise proportional to the latency, but there is
>   way enough free memory left (this morning it was 150 seconds
>   latency but still 790 Megs free ram, plus one gig cached)

Same (with slightly different figures) here.

> - load on the system rises a little but not much

It's measurable, but definitely not maxed out. Same goes for CPU
utilization (which is something different).

> - network usage goes down (well there are less checks done due to
>   the latency, so no surprise here)

We haven't checked that, but as network traffic (both volume and
packet rate) wasn't near any limit, we didn't feel it was necessary.
Here are a few graphs we created for yesterday and the day before that: http://eric.schwarzvogel.de/~klausman/nagios-perf-1/ and here are the pics of today and yesterday afternoon: http://eric.schwarzvogel.de/~klausman/nagios-perf-2/ For all graphs, check frequency was every 2 minutes. For the older set, a SNAFU on my part when setting up the RRDs resulted in reduced resolution. That was fixed with the second set. "Queue size" is calculated the following way: look at all objects in the state file (retention.dat, saved every 20s). Every object with a check time in the past counts as one queue entry. "Slots"/"Checks completed" is what nagiostats reports as # of checks completed in the four timeframes. Things I noted: Queue size oscillates wildly. This might be due to my methodology. Still, one can read a trend from that curve. Check execution time converges at 106ms. On the spot. I have no idea why. Load average and CPU idleness indicate that we don't have a host performance problem (I also looked over but did not plot stuff like interrupt rate and context switches; nothing overly high there). For the older graphs, check latency doesn't budge at all for some time (or it's too little to see). For the newer graph, the initial rise is rather steep, then the increase slows down a bit. Still, over the course of hours, it seems linear and shows no sign of converging. If anybody is interested in the RRD files used to generate the graphs, drop me a line. The picture all of this paints is rather inconclusive. We've found an oddity in our config I'll relate in another mail (a check interval of 86400 minutes, that's two months). We have eliminated that for the newer graphs, however. In conclusion, I'm at a loss as to why this slow deterioration of check performance happens. A colleague of mine is looking at the Nagios scheduling code (he thinks the description of the algorithm in the docs is rather strange). He hasn't reported back yet, though. 
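The queue-size methodology described above (count every retention.dat object whose scheduled check time is already in the past) can be sketched roughly like this. Note that the block layout and the `next_check` field name are assumptions for illustration only; the real retention.dat format may differ:

```python
import time

# Hypothetical snippet mimicking Nagios retention.dat service blocks;
# the real file's exact field names and layout may differ (assumption).
SAMPLE = """\
service {
next_check=100
}
service {
next_check=9999999999
}
service {
next_check=200
}
"""

def queue_size(text, now=None):
    """Count entries whose scheduled check time is already in the past."""
    now = time.time() if now is None else now
    overdue = 0
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("next_check="):
            if int(line.split("=", 1)[1]) <= now:
                overdue += 1
    return overdue

print(queue_size(SAMPLE, now=500))  # prints 2: two entries are overdue
```

Run against each 20-second snapshot of the state file, this yields one data point of the "queue size" curve plotted in the graphs.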
All in all, every hint is appreciated. Regards, Tobias 
Re: [Nagios-users] Performance issues, too
Ok, this is what I noticed on my performance issues during the last days:
- it is not triggered by any other software on the server (nagios and apache are the only things running there)
- it's not triggered by hourly, daily or weekly cronjobs
- the big service check latency goes away instantly after a restart of nagios
- the latency skyrockets after "some time", it's not like "six hours after the restart" or something like that
- service check execution time does NOT change at all, it stays on the same level all the time
- changing from a dummy host check to "adaptive" host checks back and forth doesn't make a difference
- I see memory usage rise proportional to the latency, but there is way enough free memory left (this morning it was 150 seconds latency but still 790 Megs free ram, plus one gig cached)
- load on the system rises a little but not much
- network usage goes down (well, there are fewer checks done due to the latency, so no surprise here)
Details of my setup can be found in the "big performance issue..."-thread; if needed I can repost them here... Danny -- Q: Gentoo is too hard to install =http://www.cyberdelia.de and I feel like whining. = [EMAIL PROTECTED] A: Please see /dev/null. = (from the gentoo installer FAQ) = \o/ 
Re: [Nagios-users] Performance issues, too
Hi! On Tue, 19 Dec 2006, Andreas Ericsson wrote: > >>> --- > >>> Total services: 2836 > >>> Total scheduled services: 2836 > >>> Service inter-check delay method: SMART > >>> Average service check interval: 2225.56 sec > >> This is, as you point out below, quite odd. What's your _longest_ > >> normal_check_interval for services? > > > > The longest check_interval is 86400 seconds. It's an SSL cert > > freshness check. I figured it wasn't necessary to check that > > more often than once a day. I also have check_intervals of 3, 5, > > 15, 20, 30 and 1440 seconds. The latter is also a cert freshness > > check which is lower because the customer wanted it to be that > > short. > > Try changing the really long intervals to something shorter or > commenting them out completely and see what happens. Checking a > certificate is not a particularly heavy operation so it doesn't matter > much if you run it every 5 minutes. On the server side it just gets > handed out from cache, so it's not heavy there either. > > If you have the various normal_check_interval's specified in templates, > try setting them all to 5 minutes and let Nagios run over-night. If this > interferes with some fragile services on the network (webservers whose > sessions don't expire, fe), disable active checks for those services > during the testing period. > > (yes, this might seem braindead, but I really need to know if this bug > is still in Nagios). I'll do that this afternoon, I'd just like to wait a little more regarding the changes my kernel/cpu-update brings (or doesn't). > >>> *Or* it is indicative of a misconfiguration on my > >>> part. If the latter is the case, I'd be eager, nay ecstatic to > >>> hear what I did wrong. Here are a few of the config vars that > >>> might influence this: > >> There has been a slight thinko in Nagios. I don't know if it's still > >> there in recent CVS versions. The thinko is that it (used to?) 
calculate > >> average service check interval by adding up all normal_check_interval > >> values and dividing it by the number of services configured (or > >> something along those lines), which leads to long latencies. This > >> normally didn't make those latencies increase though. Humm... > > > > Well, the numbers sure do get whacky after a restart: first it > > skyrockets for about five minutes, then plummets to 1s. From > > there it works its way up the way I described. > > Are the first checks of things being scheduled with unreasonably long > delays? Fe, a check with 3 minute normal_check_interval being scheduled > an hour or so into the future. Usually, yes. As I use state retention, I don't believe in the initial numbers all that much. After about 5-10 minutes one can usually make out a trend. Not this time, though. Here's hoping that it keeps oscillating around the 8-9 seconds I currently see. > >>> Total Services: 2836 > >>> Services Checked: 2836 > >>> Services Scheduled: 2758 > >>> Active Service Checks:2836 > >>> Passive Service Checks: 0 > >> All services aren't being scheduled, but you have no passive service > >> checks. Have you disabled checks of 78 services? > > > > Oops, forgot to mention that. Yes, a server farm is being rebuilt > > currently. As I didn't want all the host check timeouts to make > > matters much, much, worse, I disabled them entirely. > > Ah, that explains it then. It shouldn't matter, but unless the > experiment I suggested above turns up anything useful, would you mind > commenting them out and testing that? I was planning to do that tomorrow for the very same reasons. > >>> Hardware is a dual-2.8GHz Xeon, 2G RAM and a 100 FDX interface. > >>> LoadAvg is around 1.6, sometimes gets to 1.9. CPUs are both > >>> around 40% idle most of the time. I see about 300 context > >>> switches and 500 interrupts per second. The network load is > >>> negligible, ditto the packet rate. 
> >>> > >>> The way these figures look I don't see a performance problem per > >>> se, but maybe I have overlooked a metric that describes the > >>> "usual" bottleneck of installations. > >>> > >> Are the CPU's 64 bit ones running in 32-bit emulation mode? For intel > >> cpu's, that causes up to 60% performance loss (yes, it really is that bad). > > > > Sheesh. Yes, it is a 32-bit installation. I only ever bothered > > with 64-bit installs on Opteron hardware. I might look into > > migrating to 64 bits, then. > > > > So the CPU's are 64-bits? Humm... 64-bit mode would boost available > resources quite a bit, but as you just enabled HT you should now have 3 > extra CPU's (Xeon's are dualcore AFAIR) which will probably set you safe > for a while. Colleague just told me that this particular batch wasn't available in 64 bits. So no, they're 32-bit; well, one thing to test out of the way :-/ > >> I'm puzzled. Please let me know
Re: [Nagios-users] Performance issues, too
Hi! On Tue, 19 Dec 2006, Daniel Meyer wrote: > >> You could lower this to 2 seconds. I've done so on any number of > >> installations and it has no negative impact whatsoever, but seems to > >> make Nagios a bit more responsive. > > > > I'll give that a try. > > I've tried that but had some failing checks when I did that. Very > strange... I'm still waiting to see how the kernel change will work out. > > I also noticed that HT was disabled on the machine. I've changed > > that (and added support for it to the kernel) when I did the > > kernel upgrade today. I'll keep an eye on check latency. > > I have HT enabled, no effect on the nagios latency problems. I've now set up a little script that puts host and service check latency in an RRD file every five minutes. So far, the curve looks very inconclusive. Regards, Tobias 
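For anyone wanting to replicate such a latency logger, the extraction step might look like the sketch below. The "Active ... Latency" line format (min / max / avg) is copied from the nagiostats output quoted elsewhere in this thread; the rrdtool command in the trailing comment is a hypothetical example, not Tobias's actual script:

```python
import re

# Canned nagiostats-style output (format as quoted in this thread);
# a real cron job would capture this by running nagiostats itself.
SAMPLE = """Active Service Latency: 0.006 / 10.237 / 0.906 sec
Active Host Latency: 0.000 / 1.000 / 0.888 sec"""

def avg_latency(text, kind):
    """Return the third (average) figure from an 'Active <kind> Latency' line."""
    m = re.search(r"Active %s Latency:\s*[\d.]+ / [\d.]+ / ([\d.]+) sec" % kind, text)
    return float(m.group(1))

svc = avg_latency(SAMPLE, "Service")
host = avg_latency(SAMPLE, "Host")
print("service=%.3f host=%.3f" % (svc, host))  # prints service=0.906 host=0.888
# The two values could then be stored, e.g. (hypothetical RRD layout):
#   rrdtool update latency.rrd N:0.906:0.888
```

Scheduled from cron every five minutes, this produces exactly the kind of latency-over-time curve discussed in this thread.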
Re: [Nagios-users] Performance issues, too
On Tue, 19 Dec 2006, Tobias Klausmann wrote: > I'm running 2.6 now but I had the troubles with 2.5 initially. > OS is a Gentoo Linux, Kernel 2.6.15.5 initially, upgrade to > 2.6.19 today. Same here. Latency problems with both 2.5 and 2.6, but on CentOS 4.4 (good that you use gentoo, saves me the time to try it on a heavily optimized Gentoo box :) >> You could lower this to 2 seconds. I've done so on any number of >> installations and it has no negative impact whatsoever, but seems to >> make Nagios a bit more responsive. > > I'll give that a try. I've tried that but had some failing checks when I did that. Very strange... > I also noticed that HT was disabled on the machine. I've changed > that (and added support for it to the kernel) when I did the > kernel upgrade today. I'll keep an eye on check latency. I have HT enabled, no effect on the nagios latency problems. Danny 
Re: [Nagios-users] Performance issues, too
Tobias Klausmann wrote: > Hi! > > On Tue, 19 Dec 2006, Andreas Ericsson wrote: >> Thanks for an excellently detailed problem report, missing only the >> Nagios version and system type/version info. I've got some comments and >> followup questions. See below. > > I'm running 2.6 now but I had the troubles with 2.5 initially. > OS is a Gentoo Linux, Kernel 2.6.15.5 initially, upgrade to > 2.6.19 today. > >>> --- >>> Total hosts: 330 >>> Total scheduled hosts: 0 >> No scheduled host-checks. That's good, cause they interfere with normal >> operations in Nagios. > > I've read as much. In my separate mail I had a few questions > about it, let's keep them (and the answers there ;) > >>> Host inter-check delay method: SMART >>> Average host check interval: 0.00 sec >>> Host inter-check delay: 0.00 sec >>> Max host check spread: 10 min >>> First scheduled check: N/A >>> Last scheduled check: N/A >>> >>> >>> SERVICE SCHEDULING INFORMATION >>> --- >>> Total services: 2836 >>> Total scheduled services: 2836 >>> Service inter-check delay method: SMART >>> Average service check interval: 2225.56 sec >> This is, as you point out below, quite odd. What's your _longest_ >> normal_check_interval for services? > > The longest check_interval is 86400 seconds. It's an SSL cert > freshness check. I figured it wasn't necessary to check that > more often than once a day. I also have check_intervals of 3, 5, > 15, 20, 30 and 1440 seconds. The latter is also a cert freshness > check which is lower because the customer wanted it to be that > short. > Try changing the really long intervals to something shorter or commenting them out completely and see what happens. Checking a certificate is not a particularly heavy operation so it doesn't matter much if you run it every 5 minutes. On the server side it just gets handed out from cache, so it's not heavy there either. 
If you have the various normal_check_interval's specified in templates, try setting them all to 5 minutes and let Nagios run over-night. If this interferes with some fragile services on the network (webservers whose sessions don't expire, fe), disable active checks for those services during the testing period. (yes, this might seem braindead, but I really need to know if this bug is still in Nagios). > >>> *Or* it is indicative of a misconfiguration on my >>> part. If the latter is the case, I'd be eager, nay ecstatic to >>> hear what I did wrong. Here are a few of the config vars that >>> might influence this: >> There has been a slight thinko in Nagios. I don't know if it's still >> there in recent CVS versions. The thinko is that it (used to?) calculate >> average service check interval by adding up all normal_check_interval >> values and dividing it by the number of services configured (or >> something along those lines), which leads to long latencies. This >> normally didn't make those latencies increase though. Humm... > > Well, the numbers sure do get whacky after a restart: first it > skyrockets for about five minutes, then plummets to 1s. From > there it works its way up the way I described. > Are the first checks of things being scheduled with unreasonably long delays? Fe, a check with 3 minute normal_check_interval being scheduled an hour or so into the future. >>> Total Services: 2836 >>> Services Checked: 2836 >>> Services Scheduled: 2758 >>> Active Service Checks:2836 >>> Passive Service Checks: 0 >> All services aren't being scheduled, but you have no passive service >> checks. Have you disabled checks of 78 services? > > Oops, forgot to mention that. Yes, a server farm is being rebuilt > currently. As I didn't want all the host check timeouts to make > matters much, much, worse, I disabled them entirely. > Ah, that explains it then. 
It shouldn't matter, but unless the experiment I suggested above turns up anything useful, would you mind commenting them out and testing that? >>> Hardware is a dual-2.8GHz Xeon, 2G RAM and a 100 FDX interface. >>> LoadAvg is around 1.6, sometimes gets to 1.9. CPUs are both >>> around 40% idle most of the time. I see about 300 context >>> switches and 500 interrupts per second. The network load is >>> negligible, ditto the packet rate. >>> >>> The way these figures look I don't see a performance problem per >>> se, but maybe I have overlooked a metric that describes the >>> "usual" bottleneck of installations. >>> >> Are the CPU's 64 bit ones running in 32-bit emulation mode? For intel >> cpu's, that causes up to 60% performance loss (yes, it really is that bad). > > Sheesh. Yes, it is a 32-bit installation. I only ever bothered > with 64-bit installs on Opteron hardware. I might look into > migrating to 64 bits, then. > So the CPU'
Re: [Nagios-users] Performance issues, too
Hi! On Tue, 19 Dec 2006, Andreas Ericsson wrote: > Thanks for an excellently detailed problem report, missing only the > Nagios version and system type/version info. I've got some comments and > followup questions. See below. I'm running 2.6 now but I had the troubles with 2.5 initially. OS is a Gentoo Linux, Kernel 2.6.15.5 initially, upgrade to 2.6.19 today. > > --- > > Total hosts: 330 > > Total scheduled hosts: 0 > > No scheduled host-checks. That's good, cause they interfere with normal > operations in Nagios. I've read as much. In my separate mail I had a few questions about it, let's keep them (and the answers there ;) > > Host inter-check delay method: SMART > > Average host check interval: 0.00 sec > > Host inter-check delay: 0.00 sec > > Max host check spread: 10 min > > First scheduled check: N/A > > Last scheduled check: N/A > > > > > > SERVICE SCHEDULING INFORMATION > > --- > > Total services: 2836 > > Total scheduled services: 2836 > > Service inter-check delay method: SMART > > Average service check interval: 2225.56 sec > > This is, as you point out below, quite odd. What's your _longest_ > normal_check_interval for services? The longest check_interval is 86400 seconds. It's an SSL cert freshness check. I figured it wasn't necessary to check that more often than once a day. I also have check_intervals of 3, 5, 15, 20, 30 and 1440 seconds. The latter is also a cert freshness check which is lower because the customer wanted it to be that short. > > CHECK PROCESSING INFORMATION > > > > Service check reaper interval: 5 sec > > You could lower this to 2 seconds. I've done so on any number of > installations and it has no negative impact whatsoever, but seems to > make Nagios a bit more responsive. I'll give that a try. > > Max concurrent service checks: Unlimited > > I assume you aren't running into hardware limits on this machine. > What's the normal load when you're running nagios? 
If it's > NUM_CPUS > then you most likely don't have beefy enough hardware. That's hardly > ever the case though, so don't bother looking into it unless all else fails. > > Nvm, question answered below. Hardware resources should be no problem > whatsoever. I also noticed that HT was disabled on the machine. I've changed that (and added support for it to the kernel) when I did the kernel upgrade today. I'll keep an eye on check latency. > > *Or* it is indicative of a misconfiguration on my > > part. If the latter is the case, I'd be eager, nay ecstatic to > > hear what I did wrong. Here are a few of the config vars that > > might influence this: > > There has been a slight thinko in Nagios. I don't know if it's still > there in recent CVS versions. The thinko is that it (used to?) calculate > average service check interval by adding up all normal_check_interval > values and dividing it by the number of services configured (or > something along those lines), which leads to long latencies. This > normally didn't make those latencies increase though. Humm... Well, the numbers sure do get whacky after a restart: first it skyrockets for about five minutes, then plummets to 1s. From there it works its way up the way I described. > > Total Services: 2836 > > Services Checked: 2836 > > Services Scheduled: 2758 > > Active Service Checks:2836 > > Passive Service Checks: 0 > > All services aren't being scheduled, but you have no passive service > checks. Have you disabled checks of 78 services? Oops, forgot to mention that. Yes, a server farm is being rebuilt currently. As I didn't want all the host check timeouts to make matters much, much, worse, I disabled them entirely. > > Hardware is a dual-2.8GHz Xeon, 2G RAM and a 100 FDX interface. > > LoadAvg is around 1.6, sometimes gets to 1.9. CPUs are both > > around 40% idle most of the time. I see about 300 context > > switches and 500 interrupts per second. The network load is > > negligible, ditto the packet rate. 
> > > > The way these figures look I don't see a performance problem per > > se, but maybe I have overlooked a metric that describes the > > "usual" bottleneck of installations. > > > > Are the CPU's 64 bit ones running in 32-bit emulation mode? For intel > cpu's, that causes up to 60% performance loss (yes, it really is that bad). Sheesh. Yes, it is a 32-bit installation. I only ever bothered with 64-bit installs on Opteron hardware. I might look into migrating to 64 bits, then. > I'm puzzled. Please let me know if you find the answer to this problem. > I'll help you debug it as best I can, but please continue posting > on-list. Thanks. Sure. I'll first check if the "processor upgrade" and kernel update helped anything, then t
Re: [Nagios-users] Performance issues, too
On Tue, 19 Dec 2006, Andreas Ericsson wrote: > Are the CPU's 64 bit ones running in 32-bit emulation mode? For intel > cpu's, that causes up to 60% performance loss (yes, it really is that bad). I can just answer for my setup (which is almost identical, except that I have "only" 1700 service checks so far): my xeon cpus are pure 32 bit stuff... > I'm puzzled. Please let me know if you find the answer to this problem. > I'll help you debug it as best I can, but please continue posting > on-list. Thanks. Me too, I am somewhat out of ideas... Danny 
Re: [Nagios-users] Performance issues, too
Thanks for an excellently detailed problem report, missing only the Nagios version and system type/version info. I've got some comments and followup questions. See below. Tobias Klausmann wrote: > Hi! > > Recently I have run into the very same performance issues > as Daniel Meyer (or so it seems). However, I'm not quite sure > about it. Here's the gist of it. > > Currently, service check latency slowly creeps up. As it is now, > it starts out at a little over 1s and after about 12 hours it's > in the area of about 90s. It keeps climbing after that. > > Here's the output of nagios -s: > > HOST SCHEDULING INFORMATION > --- > Total hosts: 330 > Total scheduled hosts: 0 No scheduled host-checks. That's good, cause they interfere with normal operations in Nagios. > Host inter-check delay method: SMART > Average host check interval: 0.00 sec > Host inter-check delay: 0.00 sec > Max host check spread: 10 min > First scheduled check: N/A > Last scheduled check: N/A > > > SERVICE SCHEDULING INFORMATION > --- > Total services: 2836 > Total scheduled services: 2836 > Service inter-check delay method: SMART > Average service check interval: 2225.56 sec This is, as you point out below, quite odd. What's your _longest_ normal_check_interval for services? > Inter-check delay: 0.21 sec > Interleave factor method: SMART > Average services per host: 8.59 > Service interleave factor: 9 > Max service check spread: 10 min > First scheduled check: Tue Dec 19 11:21:45 2006 > Last scheduled check: Tue Dec 19 11:31:47 2006 > > > CHECK PROCESSING INFORMATION > > Service check reaper interval: 5 sec You could lower this to 2 seconds. I've done so on any number of installations and it has no negative impact whatsoever, but seems to make Nagios a bit more responsive. > Max concurrent service checks: Unlimited > I assume you aren't running into hardware limits on this machine. What's the normal load when you're running nagios? 
If it's > NUM_CPUS then you most likely don't have beefy enough hardware. That's hardly ever the case though, so don't bother looking into it unless all else fails. Nvm, question answered below. Hardware resources should be no problem whatsoever. > > This all looks peachy - I think. What I don't get is this line: > > Average service check interval: 2225.56 sec > > It seems to me that this is either a skewed value, stemming from > my history of looong latencies (at one point we were beyond > 9000 seconds). Nopes. Nagios doesn't bother reading logfiles when it calculates the scheduling numbers. > *Or* it is indicative of a misconfiguration on my > part. If the latter is the case, I'd be eager, nay ecstatic to > hear what I did wrong. Here are a few of the config vars that > might influence this: > There has been a slight thinko in Nagios. I don't know if it's still there in recent CVS versions. The thinko is that it (used to?) calculate average service check interval by adding up all normal_check_interval values and dividing it by the number of services configured (or something along those lines), which leads to long latencies. This normally didn't make those latencies increase though. Humm... 
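The averaging "thinko" described above (sum all normal_check_interval values, divide by the service count) lets a handful of very long intervals dominate the figure. A quick illustration; the numbers below are invented, merely shaped like the setup in this thread:

```python
# Invented figures shaped like this thread's setup: ~2830 services at
# 300s plus a few day-long (86400s) cert checks and one at 1440s.
intervals = [300] * 2830 + [86400] * 5 + [1440]

mean = sum(intervals) / len(intervals)           # the "thinko" average
typical = sorted(intervals)[len(intervals) // 2] # median interval

print("mean=%.2f median=%d" % (mean, typical))
```

Five outliers push the mean to roughly 1.5x the interval that nearly every service actually uses; if the scheduler derives its inter-check delay from such a mean, as suspected here, a few long-interval checks could stretch scheduling for everyone.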
> sleep_time=0.25 > service_reaper_frequency=5 > max_concurrent_checks=0 > max_host_check_spread=10 > host_inter_check_delay_method=s > service_interleave_factor=s > command_check_interval=1 > obsess_over_services=0 > aggregate_status_updates=1 > status_update_interval=20 > > Also, here's the output from nagiostats: > Nagios Stats 2.6 > Copyright (c) 2003-2005 Ethan Galstad (www.nagios.org) > Last Modified: 11-27-2006 > License: GPL > > CURRENT STATUS DATA > > Status File: /var/nagios/status.dat > Status File Age: 0d 0h 0m 3s > Status File Version: 2.6 > > Program Running Time: 0d 1h 59m 5s > > Total Services: 2836 > Services Checked: 2836 > Services Scheduled: 2758 > Active Service Checks:2836 > Passive Service Checks: 0 All services aren't being scheduled, but you have no passive service checks. Have you disabled checks of 78 services? > Total Service State Change: 0.000 / 12.370 / 0.007 % > Active Service Latency: 0.006 / 10.237 / 0.906 sec > Active Service Execution Time:0.047 / 10.159 / 0.180 sec > Active Service State Change: 0.000 / 12.370 / 0.007 % > Active Services Last 1/5/15/60 min: 477 / 2678 / 2745 / 2754 > Passive Service State Change: 0.000 / 0.000 / 0.000 % > Passive Services Last 1/5/15/60 min:
[Nagios-users] Performance issues, too
Hi! Recently I have run into the very same performance issues as Daniel Meyer (or so it seems). However, I'm not quite sure about it. Here's the gist of it. Currently, service check latency slowly creeps up. As it is now, it starts out at a little over 1s and after about 12 hours it's in the area of about 90s. It keeps climbing after that. Here's the output of nagios -s:

Nagios 2.6
Copyright (c) 1999-2006 Ethan Galstad (http://www.nagios.org)
Last Modified: 11-27-2006
License: GPL

Warning: Contact group 'Singles-Truppe' is not used in any host/service definitions or host/service escalations!

Projected scheduling information for host and service checks is listed below. This information assumes that you are going to start running Nagios with your current config files.

HOST SCHEDULING INFORMATION
---
Total hosts: 330
Total scheduled hosts: 0
Host inter-check delay method: SMART
Average host check interval: 0.00 sec
Host inter-check delay: 0.00 sec
Max host check spread: 10 min
First scheduled check: N/A
Last scheduled check: N/A

SERVICE SCHEDULING INFORMATION
---
Total services: 2836
Total scheduled services: 2836
Service inter-check delay method: SMART
Average service check interval: 2225.56 sec
Inter-check delay: 0.21 sec
Interleave factor method: SMART
Average services per host: 8.59
Service interleave factor: 9
Max service check spread: 10 min
First scheduled check: Tue Dec 19 11:21:45 2006
Last scheduled check: Tue Dec 19 11:31:47 2006

CHECK PROCESSING INFORMATION
Service check reaper interval: 5 sec
Max concurrent service checks: Unlimited

PERFORMANCE SUGGESTIONS
---
I have no suggestions - things look okay.

This all looks peachy - I think. What I don't get is this line: Average service check interval: 2225.56 sec It seems to me that this is either a skewed value, stemming from my history of looong latencies (at one point we were beyond 9000 seconds). *Or* it is indicative of a misconfiguration on my part. 
If the latter is the case, I'd be eager, nay ecstatic to hear what I did wrong. Here are a few of the config vars that might influence this:

sleep_time=0.25
service_reaper_frequency=5
max_concurrent_checks=0
max_host_check_spread=10
host_inter_check_delay_method=s
service_interleave_factor=s
command_check_interval=1
obsess_over_services=0
aggregate_status_updates=1
status_update_interval=20

Also, here's the output from nagiostats:

Nagios Stats 2.6
Copyright (c) 2003-2005 Ethan Galstad (www.nagios.org)
Last Modified: 11-27-2006
License: GPL

CURRENT STATUS DATA

Status File: /var/nagios/status.dat
Status File Age: 0d 0h 0m 3s
Status File Version: 2.6

Program Running Time: 0d 1h 59m 5s

Total Services: 2836
Services Checked: 2836
Services Scheduled: 2758
Active Service Checks: 2836
Passive Service Checks: 0
Total Service State Change: 0.000 / 12.370 / 0.007 %
Active Service Latency: 0.006 / 10.237 / 0.906 sec
Active Service Execution Time: 0.047 / 10.159 / 0.180 sec
Active Service State Change: 0.000 / 12.370 / 0.007 %
Active Services Last 1/5/15/60 min: 477 / 2678 / 2745 / 2754
Passive Service State Change: 0.000 / 0.000 / 0.000 %
Passive Services Last 1/5/15/60 min: 0 / 0 / 0 / 0
Services Ok/Warn/Unk/Crit: 2814 / 6 / 0 / 16
Services Flapping: 0
Services In Downtime: 0

Total Hosts: 330
Hosts Checked: 330
Hosts Scheduled: 0
Active Host Checks: 330
Passive Host Checks: 0
Total Host State Change: 0.000 / 0.000 / 0.000 %
Active Host Latency: 0.000 / 1.000 / 0.888 sec
Active Host Execution Time: 0.030 / 4.059 / 0.112 sec
Active Host State Change: 0.000 / 0.000 / 0.000 %
Active Hosts Last 1/5/15/60 min: 0 / 12 / 12 / 12
Passive Host State Change: 0.000 / 0.000 / 0.000 %
Passive Hosts Last 1/5/15/60 min: 0 / 0 / 0 / 0
Hosts Up/Down/Unreach: 329 / 1 / 0
Hosts Flapping: 0
Hosts In Downtime: 0

Hardware is a dual-2.8GHz Xeon, 2G RAM and a 100 FDX interface. LoadAvg is around 1.6, sometimes gets to 1.9. CPUs are both around 40% idle most of the time. 
I see about 300 context switches and 500 interrupts per second. The network load is negligible, ditto the packet rate. The way these figures look I don't see a performance problem per se, but maybe I have