[Nagios-users] High latency issues on Nagios 3.2.1
We are currently working towards migrating from Nagios 2.7 to 3.2. We have 37,000+ services and 3,000+ hosts. We have a test environment with an 8-CPU system running Nagios 3.2.1, and we are getting high latency of 330+ seconds. The configuration has the large installation tweaks turned on and max_concurrent_checks=1000. The load average on the system is around 3 to 4, but CPU utilization is less than 50% on average, with peaks of 80+% that last about 1 second, roughly every 10 to 15 seconds.

So my question is this: is there something we can do to lower the latency and increase the CPU utilization? Is there some limiting factor in our configuration that we need to tweak, or is it just too many checks for the main Nagios process to handle in the time frame, or something else? I can provide any information that would help in lowering the latency.

Thanks!!

Cary Petterborg
ICS Monitoring
The Church of Jesus Christ of Latter-day Saints
Office Phone: 801-240-8267
Email: petterbor...@ldschurch.org

NOTICE: This email message is for the sole use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message.

___
Nagios-users mailing list
Nagios-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting any issue.
::: Messages without supporting info will risk being sent to /dev/null
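One way to frame the "too many checks?" question in the message above is a back-of-the-envelope estimate of how many checks must be in flight at once. The sketch below uses the check counts from the message; the 5-minute check interval and the average plugin run times are assumptions that do not come from the message, and `required_concurrency` is a hypothetical helper name.

```python
def required_concurrency(num_checks, check_interval_s, avg_exec_s):
    """Little's law: average checks in flight = check rate * time per check."""
    checks_per_second = num_checks / check_interval_s
    return checks_per_second * avg_exec_s


NUM_CHECKS = 37_000 + 3_000  # services + hosts, from the message
CHECK_INTERVAL_S = 300       # ASSUMPTION: every check is scheduled every 5 minutes

# ASSUMPTION: hypothetical average plugin run times, in seconds
for avg_exec_s in (1, 5, 10):
    needed = required_concurrency(NUM_CHECKS, CHECK_INTERVAL_S, avg_exec_s)
    print(f"{avg_exec_s:>2}s per check -> about {needed:.0f} checks in flight")
```

If the estimate for the real interval and run times comes out well below max_concurrent_checks=1000, that would suggest the concurrency limit itself is not what is holding the checks back.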
[Nagios-users] Nagios 3.2 and max_concurrent_checks=0
I'm doing some testing for migrating our installation to Nagios 3.2. The test server is running 3.2 on an 8-CPU box with 34,000 active service checks and 3,000 active host checks. The initial configuration file had max_concurrent_checks=0, but latency was about 9,000 seconds. I changed it to max_concurrent_checks=200 and the latency went down to about 7,000 seconds. I then set it to 2,000 and the latency dropped to about 200 seconds. I currently have it set to 100,000 and latency has not changed from about 200 seconds.

From all the documentation I have seen, if max_concurrent_checks is set to zero, there should be no limit on the number of concurrent checks, but this doesn't appear to be the case. Is there some other part of the configuration that I'm missing which would make max_concurrent_checks=0 be limited instead of unlimited?

Cary Petterborg
ICS Monitoring
The Church of Jesus Christ of Latter-day Saints
Office Phone: 801-240-8267
Email: petterbor...@ldschurch.org
[Nagios-users] A fix for EXTREME slowness
I'm resending this email because there was not a single response to my previous email. I have to think that someone else has run into this problem, and I would like to know what others have done and hear suggestions for implementing a good fix. We have solved this problem in the short term, but we want to implement a more robust long-term solution. We had a huge performance increase from fixing the problem, so if you have noticed your web server taking a long time to process your status.cgi and extinfo.cgi page requests, please read on.

This email has a description of the problem, the symptoms, our interim fix, and a possible long-term fix. If you have been noticing large (or larger) load times for status.cgi and/or extinfo.cgi, please read this entire message.

Our comments.dat file recently grew much larger (due to an increased need for comments), to about 4.8MB. Reading or writing a file of this size is not a problem, but the processing of it in status.cgi and extinfo.cgi was slowing things down significantly. To give you an idea, the page load times went from a few seconds to over a minute on our production systems.

Since the load times were so bad, we started looking for the cause. It became evident that it was the processing of the comments.dat file. We created a program to take the comments more than 30 days old and move them into an archive file. This reduced the load time so significantly that we decided to do some tests on a non-production system. We took the large 4.8MB file and reduced the number of entries in steps until there were only 30 days' worth in the file (down to 90, 80, 70, 60, 50, 40 and finally 30 days). Then we ran tests on status.cgi for each of these file sizes, measuring the page load times with just a crude stopwatch. I have created a spreadsheet file and graph for the data. The test seems to indicate that the size of the comments.dat file dramatically affects the page load times.
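A minimal sketch of such an archiving step follows. It is not the program described in the message. It assumes that comments.dat stores each comment as a `hostcomment { ... }` or `servicecomment { ... }` block containing an `entry_time=<epoch seconds>` line, which should be checked against the actual file before use; `split_comments` is a hypothetical helper name.

```python
import re

# ASSUMPTION: blocks open with "<name> {" and close with a line holding only "}"
BLOCK_START_RE = re.compile(r"^\s*\w+\s*\{\s*$")
# ASSUMPTION: each comment block has a line "entry_time=<epoch seconds>"
ENTRY_TIME_RE = re.compile(r"^\s*entry_time=(\d+)\s*$", re.MULTILINE)


def split_comments(text, cutoff):
    """Return (kept, archived): blocks with entry_time < cutoff go to archived."""
    kept, archived = [], []
    block = None  # lines of the block being read, or None when outside a block
    for line in text.splitlines(keepends=True):
        if block is None:
            if BLOCK_START_RE.match(line):
                block = [line]
            else:
                kept.append(line)  # text outside any block is left untouched
            continue
        block.append(line)
        if line.strip() == "}":
            body = "".join(block)
            match = ENTRY_TIME_RE.search(body)
            if match and int(match.group(1)) < cutoff:
                archived.append(body)
            else:
                kept.append(body)  # blocks without entry_time are kept
            block = None
    if block is not None:
        kept.append("".join(block))  # unterminated block: keep as-is
    return "".join(kept), "".join(archived)
```

A 30-day cutoff would be `time.time() - 30 * 86400`. The kept text would replace comments.dat and the archived text would be appended to the archive file.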
On the test server, the load times for the 4.8MB file were in the 9 to 10 second range, while those for the 2MB file were about 2 seconds. Here is a table of the results:

File size (MB) | Time (s)
---------------+---------
          4.85 | 9.5
          3.95 | 6.0
          3.00 | 2.8
          2.03 | 2.0

The growth looks closer to exponential than linear. In the short term we have ended up archiving the old data, reducing the file to the much more reasonable 2MB size and cutting the times significantly. The results on the production server are even more dramatic, reducing the load time from 70 seconds to about 3 seconds. This was more of a problem on the production system because there were more status.cgi processes running at the same time. A 95% reduction in the load time is very significant.

Are there others who have seen this as a big problem, or is it not a typical problem? Have others found a way to fix it other than reducing the number of comments in the comments file?

So there seems to be a need to make access to this information more database-like, rather than "parse this big file and see what drops out that we want". This could easily be done with a real relational database, or even a simpler database, to retrieve only the comments for the host/service desired. We are willing to do the work on this, but would like it to be incorporated into the Nagios code base so that we do not have to port this functionality on future upgrades.

If you are interested in this type of enhancement, please let me know. In addition, if you have suggestions for the implementation of a real comments database, please let me know. (Yes, we are experienced in this area and have OUR ideas of how we want to implement it, but we'd like to hear other opinions so that we can increase the likelihood of it being incorporated into the standard release.)

Thanks!
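To make the "database-type access" idea concrete, here is a small sketch using SQLite from Python's standard library. It is only an illustration of indexed lookup by host/service, not a proposed Nagios schema: the table name, column names, and the helpers `build_comment_db` and `comments_for` are all hypothetical.

```python
import sqlite3


def build_comment_db(rows):
    """Load (host, service, entry_time, author, text) rows into an indexed table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE comments ("
        " host_name TEXT NOT NULL,"
        " service_description TEXT,"  # NULL for host comments
        " entry_time INTEGER NOT NULL,"
        " author TEXT,"
        " comment_data TEXT)"
    )
    # The index lets a lookup touch only the rows for one host/service
    conn.execute(
        "CREATE INDEX idx_comments_target"
        " ON comments (host_name, service_description)"
    )
    conn.executemany("INSERT INTO comments VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def comments_for(conn, host, service=None):
    """Return comments for one host (service=None) or one host/service pair."""
    if service is None:
        cur = conn.execute(
            "SELECT entry_time, author, comment_data FROM comments"
            " WHERE host_name = ? AND service_description IS NULL"
            " ORDER BY entry_time",
            (host,),
        )
    else:
        cur = conn.execute(
            "SELECT entry_time, author, comment_data FROM comments"
            " WHERE host_name = ? AND service_description = ?"
            " ORDER BY entry_time",
            (host, service),
        )
    return cur.fetchall()
```

With this layout a page that shows one service would call `comments_for(conn, "web1", "HTTP")` instead of parsing every comment in the file.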
Cary Petterborg
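The timing table in the message above can be checked with a few lines of arithmetic. The sketch below compares only the smallest and largest measurement and derives the exponent k that a power law (time proportional to size**k) would need to connect them; `growth_exponent` is a hypothetical helper name, and the data are the four stopwatch readings from the table.

```python
import math

# (file size in MB, status.cgi load time in seconds), from the table in the message
measurements = [(2.03, 2.0), (3.00, 2.8), (3.95, 6.0), (4.85, 9.5)]


def growth_exponent(points):
    """Exponent k of time ~ size**k implied by the first and last points."""
    (size0, time0), (size1, time1) = points[0], points[-1]
    return math.log(time1 / time0) / math.log(size1 / size0)


size_ratio = measurements[-1][0] / measurements[0][0]
time_ratio = measurements[-1][1] / measurements[0][1]
k = growth_exponent(measurements)
print(f"size x{size_ratio:.2f}, time x{time_ratio:.2f}, implied exponent {k:.2f}")
```

The file grows by a factor of about 2.4 while the load time grows by a factor of 4.75, an implied exponent of about 1.8. That confirms faster-than-linear growth, although four stopwatch readings are too few to tell an exponential curve from a power law.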
[Nagios-users] bug? - After host Scheduled Downtime, alarming services don't send notifications
Nagios 2.5: We've had some cases where a host comes out of scheduled downtime, but a service is still critical. No notifications are sent out about this service. Is this the proper behavior or a bug? We feel it is a bug. If it is a bug, has it been fixed in later releases (later than 2.5)?

Also related: if a host comes out of scheduled downtime and it is still in an alert state, will the notification number be reset, or will it continue with an increasing number? We feel the number should be reset. If it is not, is there a reason?

Thanks!
Cary
[Nagios-users] Q: Need to use a default user, but still allow changing to another user
We are trying to make things easy for "managers" who want to look at statuses without logging in (this is a request from the managers, not something WE thought up on our own to help them). This can be done by setting a default user, right? But once the default user is set, you can't log in as a different user to get different views, etc.

Does anyone have a solution that they are using for this type of case? I know I can get around this with some programming, but if someone has already cracked this nut, it would save me a lot of time for other work.

Thanks!
Cary