[Nagios-users] High latency issues on Nagios 3.2.1

2010-04-06 Thread Cary Petterborg
We are currently working towards migrating from Nagios 2.7 to 3.2. We have 
37,000+ services and 3,000+ hosts. We have a test environment with an 8 CPU 
system running Nagios 3.2.1 and we are getting high latency of 330+ seconds. 
The configuration has the large installation tweaks turned on and 
max_concurrent_checks=1000. The load average on the system is around 3 to 
4, but CPU utilization is less than 50% on average, with peaks of 80+% 
that last about 1 second and occur every 10 to 15 seconds.

So my question is this - Is there something that we can do to lower the latency 
and increase the CPU utilization? Is there some limiting factor with our 
configuration that we need to tweak, or is it just too many checks for the main 
Nagios process to handle in the time frame, or something else?
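
One rough way to frame the "too many checks for the time frame" question is a back-of-envelope calculation, sketched here in Python. Only the host and service counts come from this message. The 5-minute check interval and the 2-second average plugin run time are assumptions made purely for illustration, and the function names are hypothetical.

```python
def required_throughput(num_checks: int, interval_s: float) -> float:
    """Checks per second needed to run every check once per interval."""
    return num_checks / interval_s


def average_concurrency(num_checks: int, interval_s: float,
                        avg_duration_s: float) -> float:
    """Little's law: checks in flight = start rate * average check duration."""
    return required_throughput(num_checks, interval_s) * avg_duration_s


if __name__ == "__main__":
    checks = 37_000 + 3_000   # services + hosts, from the message
    interval = 300.0          # ASSUMED: every check runs once per 5 minutes
    duration = 2.0            # ASSUMED: average plugin run time in seconds
    rate = required_throughput(checks, interval)
    in_flight = average_concurrency(checks, interval, duration)
    print(f"needed rate: {rate:.1f} checks/s")
    print(f"average checks in flight: {in_flight:.1f}")
```

If the in-flight figure for a real installation comes out well below max_concurrent_checks, that would suggest the limit itself is not what is holding the checks back, and that the rate at which checks are started and reaped is the more likely place to look.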

I can provide any information that would make it possible to lower the latency.

Thanks!!


Cary Petterborg
ICS Monitoring
The Church of Jesus Christ of Latter-day Saints
Office Phone: 801-240-8267
Email:  petterbor...@ldschurch.org


 NOTICE: This email message is for the sole use of the intended recipient(s) 
and may contain confidential and privileged information. Any unauthorized 
review, use, disclosure or distribution is prohibited. If you are not the 
intended recipient, please contact the sender by reply email and destroy all 
copies of the original message.



--
___
Nagios-users mailing list
Nagios-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting 
any issue. 
::: Messages without supporting info will risk being sent to /dev/null


[Nagios-users] Nagios 3.2 and max_concurrent_checks=0

2009-12-15 Thread Cary Petterborg
I'm doing some testing for migrating our installation to Nagios 3.2. The test 
server is running 3.2 on an 8 CPU box with 34,000 active service checks and 
3,000 active host checks. The initial configuration file had 
max_concurrent_checks=0, but latency was about 9,000 seconds. I changed it to 
max_concurrent_checks=200 and the latency went down to about 7,000 seconds. I 
then set it to 2,000 and the latency dropped to about 200 seconds. I currently 
have it set to 100,000 and latency has not changed from about 200 seconds.

From all the documentation I have seen, if max_concurrent_checks is set to 
zero, there should be no limit on the number of concurrent checks, but this 
doesn't appear to be the case. Is there some other part of the configuration 
that I'm missing which would make max_concurrent_checks=0 be limited instead 
of unlimited?


Cary Petterborg
ICS Monitoring
The Church of Jesus Christ of Latter-day Saints
Office Phone: 801-240-8267
Email:  petterbor...@ldschurch.org




[Nagios-users] A fix for EXTREME slowness

2008-04-01 Thread Cary Petterborg
I'm resending this email because there was not a single response to my
previous email. I have to think that someone else has run into this
problem, and I would like to know what others have done and suggestions
for the implementation of a good fix. We have solved this problem in the
short term, but we want to implement a more robust long-term solution.
We had a huge performance increase from fixing the problem, so if you
have noticed your web server taking a long time to process your
status.cgi and extinfo.cgi page requests, please read on.


This email has a description of the problem, the symptoms, our interim
fix, and a possible long term fix. If you have been noticing large (or
larger) load times for status.cgi and/or extinfo.cgi, please read this
entire message.


We have recently had our comments.dat file grow to a much larger size
(due to increased need for comments). This file grew to about 4.8MB. To
read or write this size of file is not a problem, but the processing of
it in status.cgi and extinfo.cgi was slowing things down significantly.
To give you an idea, the page load times went from a few seconds to over
a minute on our production systems.

Since the load times were so bad we started looking for the cause. It
became evident that it was the processing of the comments.dat file. We
created a program to take the comments more than 30 days old and archive
them into an archive file. This reduced the load time so significantly
that we decided to do some tests on a non-production system.
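
The archiving step described above could look roughly like the following Python sketch. It is not the program we actually ran. It assumes the Nagios 2.x comments.dat layout of "hostcomment {" and "servicecomment {" blocks, each containing an "entry_time=<epoch seconds>" line; check that against your own file before relying on it. The function names are hypothetical.

```python
import time
from typing import List, Tuple

# ASSUMPTION: only these block types are comments; other blocks
# (for example an "info {" header block) are always kept.
COMMENT_TYPES = ("hostcomment", "servicecomment")


def _is_old_comment(block: List[str], cutoff: int) -> bool:
    """True if the block is a comment whose entry_time is before cutoff."""
    header = block[0].strip().rstrip("{").strip()
    if header not in COMMENT_TYPES:
        return False
    for line in block[1:]:
        key, _, value = line.strip().partition("=")
        if key == "entry_time" and value.isdigit():
            return int(value) < cutoff
    return False  # no usable timestamp: keep the comment


def split_comments(text: str, cutoff: int) -> Tuple[str, str]:
    """Return (kept, archived) file contents for the given cutoff epoch."""
    kept: List[str] = []
    archived: List[str] = []
    block: List[str] = []
    in_block = False
    for line in text.splitlines(keepends=True):
        stripped = line.strip()
        if not in_block and stripped.endswith("{"):
            in_block = True
            block = [line]
        elif in_block:
            block.append(line)
            if stripped == "}":
                in_block = False
                target = archived if _is_old_comment(block, cutoff) else kept
                target.extend(block)
        else:
            kept.append(line)  # blank lines and anything outside a block
    if in_block:
        kept.extend(block)  # unterminated block: leave it untouched
    return "".join(kept), "".join(archived)


def archive_file(path: str, archive_path: str, days: int = 30) -> None:
    """Move comments older than `days` from `path` into `archive_path`."""
    cutoff = int(time.time()) - days * 86400
    with open(path) as src:
        kept, archived = split_comments(src.read(), cutoff)
    with open(archive_path, "a") as dst:
        dst.write(archived)
    with open(path, "w") as src:
        src.write(kept)
```

It is probably safest to run something like this only while Nagios is stopped, so that the daemon does not rewrite the file underneath it.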

We took the large 4.8MB file and reduced the number of entries until
there were only 30 days worth in the file (down to 90, 80, 70, 60, 50,
40 and finally 30 days). Then we ran tests on status.cgi for each of
these filesizes. Using just a crude stopwatch we measured the times it
took to load the various pages. I have created a spreadsheet file and
graph for the data. The test seems to indicate that the size of the
comments.dat file dramatically affects the page load times. On the test
server, the load times for the 4.8MB file were in the 9 to 10 second
range, while those for the 2MB file were about 2 seconds. Here is a table
of the results:

File size (MB) | Time (s)
---------------+---------
4.85           | 9.5
3.95           | 6.0
3.00           | 2.8
2.03           | 2.0

This seems to show growth that is closer to exponential than linear. We have
ended up in the short term archiving the old data, reducing the file to
the much more reasonable 2MB size and cutting the times significantly.
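
One hedged way to check the growth claim against the four measurements in the table is to compare the measured time for the largest file with what purely linear scaling from the smallest file would predict, and to estimate an end-to-end power-law exponent. This is only a sketch on stopwatch data, and the function names are hypothetical.

```python
import math

# (file size in MB, page load time in seconds), from the table above
measurements = [(4.85, 9.5), (3.95, 6.0), (3.00, 2.8), (2.03, 2.0)]


def loglog_slope(p, q):
    """Exponent k of a power law t ~ size**k passing through p and q."""
    (s1, t1), (s2, t2) = p, q
    return math.log(t2 / t1) / math.log(s2 / s1)


def linear_prediction(base, size):
    """Load time at `size` if time scaled in proportion to file size."""
    s0, t0 = base
    return t0 * size / s0


smallest = min(measurements)   # the 2.03 MB run
largest = max(measurements)    # the 4.85 MB run
exponent = loglog_slope(smallest, largest)
predicted = linear_prediction(smallest, largest[0])

print(f"linear prediction for {largest[0]} MB: {predicted:.1f} s "
      f"(measured {largest[1]} s)")
print(f"end-to-end power-law exponent: {exponent:.2f}")
```

Linear scaling from the 2.03 MB point would predict under 5 seconds for the 4.85 MB file, against the 9.5 seconds measured, and the end-to-end exponent comes out a little under 2. That is clearly faster than linear, although four points cannot really separate polynomial from exponential growth.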

The results on the production server were even more dramatic, reducing the
load time from 70 seconds to about 3 seconds. This was more of a problem
on the production system because there were more status.cgi processes
running at the same time. A 95% reduction in the load time is very
significant.

Are there others who have seen this as a big problem, or is it not a
typical problem that has been encountered? Have others found a way to
fix this problem other than reducing the number of comments in the
comments file?

So there seems to be a need for database-style access to this information,
rather than a "parse this big file and see what drops out that we want"
approach. This could easily be done with a real relational database, or
even a simpler database, to retrieve only the comments for the
host/service desired. We are willing to do the work on this, but would
like it to be incorporated into the Nagios code base so that we do not
have to port this functionality on future upgrades.
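
To make the idea concrete, here is a minimal sketch of indexed, per-host/service lookup using Python's built-in sqlite3 module. The table layout, column names, and function names are all hypothetical and are not taken from Nagios; the point is only that a lookup touches the rows for one host/service instead of parsing the whole file.

```python
import sqlite3

# HYPOTHETICAL schema, for illustration only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS comments (
    comment_id          INTEGER PRIMARY KEY,
    host_name           TEXT NOT NULL,
    service_description TEXT NOT NULL DEFAULT '',  -- '' means a host comment
    entry_time          INTEGER NOT NULL,
    author              TEXT,
    comment_data        TEXT
);
CREATE INDEX IF NOT EXISTS idx_comments_target
    ON comments (host_name, service_description);
"""


def open_db(path=":memory:"):
    """Open (or create) the comments database and ensure the schema exists."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_comment(conn, host, service, entry_time, author, text):
    """Store one comment; pass service='' for a host-level comment."""
    conn.execute(
        "INSERT INTO comments (host_name, service_description, entry_time,"
        " author, comment_data) VALUES (?, ?, ?, ?, ?)",
        (host, service, entry_time, author, text),
    )
    conn.commit()


def comments_for(conn, host, service=""):
    """Fetch only the comments for one host/service, oldest first."""
    cur = conn.execute(
        "SELECT entry_time, author, comment_data FROM comments"
        " WHERE host_name = ? AND service_description = ?"
        " ORDER BY entry_time",
        (host, service),
    )
    return cur.fetchall()


if __name__ == "__main__":
    db = open_db()
    add_comment(db, "web1", "HTTP", 1206000000, "cary", "restarted apache")
    add_comment(db, "web1", "", 1206000100, "cary", "host moved racks")
    print(comments_for(db, "web1", "HTTP"))
```

With the index on (host_name, service_description), a page that needs the comments for a single service would no longer pay for every other comment on the system.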

If you are interested in this type of enhancement, please let me know.
In addition, if you have suggestions for the implementation of a real
comments database (yes, we are experienced in this area, and have OUR
ideas of how we want to implement it, but we'd like to know of other
opinions so that we can increase the likelihood of it being incorporated
into the standard release), please let me know.

Thanks!

Cary Petterborg




[Nagios-users] comments.dat file size causing EXTREME slowness

2008-03-25 Thread Cary Petterborg


[Nagios-users] bug? - After host Scheduled Downtime, alarming services don't send notifications

2008-03-20 Thread Cary Petterborg
Nagios 2.5: We've had some cases where a host comes out of scheduled
downtime, but a service is still in a critical state. No notifications are
sent out about this service. Is this the proper behavior or a bug? We feel
it is a bug. If it is a bug, has it been fixed in later releases (later
than 2.5)?

Also related - If a host comes out of scheduled downtime, and it's still
in an alert state, will the notification number be reset or will it
continue with an increasing number? We feel the number should be reset,
but if it isn't, is there a reason it is not reset?

Thanks!

Cary



[Nagios-users] Q: Need to use a default user, but still allow changing to another user

2008-03-18 Thread Cary Petterborg
We are trying to make things easy for "managers" who want to look at
statuses without logging in (it is a request by the managers, not
something WE thought up on our own to help them). This can be done by
setting a default user, right? So you set the default user, but then you
can't log in as a different user to get different views, etc.

Does anyone have a solution that they are using for this type of case? I
know I can get around this doing some programming, but if someone
already cracked this nut, it would save me a lot of time for other work.

Thanks!

Cary
