Re: [Nagios-users] Nagios checkresults queue grows over time

2008-02-11 Thread Justin Hitt
Update on 'checkresults' queue growth, Nagios 3.0 rc1 ...
http://www.nagiosexchange.org/nagios-users.34.0.html?&tx_maillisttofaq_pi1[mode]=1&tx_maillisttofaq_pi1[showUid]=9116

I can keep the system from coming down completely by eliminating host
checks.  The rapid growth of the queue seems to come from Nagios reading
stale entries and scheduling rechecks, whose results then go stale in
turn because Nagios doesn't get to them in time.
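
The feedback loop suspected above can be illustrated with a toy model.
This is only a sketch under stated assumptions: the function name, the
parameters, and the "one tick" time unit are all hypothetical, and it
does not reproduce how Nagios actually schedules or reaps checks.  It
only shows that once results wait longer than the allowed age, every
stale result that gets read puts a new one back in the queue.

```python
from collections import deque


def simulate_queue(ticks, new_per_tick, processed_per_tick, max_age):
    """Toy model (not Nagios's scheduler): stale results trigger rechecks.

    Returns the queue depth at the end of each tick.
    """
    queue = deque()  # each entry is the tick at which a result was written
    depths = []
    for now in range(ticks):
        # Regularly scheduled checks write their results into the queue.
        queue.extend([now] * new_per_tick)
        rechecks = 0
        budget = processed_per_tick
        # The reader works through the oldest results first.
        while queue and budget > 0:
            written = queue.popleft()
            budget -= 1
            if now - written > max_age:
                # Too old to use: discard it and schedule a recheck.
                rechecks += 1
        # Rechecks produce fresh results at the back of the queue.
        queue.extend([now] * rechecks)
        depths.append(len(queue))
    return depths
```

With `simulate_queue(30, 10, 8, 3)` the queue first grows by the plain
shortfall of 2 per tick; once everything being read is already stale it
grows by the full 10 per tick, because reading no longer shrinks it.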

Without host checks, I get fewer ...
[1202750959] Warning: The check of service 'URL' on host 'FQDN0.com'
looks like it was orphaned (results never came back).  I'm scheduling
an immediate check of the service...
... in the logs.

Have any 'checkresults' queuing issues been resolved in RC2?  I
didn't see anything specific in the Changelog.  Is anyone else
experiencing a queue that grows slowly over time and doesn't process
service checks in a timely manner?

Best,

Justin

___
Nagios-users mailing list
Nagios-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting 
any issue. 
::: Messages without supporting info will risk being sent to /dev/null


[Nagios-users] Nagios checkresults queue grows over time

2008-02-08 Thread Justin Hitt
I have two Nagios 3.0 rc1 systems: (A) on a 2.8 GHz Solaris 10 system
with 212 hosts, and (B) on a multi-core VPS with 2,916 hosts.  On both
systems, after the initial host check, the
[/usr/local/nagios/var/spool/checkresults] directory grows in size until
Nagios is non-responsive.

(A) Has a modified configuration with longer
"cached_host_check_horizon=2700" and
"cached_service_check_horizon=1800" settings.  I tried to stretch out
the time frame in which checks were accepted.

(B) Has a more standard configuration with reasonable cache counts.

Both systems are using "use_large_installation_tweaks=1" and otherwise
use a standard configuration.  Each system allows 45 minutes to finish
the host checks.  I've also tried this configuration without host
checks.

Both systems have very low CPU utilization after the initial host
check and hardly go over 20% during regular operations.

The number of 'check' files in the checkresults queue does go up and
down, often dropping by as much as 200 checks, then popping back up by
twice as much.  I've tried tuning "max_check_result_file_age=3600",
which tends to make the queue last longer.
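
To put numbers on that up-and-down behaviour, the spool directory can
be sampled periodically.  The sketch below is one possible way to do
it; the function name `queue_snapshot` is made up for illustration, and
it assumes that every regular file directly inside the directory counts
as one queued result.

```python
import os
import time


def queue_snapshot(spool_dir, now=None):
    """Return (file_count, oldest_age_seconds) for regular files in spool_dir."""
    now = time.time() if now is None else now
    count = 0
    oldest_age = 0.0
    with os.scandir(spool_dir) as entries:
        for entry in entries:
            if not entry.is_file(follow_symlinks=False):
                continue
            try:
                mtime = entry.stat(follow_symlinks=False).st_mtime
            except FileNotFoundError:
                continue  # file was consumed between listing and stat
            count += 1
            oldest_age = max(oldest_age, now - mtime)
    return count, oldest_age


if __name__ == "__main__":
    # Path taken from the message above; adjust for other installs.
    print(queue_snapshot("/usr/local/nagios/var/spool/checkresults"))
```

Logging the two values once a minute should show whether the oldest
file keeps getting older (results are not being read) or whether only
the count grows (results arrive faster than they are read).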

I'm also purging the queue of files older than 90 minutes with ...
0,15,30,45 * * * * ( /usr/local/bin/find /usr/local/nagios/var/spool/checkresults -type f -mmin +90 -exec /bin/rm -f {} \; ) > /dev/null 2>&1
... in the crontab.
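
For testing the purge without deleting anything, roughly the same
logic can be written as a small script with a dry-run switch.  This is
a sketch, not a drop-in replacement: the function name and the
`dry_run` parameter are made up here, it only looks at files directly
inside the directory (find also descends into subdirectories), and its
age cutoff is exact where find's -mmin rounds to whole minutes.

```python
import os
import time


def purge_old_results(spool_dir, max_age_minutes=90, now=None, dry_run=False):
    """Delete regular files older than max_age_minutes; return their names."""
    now = time.time() if now is None else now
    cutoff = now - max_age_minutes * 60
    removed = []
    with os.scandir(spool_dir) as entries:
        for entry in entries:
            try:
                if not entry.is_file(follow_symlinks=False):
                    continue
                if entry.stat(follow_symlinks=False).st_mtime >= cutoff:
                    continue  # still young enough to keep
                if not dry_run:
                    os.remove(entry.path)
            except FileNotFoundError:
                continue  # already consumed or removed by someone else
            removed.append(entry.name)
    return sorted(removed)
```

Calling it with `dry_run=True` returns the names that would be removed,
which makes it easy to see how many results are actually reaching the
90-minute mark.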

Finally, here's what I see in the log files ...
[1202485459] Warning: The check of host 'FQDN0.com' looks like it was
orphaned (results never came back).  I'm scheduling an immediate check
of the host...
[1202485459] Warning: The check of host 'FQDN1.com' looks like it was
orphaned (results never came back).  I'm scheduling an immediate check
of the host...
[1202485459] Warning: The check of host 'FQDN2.com' looks like it was
orphaned (results never came back).  I'm scheduling an immediate check
of the host...
... which again is why I tuned "max_check_result_file_age" and am
purging the queue of really old files.  (I've also tested very short
"max_check_result_file_age" values; at the current setting I've
minimized flapping.)
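
To see which hosts or services are orphaned most often, the warnings
can be tallied from the log.  The sketch below assumes each warning
sits on a single line in the actual log file (the lines quoted in this
thread are wrapped by the mail client) and that the wording matches the
messages quoted here; the names `ORPHAN_RE` and `count_orphans` are
made up for illustration.

```python
import re
from collections import Counter

# Matches both the host form and the "service '...' on host '...'" form.
ORPHAN_RE = re.compile(
    r"^\[(?P<ts>\d+)\] Warning: The check of "
    r"(?:service '(?P<service>[^']*)' on )?host '(?P<host>[^']*)' "
    r"looks like it was orphaned"
)


def count_orphans(lines):
    """Count orphan warnings per (host, service); service is None for host checks."""
    counts = Counter()
    for line in lines:
        match = ORPHAN_RE.match(line)
        if match:
            counts[(match.group("host"), match.group("service"))] += 1
    return counts
```

Passing an open log file to `count_orphans` and printing
`most_common(20)` on the result gives a quick list of the worst
offenders.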

Other things I tried that didn't improve the situation ...
 -- Nice'd the nagios process to give it the highest priority possible.
Increased CPU load a little, but over time got the same idle
conditions after checks were complete.
 -- Stretched out checks to > 15 minutes for critical services and > 2
hours for "nice to know about" services.  Made queues fill up less
frequently.
 -- Looked at disk performance and swapping.  Neither system is
swapping nor does it have bottlenecks around disk issues.

With the purge routine, I won't see a file in the queue older than 90
minutes.  Does this mean "max_check_result_file_age" isn't working?
What other parameters can I adjust?  Does anyone have any ideas about
what's going on?

Best,

Justin
