Re: [Nagios-users] Performance issues, too

2007-01-09 Thread Tobias Klausmann
Hi! 

On Tue, 02 Jan 2007, Daniel Meyer wrote:
> Program Running Time: 10d 21h 22m 42s
> 
> So, for almost eleven days nagios runs smoothly now, no more
> latency problems. I'll try it again with EPN (but still without
> perlcache) now.

I've finally gotten around to recompiling Nagios without EPN and
without the Perlcache. As you can see from these graphs:

http://eric.schwarzvogel.de/~klausman/nagios-perf-3/

(especially
http://eric.schwarzvogel.de/~klausman/nagios-perf-3/latencies.png
)

It didn't help much. While the curve now has a flatter
slope and even goes down in spots, it still seems to keep
increasing on the whole. Even if it stayed at the level we saw
last night (~100s check latency), I wouldn't be too happy. With a
300s check interval, 100s of latency is just too much (IMHO).

What's left is enabling the Perlcache again (while keeping EPN off).
I'm not terribly hopeful that that will help, but I'm running out
of ideas quickly.

Also note that switching *off* EPN/PC led to *less* CPU usage. 
Strange, isn't it?

Regards,
Tobias
-- 
Never touch a burning system.



Re: [Nagios-users] Performance issues, too

2007-01-03 Thread Andreas Ericsson
Robert Hajime Lanning wrote:
> 
>> Just rechecked. After 72 hours nagios still runs perfectly
>> with an average service check latency of 0.3 seconds, max.
>> 0.9 seconds.
>>
>> Memory usage is perfectly "flat" now, with epn and perlcache
>> it went from 140 mb (whole system) to about 900 mb within 24h.
>>
>> The average system load is a bit _lower_ than before, but some
>> peaks higher than with epn/perlcache.
>>
>> I'll try pure epn without perlcache first thing in january.
> 
> The main reason for me to use ePN with perlcache, is to get
> around the huge load of loading all the MIBs for each SNMP
> query.  (Since 90% of my services are SNMP queries.)  I was
> looking for a way to load the MIB tree once, and found I could
> do it in p1.pl.
> 

If you use numeric SNMP OIDs rather than their "human-readable" MIB names, you 
don't need to load a single MIB. It is indeed a much simpler solution.
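For illustration, a minimal sketch of such a check (hypothetical host, 
community and OID; it assumes the Net::SNMP module from CPAN, which works 
purely on numeric OIDs and never loads a MIB):

  #!/usr/bin/perl
  use strict;
  use warnings;
  use Net::SNMP;

  # sysUpTime, given as a plain numeric OID - no MIB lookup needed
  my $oid = '1.3.6.1.2.1.1.3.0';

  my ($session, $error) = Net::SNMP->session(
      -hostname  => 'router.example.com',   # hypothetical target
      -community => 'public',
      -version   => 'snmpv2c',
  );
  unless (defined $session) {
      print "UNKNOWN: $error\n";
      exit 3;
  }

  my $result = $session->get_request(-varbindlist => [$oid]);
  unless (defined $result) {
      print "CRITICAL: ", $session->error, "\n";
      $session->close;
      exit 2;
  }

  print "OK: sysUpTime is $result->{$oid}\n";
  $session->close;
  exit 0;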

> For traps, I run snmptrapd (from net-snmp) and have just recently
> found it has a memory leak.  Over the course of 20 days, it grew
> from 5MB to 140MB.  It runs snmptthandler, which is actually a C
> program (I ported the Perl version to reduce the load during trap
> floods).
> 
> snmptt has a big memory leak.  I restart it every 6 hours.
> 
> This seems to be pointing to the net-snmp libraries.
> 
> Though, I don't get why it would really effect the nagios master
> process.  Since all the calls to the SNMP module are run in a
> subprocess, other than the initialization that I put into p1.pl.
> Unless p1.pl is executed more than once.
> 

strace -e open ./nagios  2>&1 | grep p1.pl

will tell you, although strace might not be included in the tool-box 
shipped with your system.

> Back when I had about 200 service checks, my load was about 1.5.
> Then I enabled ePN with perlcache and stuck in the "use SNMP"
> with the preload of the MIBs.  Load went down to 0.3.  But, as
> I added services, most SNMP, this issue showed up.
> 

Try without perlcache, and try with OID's and without your p1.pl hack.

-- 
Andreas Ericsson   [EMAIL PROTECTED]
OP5 AB www.op5.se
Tel: +46 8-230225  Fax: +46 8-230231



Re: [Nagios-users] Performance issues, too

2007-01-03 Thread Andreas Ericsson
Robert Hajime Lanning wrote:
> I have also been having performance issues with Nagios 2.5 on
> a Sun E220R with two 400MHz procs and 1GB ram.
> 
> Sys stats are at http://lanning.cc/kipper.html
> 
> The large dips in load and system CPU time are when I restart
> Nagios.  (cron'd twice a week, but I have also been making
> a lot of service updates lately, hence the almost once a day
> restarts.)  For the restarts to fix the latency, I have
> "use_retained_scheduling_info=0".
> 
> After about three days the Service Check latency will grow
> to over 300 seconds.  It is usually steady at around 0-5
> seconds, for a couple of days, then it will rise over the
> course of a few hours to over the 300 second mark.
> 

This is a bit bizarre and simply must be related to something else. Does 
Nagios run out of commandbuffer slots? Aren't they freed properly?

> 
> I have noticed the Nagios seems to have a memory leak.  As,
> I have watched over the last hour the process grow from 124M
> to 126M.
> 

This can probably be attributed to the fact that Nagios fork()s, then 
frees and allocates memory before running execve() in a thread. This 
isn't prohibited per se, but it is strongly discouraged. I wouldn't be 
surprised to find that other applications doing the same thing also 
leak memory on Sun. On Linux, threads are created in a 1-1 fashion 
(meaning each thread is actually its own process). This holds true for 
some other systems as well, and afaik there are 1-1 thread 
implementations for Sun too. In any case, the 1-1 model means that 
the kernel cleans up any left-over memory for the processes when they 
exit, which isn't necessarily the case in a one-to-many thread 
implementation. Possibly worth investigating.

> I use ePN with caching.  Most of my checks are SNMP requests
> via ePN scripts (http://lanning.cc/custom_plugins/), with
> p1.pl modified with:
> 
>   use SNMP 5.0;
>   SNMP::loadModules("ALL");
> 

Forgive a novice, but doesn't this make it load all SNMP submodules each 
time it runs a Perl module? That would certainly have a major impact on 
load and could well lead to memory leaks (assuming the submodules aren't 
always freed after having been loaded).

> We have put into our budget to move Nagios to a Linux/Intel
> server.  But, what bugs me is the high CPU time in kernel
> space, because of Nagios.
> 

Again, this is a behaviour not regularly experienced on Linux (which is 
the base for most Nagios installations). Linux is simply very, very good 
at fork(). It doesn't even bother trying to do other things properly 
(like 1-many threading), simply because it's so damn good at forking. It 
would be interesting to see if your problems go away when you move to 
Linux. I'm not saying it's superior to Solaris, but afaiu, Ethan runs 
all his tests on Linux and would certainly have found bugs of this kind 
if they had bitten him.

-- 
Andreas Ericsson   [EMAIL PROTECTED]
OP5 AB www.op5.se
Tel: +46 8-230225  Fax: +46 8-230231



Re: [Nagios-users] Performance issues, too

2007-01-01 Thread Daniel Meyer
Hi there, and happy new year :-)

Program Running Time: 10d 21h 22m 42s

So, Nagios has been running smoothly for almost eleven days now, no more 
latency problems. I'll try it again with EPN (but still without perlcache) now.

Danny
-- 
Q: Gentoo is too hard to install  =http://www.cyberdelia.de
and I feel like whining.   = [EMAIL PROTECTED]
A: Please see /dev/null.  =
   (from the gentoo installer FAQ) = \o/



Re: [Nagios-users] Performance issues, too

2006-12-28 Thread Robert Hajime Lanning


> On Mon, 25 Dec 2006, Robert Hajime Lanning wrote:
>> I have a few that use the output of the last check to see
>> differences in accumulators and the like.  And I see that
>> the caching code caches a parsed version of the arguments.
>> This caching has no expirations just appending the new
>> argument list.
>
> That might explain memory consumption, though one has to wonder
> if linear increase is fast enough to explain it. If the
> arguments get *doubled* everytime, though...

Ok, I have done two things.

1) removed caching of arguments (it now parses arguments every
   time, in the child process)

2) I have modified all my Perl-based checks to use "my" instead
   of "use vars", to scope the variables to the package that the
   service check is encapsulated in (see the sketch below).
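
For illustration, a minimal sketch of that second change
(hypothetical variable and value):

  #!/usr/bin/perl
  use strict;
  use warnings;

  # Before (removed): a package global, which lives in the package's
  # symbol table and sticks around inside a persistent interpreter:
  #   use vars qw($last_value);

  # After: a lexical, private to this check's own scope
  my $last_value;

  my $new_value = 42;    # hypothetical accumulator reading
  my $delta = defined $last_value ? $new_value - $last_value : 0;
  $last_value = $new_value;

  print "OK: delta=$delta\n";
  exit 0;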

So now my load seems to have dropped to about 1.5 from 2.2.
The CPU time in kernel space no longer grows in the curved
fashion seen in the graphs posted earlier.  It still grows,
but now more linearly and at a slower pace.

I started a cron job, every five minutes, that logs the size
of the master Nagios process in kilobytes. (pagesize is 8k)
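For reference, a rough sketch of this kind of logger (hypothetical
lock-file path; RSS as reported by ps, in kilobytes):

  #!/usr/bin/perl
  use strict;
  use warnings;
  use POSIX qw(strftime);

  # Read the master process PID from the lock file (path is install-specific)
  my $pidfile = '/usr/local/nagios/var/nagios.lock';
  open my $fh, '<', $pidfile or die "can't read $pidfile: $!";
  chomp(my $pid = <$fh>);
  close $fh;

  # Resident set size in kilobytes, header suppressed
  chomp(my $rss = `ps -o rss= -p $pid`);
  $rss =~ s/\s+//g;

  printf "[%s] %s\n", strftime('%a %b %d %H:%M:%S UTC %Y', gmtime), $rss;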

The drop from 14920 to 11920 was the cron'd restart of Nagios.
[Thu Dec 28 00:00:01 UTC 2006] 14920
[Thu Dec 28 00:05:00 UTC 2006] 11920
[Thu Dec 28 00:10:00 UTC 2006] 11960
[Thu Dec 28 00:15:00 UTC 2006] 11992
[Thu Dec 28 00:20:00 UTC 2006] 12000
[Thu Dec 28 00:25:00 UTC 2006] 12000
[Thu Dec 28 00:30:00 UTC 2006] 12000
[Thu Dec 28 00:35:00 UTC 2006] 12024
[Thu Dec 28 00:40:00 UTC 2006] 12032
[Thu Dec 28 00:45:00 UTC 2006] 12048
[Thu Dec 28 00:50:00 UTC 2006] 12056
[Thu Dec 28 00:55:00 UTC 2006] 12072

The service check cache code is implemented in p1.pl,
so I have been looking at it closely.

Once the check is compiled, there is a really short code
path in p1.pl for the Nagios master process.  So I think
the leak is more in ePN than in perlcache.

-- 
And, did Galoka think the Ulus were too ugly to save?
 -Centauri




Re: [Nagios-users] Performance issues, too

2006-12-26 Thread Tobias Klausmann
Hi! 

On Mon, 25 Dec 2006, Robert Hajime Lanning wrote:
> > I think the two issues are independent (or at most correlated).
> > If switching off EPN/perlcache fixes the issues for me, too, I'd
> > guess it's either the embedded Perl or the cache. Finding out
> > which is a matter of simple experimentation. I hope :)
> >
> 
> Does any of your checks have arguments that change?

No, I don't think so. If there's no implicit carry-over in a
plugin, we don't do that at all.

> I have a few that use the output of the last check to see
> differences in accumulators and the like.  And I see that
> the caching code caches a parsed version of the arguments.
> This caching has no expirations just appending the new
> argument list.

That might explain memory consumption, though one has to wonder
if linear increase is fast enough to explain it. If the
arguments get *doubled* every time, though...

> I am trying to comment out the caching of arguments and have
> the arguments parsed each time.

Good luck.

> > Merry christmas to the lot of you, btw.
> >
> > Regards,
> > Tobias
> > (away from work and Nagios 'til January 8th)
> 
> Merry Christmas, and I am too much a geek to leave this be,
> until January. :)  (Have to tinker...)

Oh, I do have my own private projects I can tinker with :)

Regards,
Tobias
-- 
Never touch a burning system.



Re: [Nagios-users] Performance issues, too

2006-12-25 Thread Robert Hajime Lanning


> I'm not using a single SNMP check, and I have the very same
> problem: so I'd say no.

Ok, separate issues... :)

> I think the two issues are independent (or at most correlated).
> If switching off EPN/perlcache fixes the issues for me, too, I'd
> guess it's either the embedded Perl or the cache. Finding out
> which is a matter of simple experimentation. I hope :)
>

Do any of your checks have arguments that change?

I have a few that use the output of the last check to see
differences in accumulators and the like.  And I see that
the caching code caches a parsed version of the arguments.
This caching has no expiration; it just keeps appending the new
argument list.
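In other words, a structure shaped roughly like this (an illustrative
sketch of the pattern, not the actual caching code) can only ever grow
when the arguments change between runs:

  use strict;
  use warnings;

  my %arg_cache;

  sub remember_args {
      my ($plugin, @args) = @_;
      # Appending instead of replacing: the cached list only ever grows
      push @{ $arg_cache{$plugin} }, \@args;
  }

  # Same plugin, changing arguments (e.g. last check's output passed back in)
  remember_args('check_foo', '--last', $_) for 1 .. 1000;
  printf "cached argument sets for check_foo: %d\n",
         scalar @{ $arg_cache{'check_foo'} };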

I am trying to comment out the caching of arguments and have
the arguments parsed each time.

> Merry christmas to the lot of you, btw.
>
> Regards,
> Tobias
> (away from work and Nagios 'til January 8th)

Merry Christmas, and I am too much a geek to leave this be,
until January. :)  (Have to tinker...)

-- 
And, did Galoka think the Ulus were too ugly to save?
 -Centauri




Re: [Nagios-users] Performance issues, too

2006-12-25 Thread Tobias Klausmann
Hi! 

On Mon, 25 Dec 2006, Robert Hajime Lanning wrote:

> 
> 
> > Just rechecked. After 72 hours nagios still runs perfectly
> > with an average service check latency of 0.3 seconds, max.
> > 0.9 seconds.
> >
> > Memory usage is perfectly "flat" now, with epn and perlcache
> > it went from 140 mb (whole system) to about 900 mb within 24h.
> >
> > The average system load is a bit _lower_ than before, but some
> > peaks higher than with epn/perlcache.
> >
> > I'll try pure epn without perlcache first thing in january.

(pardon my butting in here) I'll do that, too.

> The main reason for me to use ePN with perlcache, is to get
> around the huge load of loading all the MIBs for each SNMP
> query.  (Since 90% of my services are SNMP queries.)  I was
> looking for a way to load the MIB tree once, and found I could
> do it in p1.pl.
> 
> For traps, I run snmptrapd (from net-snmp) and have just recently
> found it has a memory leak.  Over the course of 20 days, it grew
> from 5MB to 140MB.  It runs snmptthandler, which is actually a C
> program (I ported the Perl version to reduce the load during trap
> floods).
> 
> snmptt has a big memory leak.  I restart it every 6 hours.
> 
> This seems to be pointing to the net-snmp libraries.

I'm not using a single SNMP check, and I have the very same
problem: so I'd say no.

> Though, I don't get why it would really effect the nagios master
> process.  Since all the calls to the SNMP module are run in a
> subprocess, other than the initialization that I put into p1.pl.
> Unless p1.pl is executed more than once.
> 
> Back when I had about 200 service checks, my load was about 1.5.
> Then I enabled ePN with perlcache and stuck in the "use SNMP"
> with the preload of the MIBs.  Load went down to 0.3.  But, as
> I added services, most SNMP, this issue showed up.

I think the two issues are independent (or at most correlated).
If switching off EPN/perlcache fixes the issues for me, too, I'd
guess it's either the embedded Perl or the cache. Finding out
which is a matter of simple experimentation. I hope :)

Merry christmas to the lot of you, btw. 

Regards,
Tobias
(away from work and Nagios 'til January 8th)

-- 
Never touch a burning system.



Re: [Nagios-users] Performance issues, too

2006-12-25 Thread Robert Hajime Lanning


> Just rechecked. After 72 hours nagios still runs perfectly
> with an average service check latency of 0.3 seconds, max.
> 0.9 seconds.
>
> Memory usage is perfectly "flat" now, with epn and perlcache
> it went from 140 mb (whole system) to about 900 mb within 24h.
>
> The average system load is a bit _lower_ than before, but some
> peaks higher than with epn/perlcache.
>
> I'll try pure epn without perlcache first thing in january.

The main reason for me to use ePN with perlcache is to get
around the huge load of loading all the MIBs for each SNMP
query.  (Since 90% of my services are SNMP queries.)  I was
looking for a way to load the MIB tree once, and found I could
do it in p1.pl.

For traps, I run snmptrapd (from net-snmp) and have just recently
found it has a memory leak.  Over the course of 20 days, it grew
from 5MB to 140MB.  It runs snmptthandler, which is actually a C
program (I ported the Perl version to reduce the load during trap
floods).

snmptt has a big memory leak.  I restart it every 6 hours.

This seems to be pointing to the net-snmp libraries.

Though, I don't get why it would really affect the Nagios master
process, since all the calls to the SNMP module are run in a
subprocess, other than the initialization that I put into p1.pl.
Unless p1.pl is executed more than once.

Back when I had about 200 service checks, my load was about 1.5.
Then I enabled ePN with perlcache and stuck in the "use SNMP"
with the preload of the MIBs.  Load went down to 0.3.  But as
I added services, most of them SNMP, this issue showed up.

-- 
And, did Galoka think the Ulus were too ugly to save?
 -Centauri




Re: [Nagios-users] Performance issues, too

2006-12-25 Thread Daniel Meyer
On Sun, 24 Dec 2006, Joerg Linge wrote:

>> I have watched over the last hour the process grow from 124M
>> to 126M.
>>
>> I use ePN with caching.  Most of my checks are SNMP requests
>> via ePN scripts (http://lanning.cc/custom_plugins/), with
>> p1.pl modified with:
>>
>>   use SNMP 5.0;
>>   SNMP::loadModules("ALL");
>
> This sounds like Daniels Problem.
Indeed.

> Two days ago we have compiled nagios without epn and perl cache.
> For now Nagios runs with a latency of 0.3 Secs.
Just rechecked. After 72 hours nagios still runs perfectly with an average 
service check latency of 0.3 seconds, max. 0.9 seconds.

Memory usage is perfectly "flat" now; with ePN and perlcache it went from 
140 MB (whole system) to about 900 MB within 24h.

The average system load is a bit _lower_ than before, but some peaks are 
higher than with ePN/perlcache.

I'll try pure ePN without perlcache first thing in January.

Danny
-- 
Q: Gentoo is too hard to install  =http://www.cyberdelia.de
and I feel like whining.   = [EMAIL PROTECTED]
A: Please see /dev/null.  =
   (from the gentoo installer FAQ) = \o/



Re: [Nagios-users] Performance issues, too

2006-12-24 Thread Joerg Linge
On Sunday, 24 December 2006 at 11:35, Robert Hajime Lanning wrote:
> I have also been having performance issues with Nagios 2.5 on
> a Sun E220R with two 400MHz procs and 1GB ram.
[...]

> I have noticed the Nagios seems to have a memory leak.  As,
> I have watched over the last hour the process grow from 124M
> to 126M.
>
> I use ePN with caching.  Most of my checks are SNMP requests
> via ePN scripts (http://lanning.cc/custom_plugins/), with
> p1.pl modified with:
>
>   use SNMP 5.0;
>   SNMP::loadModules("ALL");

This sounds like Daniel's problem.

Two days ago we compiled Nagios without ePN and the Perl cache.

So far, Nagios runs with a latency of 0.3 secs.

Jörg



Re: [Nagios-users] Performance issues, too

2006-12-24 Thread Robert Hajime Lanning
I have also been having performance issues with Nagios 2.5 on
a Sun E220R with two 400MHz procs and 1GB ram.

Sys stats are at http://lanning.cc/kipper.html

The large dips in load and system CPU time are when I restart
Nagios.  (cron'd twice a week, but I have also been making
a lot of service updates lately, hence the almost once a day
restarts.)  For the restarts to fix the latency, I have
"use_retained_scheduling_info=0".

After about three days the Service Check latency will grow
to over 300 seconds.  It is usually steady at around 0-5
seconds, for a couple of days, then it will rise over the
course of a few hours to over the 300 second mark.

My biggest issue with this is that RRDTool does not
like data points that far outside the expected time
intervals and will toss them.

I have noticed that Nagios seems to have a memory leak:
over the last hour I have watched the process grow from 124M
to 126M.

I use ePN with caching.  Most of my checks are SNMP requests
via ePN scripts (http://lanning.cc/custom_plugins/), with
p1.pl modified with:

  use SNMP 5.0;
  SNMP::loadModules("ALL");

We have put it in our budget to move Nagios to a Linux/Intel
server.  But what bugs me is the high CPU time in kernel
space caused by Nagios.

---
$ nagios -s etc/nagios.cfg

Nagios 2.5
Copyright (c) 1999-2006 Ethan Galstad (http://www.nagios.org)
Last Modified: 07-13-2006
License: GPL

Projected scheduling information for host and service
checks is listed below.  This information assumes that
you are going to start running Nagios with your current
config files.

HOST SCHEDULING INFORMATION
---
Total hosts: 83
Total scheduled hosts:   0
Host inter-check delay method:   SMART
Average host check interval: 0.00 sec
Host inter-check delay:  0.00 sec
Max host check spread:   3 min
First scheduled check:   N/A
Last scheduled check:N/A


SERVICE SCHEDULING INFORMATION
---
Total services: 693
Total scheduled services:   693
Service inter-check delay method:   SMART
Average service check interval: 192.12 sec
Inter-check delay:  0.26 sec
Interleave factor method:   SMART
Average services per host:  8.35
Service interleave factor:  9
Max service check spread:   3 min
First scheduled check:  Sun Dec 24 10:02:16 2006
Last scheduled check:   Sun Dec 24 10:05:15 2006


CHECK PROCESSING INFORMATION

Service check reaper interval:  5 sec
Max concurrent service checks:  Unlimited


PERFORMANCE SUGGESTIONS
---
I have no suggestions - things look okay.


-- 
And, did Galoka think the Ulus were too ugly to save?
 -Centauri




Re: [Nagios-users] Performance issues, too

2006-12-21 Thread Tobias Klausmann
Hi! 

On Thu, 21 Dec 2006, Daniel Meyer wrote:
> > I have the suspicion that our check latency might converge on 419
> > seconds - but I'd rather not test it, we'd be well beyond the
> > 300s-interval most of our checks are designed for.
> 
> Why do you think of exactly 419 seconds?
> 
> And btw, if our problems are related the latency wont stop at that number 
> :)

Because that's the new average check interval as reported by -s.
Yes, I'm out on a limb there.

Regards,
Tobias

-- 
Never touch a burning system.



Re: [Nagios-users] Performance issues, too

2006-12-21 Thread Daniel Meyer
On Thu, 21 Dec 2006, Tobias Klausmann wrote:

> I have the suspicion that our check latency might converge on 419
> seconds - but I'd rather not test it, we'd be well beyond the
> 300s-interval most of our checks are designed for.

What makes you think it would be exactly 419 seconds?

And btw, if our problems are related, the latency won't stop at that number 
:)

Danny
-- 
Q: Gentoo is too hard to install  =http://www.cyberdelia.de
and I feel like whining.   = [EMAIL PROTECTED]
A: Please see /dev/null.  =
   (from the gentoo installer FAQ) = \o/



Re: [Nagios-users] Performance issues, too

2006-12-21 Thread Tobias Klausmann
Hi! 

On Tue, 19 Dec 2006, Andreas Ericsson wrote:
> >>> SERVICE SCHEDULING INFORMATION
> >>> ---
> >>> Total services: 2836
> >>> Total scheduled services:   2836
> >>> Service inter-check delay method:   SMART
> >>> Average service check interval: 2225.56 sec
> >> This is, as you point out below, quite odd. What's your _longest_ 
> >> normal_check_interval for services?
> > 
> > The longest check_interval is 86400 seconds. It's a SSL cert
> > freshness check. I figured it wasn't necesseary to check that
> > more often than once a day. I also have check_intervals of 3, 5,
> > 15, 20, 30 and 1440 seconds. The latter is also a cert freshness
> > check which is lower because the customer wanted it to be that
> > short.
> > 
> 
> Try changing the really long intervals to something shorter or 
> commenting them out completely and see what happens. Checking a 
> certificate is not a particularly heavy operation so it doesn't matter 
> much if you run it ever 5 minutes. On the server side it just gets 
> handed out from cache, so it's not heave there either.

Actually, I was horribly wrong with that statement up there.

As it turned out, the check_interval was set to 86400. From that
I jumped to the conclusion "ah, one day" - familiar numbers do
that to you. But the base unit of check_interval isn't 1s, it's 1
minute. So the check_interval was 60 days. Fortunately, it was
only one such check which we quickly eliminated before producing
the second set of graphs I mentioned elsewhere in the thread.

Now, the longest check_interval truly is one day, 1440 minutes.
The average service check interval reported by -s is now 419
seconds. Still not terribly short, but it proves that the
86400-minute-monster was to blame for the 2200+ seconds.
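For what it's worth, the back-of-the-envelope numbers fit (assuming -s
simply averages all configured check_intervals, as Andreas described):
with the default interval_length of 60, a check_interval of 86400 means
86400 * 60 s = 5,184,000 s, so that one service alone contributed about
5,184,000 / 2836 = 1830 s to the average - essentially the whole gap
between the 2225 s reported before and the roughly 400 s the remaining
checks average out to.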

Changing those once-a-day checks to 5 minutes is an option, but
I'd rather wait a little to give everybody on the list some time
to look at the graphs and come up with nifty ideas.

I have the suspicion that our check latency might converge on 419
seconds - but I'd rather not test it, we'd be well beyond the
300s-interval most of our checks are designed for. 

> > Oops, forgot to mention that. Yes, a server farm is being rebuilt
> > currently. As I didn't want all the host check timeouts to make
> > matters much, much, worse, I disabled them entirely.
> > 
> 
> Ah, that explains it then. It shouldn't matter, but unless the 
> experiment I suggested above turns up anything useful, would you mind 
> commenting them out and testing that?

I'll do that if removing the day-spaced-checks doesn't help.


Regards & Thanks,
Tobias
-- 
Never touch a burning system.



Re: [Nagios-users] Performance issues, too

2006-12-21 Thread Tobias Klausmann
Hi! 

On Thu, 21 Dec 2006, Daniel Meyer wrote:
> - it is not triggered by any other software on the server
>(nagios and apache are the only things running there)

ACK.

> - its not triggered by hourly, daily or weekly cronjobs

With a lot of guessing and estimating, I can make a case for a
slight "plateau" right after the hour, with an increase in the
second half of the hour. Might be completely bogus, though.

> - the big service check latency goes away instantly after a restart
>of nagios

ACK.

> - the latency skyrockets after "some time", its not like "six hours
>after the restart" or something like that

Well, not so much skyrocketing as steadily creeping up. See the
images I reference below.

> - service check execution time does NOT change at all, it stays on
>the same level all the time

NACK. For me, it starts out at some low-two-digit ms time, then
creeps up to 165.000ms (yes, exactly that value). As far as I can
tell, it stays there forever.

> - changing from a dummy host check to "adaptive" host checks back and
>forth doesn't make a difference

We didn't try that.

> - i see memory usage rise proportional to the latency, but there is
>way enough free memory left (this morning it was 150 seconds latency
>but still 790 Megs free ram, plus one gig cached)

Same (with slightly different figures) here.

> - load on the system rises a little but not much

It's measurable, but definitely not maxed out. Same goes for CPU
utilization (which is something different).

> - network usage goes down (well there are less checks done due to the
>latency, so no surprise here)

We haven't checked that but as network traffic (both volume and
packet rate) wasn't near any limit, we didn't feel it was
necessary.

Here are a few graphs we created for yesterday and the day before
that:

http://eric.schwarzvogel.de/~klausman/nagios-perf-1/

and here are the pics of today and yesterday afternoon:

http://eric.schwarzvogel.de/~klausman/nagios-perf-2/

For all graphs, check frequency was every 2 minutes. For the
older set, a SNAFU on my part when setting up the RRDs resulted
in reduced resolution. That was fixed with the second set.

"Queue size" is calculated the following way: look at all objects
in the state file (retention.dat, saved every 20s). Every object
with a check time in the past counts as one queue entry.
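For illustration, one way such a count can be scripted (hypothetical
path; this assumes the state file carries a next_check=<epoch> line per
host/service block):

  use strict;
  use warnings;

  my $retention = '/usr/local/nagios/var/retention.dat';   # hypothetical path
  my $now       = time();
  my $overdue   = 0;

  open my $fh, '<', $retention or die "can't read $retention: $!";
  while (my $line = <$fh>) {
      # count every object whose scheduled check time is already in the past
      $overdue++ if $line =~ /^\s*next_check=(\d+)/ && $1 > 0 && $1 < $now;
  }
  close $fh;

  print "queue size: $overdue\n";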

"Slots"/"Checks completed" is a what nagiostats reports as # of
checks completed in the four timeframes.

Things I noted:

Queue size oscillates wildly. This might be due to my
methodology. Still, one can read a trend from that curve.

Check execution time converges at 106ms. On the spot. I have no
idea why.

Load average and CPU idleness indicate that we don't have a host
performance problem (I also looked over but did not plot stuff
like interrupt rate and context switches, nothing overly high,
there).

For the older graphs, check latency doesn't budge at all for
some time (or it's too little to see it). For the newer graph,
the initial rise is rather steep, then increase slows down a bit.
Still, over the course of hours, it seems linear and shows no
sign of converging.

If anybody is interested in the RRD files used to generate the
graphs, drop me a line.

The picture all of this paints is rather inconclusive. We've
found an oddity in our config I'll relate in another mail (a
check interval of 86400 minutes, that's two months). We have
eliminated that for the newer graphs, however.

In conclusion, I'm at a loss as to why this slow deterioration of
check performance happens. 

A colleague of mine is looking at the Nagios scheduling code (he
thinks the description of the algorithm in the docs is rather
strange). He hasn't reported back yet, though.

All in all, every hint is appreciated.

Regards,
Tobias




-- 
Never touch a burning system.



Re: [Nagios-users] Performance issues, too

2006-12-20 Thread Daniel Meyer
Ok,

This is what I noticed about my performance issues during the last few days:

- it is not triggered by any other software on the server
   (nagios and apache are the only things running there)

- it's not triggered by hourly, daily or weekly cronjobs

- the big service check latency goes away instantly after a restart
   of nagios

- the latency skyrockets after "some time"; it's not like "six hours
   after the restart" or something like that

- service check execution time does NOT change at all, it stays on
   the same level all the time

- changing from a dummy host check to "adaptive" host checks back and
   forth doesn't make a difference

- I see memory usage rise proportionally to the latency, but there is
   more than enough free memory left (this morning it was 150 seconds of
   latency but still 790 megs of free RAM, plus one gig cached)

- load on the system rises a little but not much

- network usage goes down (well there are less checks done due to the
   latency, so no surprise here)


Details of my setup can be found in the "big performance issue..." thread; 
if needed I can repost them here...

Danny
-- 
Q: Gentoo is too hard to install  =http://www.cyberdelia.de
and I feel like whining.   = [EMAIL PROTECTED]
A: Please see /dev/null.  =
   (from the gentoo installer FAQ) = \o/



Re: [Nagios-users] Performance issues, too

2006-12-19 Thread Tobias Klausmann
Hi! 

On Tue, 19 Dec 2006, Andreas Ericsson wrote:
> >>> ---
> >>> Total services: 2836
> >>> Total scheduled services:   2836
> >>> Service inter-check delay method:   SMART
> >>> Average service check interval: 2225.56 sec
> >> This is, as you point out below, quite odd. What's your _longest_ 
> >> normal_check_interval for services?
> > 
> > The longest check_interval is 86400 seconds. It's a SSL cert
> > freshness check. I figured it wasn't necesseary to check that
> > more often than once a day. I also have check_intervals of 3, 5,
> > 15, 20, 30 and 1440 seconds. The latter is also a cert freshness
> > check which is lower because the customer wanted it to be that
> > short.
> 
> Try changing the really long intervals to something shorter or 
> commenting them out completely and see what happens. Checking a 
> certificate is not a particularly heavy operation so it doesn't matter 
> much if you run it ever 5 minutes. On the server side it just gets 
> handed out from cache, so it's not heave there either.
> 
> If you have the various normal_check_interval's specified in templates, 
> try setting them all to 5 minutes and let Nagios run over-night. If this 
> interferes with some fragile services on the network (webservers whose 
> sessions don't expire, fe), disable active checks for those services 
> during the testing period.
> 
> (yes, this might seem braindead, but I really need to know if this bug 
> is still in Nagios).

I'll do that this afternoon; I'd just like to wait a little longer
to see what changes my kernel/CPU update brings (or doesn't).

> >>> *Or* it is indicative of a misconfiguration on my
> >>> part. If the latter is the case, I'd be eager, nay ecstatic to
> >>> hear what I did wrong. Here are a few of the config vars that
> >>> might influence this:
> >> There has been a slight thinko in Nagios. I don't know if it's still 
> >> there in recent CVS versions. The thinko is that it (used to?) calculate 
> >> average service check interval by adding up all normal_check_interval 
> >> values and dividing it by the number of services configured (or 
> >> something along those lines), which leads to long latencies. This 
> >> normally didn't make those latencies increase though. Humm...
> > 
> > Well, the numbers sure do get whacky after a restart: first it
> > skyrockets for about five minutes, then plummets to 1s. From
> > there it works its way up the way I described.
> 
> Are the first checks of things being scheduled with unreasonably long 
> delays? Fe, a check with 3 minute normal_check_interval being scheduled 
> an hour or so into the future.

Usually, yes. As I use state retention, I don't believe in the
initial numbers all that much. After about 5-10 minutes one can
usually make out a trend. Not this time, though. Here's hoping
that it keeps oscillating around the 8-9 seconds I currently see.

> >>> Total Services:   2836
> >>> Services Checked: 2836
> >>> Services Scheduled:   2758
> >>> Active Service Checks:2836
> >>> Passive Service Checks:   0
> >> All services aren't being scheduled, but you have no passive service 
> >> checks. Have you disabled checks of 78 services?
> > 
> > Oops, forgot to mention that. Yes, a server farm is being rebuilt
> > currently. As I didn't want all the host check timeouts to make
> > matters much, much, worse, I disabled them entirely.
> 
> Ah, that explains it then. It shouldn't matter, but unless the 
> experiment I suggested above turns up anything useful, would you mind 
> commenting them out and testing that?

I was planning to do that tomorrow for the very same reasons.

> >>> Hardware is a dual-2.8GHz Xeon, 2G RAM and a 100 FDX interface.
> >>> LoadAvg is around 1.6, sometimes gets to 1.9. CPUs are both
> >>> around 40% idle most of the time. I see about 300 context
> >>> switches and 500 interrupts per second. The network load is
> >>> neglible, ditto the packet rate.
> >>>
> >>> The way these figures look I don't see a performance problem per
> >>> se, but maybe I have overlooked a metric that descirbes the
> >>> "usual" bottleneck of installations.
> >>>
> >> Are the CPU's 64 bit ones running in 32-bit emulation mode? For intel 
> >> cpu's, that causes up to 60% performance loss (yes, it really is that bad).
> > 
> > Sheesh. Yes, it is a 32-bit installation. I only ever bothered
> > with 64-bit installs on Opteron hardware. I might look into
> > migrating to 64 bits, then.
> > 
> 
> So the CPU's are 64-bits? Humm... 64-bit mode would boost available 
> resources quite a bit, but as you just enabled HT you should now have 3 
> extra CPU's (Xeon's are dualcore AFAIR) which will probably set you safe 
> for a while.

A colleague just told me that this particular batch wasn't
available in 64 bits. So no, they're 32-bit; well, one thing to
test out of the way :-/

> >> I'm puzzled. Please let me know

Re: [Nagios-users] Performance issues, too

2006-12-19 Thread Tobias Klausmann
Hi! 

On Tue, 19 Dec 2006, Daniel Meyer wrote:
> >> You could lower this to 2 seconds. I've done so on any number of
> >> installations and it has no negative impact what so ever, but seems to
> >> make Nagios a bit more responsive.
> >
> > I'll give that a try.
> 
> I've tried that but had some failing checks when i did that. Very 
> strange...

I'm still waiting to see how the kernel change works out.

> > I also noticed that HT was disabled on the machine. I've changed
> > that (and added support for it to the kernel) when I did the
> > kernel upgrade today. I'll keep an eye on check latency.
> 
> I have HT enabled, no effect on the nagios latency problems.

I've now set up a little script that puts host and service check
latency in an RRD file every five minutes. So far, the curve
looks very inconclusive.

Regards,
Tobias
-- 
Never touch a burning system.



Re: [Nagios-users] Performance issues, too

2006-12-19 Thread Daniel Meyer
On Tue, 19 Dec 2006, Tobias Klausmann wrote:

> I'm running 2.6 now but I had the troubles with 2.5 initially.
> OS is a Gentoo Linux, Kernel 2.6.15.5 initially, upgrade to
> 2.6.19 today.

Same here: latency problems with both 2.5 and 2.6, but on CentOS 4.4 (good 
that you use Gentoo, that saves me the time of trying it on a heavily optimized 
Gentoo box :)

>> You could lower this to 2 seconds. I've done so on any number of
>> installations and it has no negative impact what so ever, but seems to
>> make Nagios a bit more responsive.
>
> I'll give that a try.

I've tried that, but had some failing checks when I did. Very 
strange...

> I also noticed that HT was disabled on the machine. I've changed
> that (and added support for it to the kernel) when I did the
> kernel upgrade today. I'll keep an eye on check latency.

I have HT enabled, no effect on the nagios latency problems.

Danny
-- 
Q: Gentoo is too hard to install  =http://www.cyberdelia.de
and I feel like whining.   = [EMAIL PROTECTED]
A: Please see /dev/null.  =
   (from the gentoo installer FAQ) = \o/



Re: [Nagios-users] Performance issues, too

2006-12-19 Thread Andreas Ericsson
Tobias Klausmann wrote:
> Hi! 
> 
> On Tue, 19 Dec 2006, Andreas Ericsson wrote:
>> Thanks for an excellently detailed problem report, missing only the 
>> Nagios version and system type/version info. I've got some comments and 
>> followup questions. See below.
> 
> I'm running 2.6 now but I had the troubles with 2.5 initially.
> OS is a Gentoo Linux, Kernel 2.6.15.5 initially, upgrade to
> 2.6.19 today.
> 
>>> ---
>>> Total hosts: 330
>>> Total scheduled hosts:   0
>> No scheduled host-checks. That's good, cause they interfere with normal 
>> operations in Nagios.
> 
> I've read as much. In my seperate mail I had a few questions
> about it, let's keep them (and the answers there ;)
> 
>>> Host inter-check delay method:   SMART
>>> Average host check interval: 0.00 sec
>>> Host inter-check delay:  0.00 sec
>>> Max host check spread:   10 min
>>> First scheduled check:   N/A
>>> Last scheduled check:N/A
>>>
>>>
>>> SERVICE SCHEDULING INFORMATION
>>> ---
>>> Total services: 2836
>>> Total scheduled services:   2836
>>> Service inter-check delay method:   SMART
>>> Average service check interval: 2225.56 sec
>> This is, as you point out below, quite odd. What's your _longest_ 
>> normal_check_interval for services?
> 
> The longest check_interval is 86400 seconds. It's a SSL cert
> freshness check. I figured it wasn't necesseary to check that
> more often than once a day. I also have check_intervals of 3, 5,
> 15, 20, 30 and 1440 seconds. The latter is also a cert freshness
> check which is lower because the customer wanted it to be that
> short.
> 

Try changing the really long intervals to something shorter or 
commenting them out completely and see what happens. Checking a 
certificate is not a particularly heavy operation, so it doesn't matter 
much if you run it every 5 minutes. On the server side it just gets 
handed out from cache, so it's not heavy there either.

If you have the various normal_check_interval values specified in templates, 
try setting them all to 5 minutes and let Nagios run overnight. If this 
interferes with some fragile services on the network (webservers whose 
sessions don't expire, for example), disable active checks for those services 
during the testing period.

(yes, this might seem braindead, but I really need to know if this bug 
is still in Nagios).
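
If it helps, a sketch of what I mean (assuming 2.x-style object configs
and the default interval_length of 60, so the value is in minutes):

  # hypothetical catch-all template used by the service definitions
  define service{
          name                    generic-service
          normal_check_interval   5     ; every 5 minutes during the test
          retry_check_interval    1
          register                0     ; template only, not a real service
          }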

> 
>>> *Or* it is indicative of a misconfiguration on my
>>> part. If the latter is the case, I'd be eager, nay ecstatic to
>>> hear what I did wrong. Here are a few of the config vars that
>>> might influence this:
>> There has been a slight thinko in Nagios. I don't know if it's still 
>> there in recent CVS versions. The thinko is that it (used to?) calculate 
>> average service check interval by adding up all normal_check_interval 
>> values and dividing it by the number of services configured (or 
>> something along those lines), which leads to long latencies. This 
>> normally didn't make those latencies increase though. Humm...
> 
> Well, the numbers sure do get whacky after a restart: first it
> skyrockets for about five minutes, then plummets to 1s. From
> there it works its way up the way I described.
> 

Are the first checks of things being scheduled with unreasonably long 
delays? For example, a check with a 3-minute normal_check_interval being 
scheduled an hour or so into the future.


>>> Total Services:   2836
>>> Services Checked: 2836
>>> Services Scheduled:   2758
>>> Active Service Checks:2836
>>> Passive Service Checks:   0
>> All services aren't being scheduled, but you have no passive service 
>> checks. Have you disabled checks of 78 services?
> 
> Oops, forgot to mention that. Yes, a server farm is being rebuilt
> currently. As I didn't want all the host check timeouts to make
> matters much, much, worse, I disabled them entirely.
> 

Ah, that explains it then. It shouldn't matter, but unless the 
experiment I suggested above turns up anything useful, would you mind 
commenting them out and testing that?

>>> Hardware is a dual-2.8GHz Xeon, 2G RAM and a 100 FDX interface.
>>> LoadAvg is around 1.6, sometimes gets to 1.9. CPUs are both
>>> around 40% idle most of the time. I see about 300 context
>>> switches and 500 interrupts per second. The network load is
>>> neglible, ditto the packet rate.
>>>
>>> The way these figures look I don't see a performance problem per
>>> se, but maybe I have overlooked a metric that descirbes the
>>> "usual" bottleneck of installations.
>>>
>> Are the CPU's 64 bit ones running in 32-bit emulation mode? For intel 
>> cpu's, that causes up to 60% performance loss (yes, it really is that bad).
> 
> Sheesh. Yes, it is a 32-bit installation. I only ever bothered
> with 64-bit installs on Opteron hardware. I might look into
> migrating to 64 bits, then.
> 

So the CPU's are 64-bits? Humm... 64-bit mode would boost available 
resources quite a bit, but as you just enabled HT you should now have 3 
extra CPU's (Xeon's are dualcore AFAIR) which will probably set you safe 
for a while.

Re: [Nagios-users] Performance issues, too

2006-12-19 Thread Tobias Klausmann
Hi! 

On Tue, 19 Dec 2006, Andreas Ericsson wrote:
> Thanks for an excellently detailed problem report, missing only the 
> Nagios version and system type/version info. I've got some comments and 
> followup questions. See below.

I'm running 2.6 now but I had the troubles with 2.5 initially.
OS is Gentoo Linux, kernel 2.6.15.5 initially, upgraded to
2.6.19 today.

> > ---
> > Total hosts: 330
> > Total scheduled hosts:   0
> 
> No scheduled host-checks. That's good, cause they interfere with normal 
> operations in Nagios.

I've read as much. In my separate mail I had a few questions
about it; let's keep them (and the answers) there ;)

> > Host inter-check delay method:   SMART
> > Average host check interval: 0.00 sec
> > Host inter-check delay:  0.00 sec
> > Max host check spread:   10 min
> > First scheduled check:   N/A
> > Last scheduled check:N/A
> > 
> > 
> > SERVICE SCHEDULING INFORMATION
> > ---
> > Total services: 2836
> > Total scheduled services:   2836
> > Service inter-check delay method:   SMART
> > Average service check interval: 2225.56 sec
> 
> This is, as you point out below, quite odd. What's your _longest_ 
> normal_check_interval for services?

The longest check_interval is 86400 seconds. It's an SSL cert
freshness check. I figured it wasn't necessary to check that
more often than once a day. I also have check_intervals of 3, 5,
15, 20, 30 and 1440 seconds. The latter is also a cert freshness
check which is lower because the customer wanted it to be that
short.

> > CHECK PROCESSING INFORMATION
> > 
> > Service check reaper interval:  5 sec
> 
> You could lower this to 2 seconds. I've done so on any number of 
> installations and it has no negative impact what so ever, but seems to 
> make Nagios a bit more responsive.

I'll give that a try.

> > Max concurrent service checks:  Unlimited
> 
> I assume you aren't running in to hardware limits on this machine. 
> What's the normal load when you're running nagios? If it's > NUM_CPUS 
> then you most likely don't have beefy enough hardware. That's hardly 
> ever the case though, so don't bother looking into it unless all else fails.
> 
> Nvm, question answered below. Hardware resources should be no problem 
> what so ever.

I also noticed that HT was disabled on the machine. I've changed
that (and added support for it to the kernel) when I did the
kernel upgrade today. I'll keep an eye on check latency.

> > *Or* it is indicative of a misconfiguration on my
> > part. If the latter is the case, I'd be eager, nay ecstatic to
> > hear what I did wrong. Here are a few of the config vars that
> > might influence this:
> 
> There has been a slight thinko in Nagios. I don't know if it's still 
> there in recent CVS versions. The thinko is that it (used to?) calculate 
> average service check interval by adding up all normal_check_interval 
> values and dividing it by the number of services configured (or 
> something along those lines), which leads to long latencies. This 
> normally didn't make those latencies increase though. Humm...

Well, the numbers sure do get whacky after a restart: first it
skyrockets for about five minutes, then plummets to 1s. From
there it works its way up the way I described.

> > Total Services:   2836
> > Services Checked: 2836
> > Services Scheduled:   2758
> > Active Service Checks:2836
> > Passive Service Checks:   0
> 
> All services aren't being scheduled, but you have no passive service 
> checks. Have you disabled checks of 78 services?

Oops, forgot to mention that. Yes, a server farm is being rebuilt
currently. As I didn't want all the host check timeouts to make
matters much, much, worse, I disabled them entirely.

> > Hardware is a dual-2.8GHz Xeon, 2G RAM and a 100 FDX interface.
> > LoadAvg is around 1.6, sometimes gets to 1.9. CPUs are both
> > around 40% idle most of the time. I see about 300 context
> > switches and 500 interrupts per second. The network load is
> > neglible, ditto the packet rate.
> > 
> > The way these figures look I don't see a performance problem per
> > se, but maybe I have overlooked a metric that descirbes the
> > "usual" bottleneck of installations.
> > 
> 
> Are the CPU's 64 bit ones running in 32-bit emulation mode? For intel 
> cpu's, that causes up to 60% performance loss (yes, it really is that bad).

Sheesh. Yes, it is a 32-bit installation. I only ever bothered
with 64-bit installs on Opteron hardware. I might look into
migrating to 64 bits, then.

> I'm puzzled. Please let me know if you find the answer to this problem. 
> I'll help you debug it as best I can, but please continue posting 
> on-list. Thanks.

Sure. I'll first check if the "processor upgrade" and kernel
update helped anything, then t

Re: [Nagios-users] Performance issues, too

2006-12-19 Thread Daniel Meyer
On Tue, 19 Dec 2006, Andreas Ericsson wrote:

> Are the CPU's 64 bit ones running in 32-bit emulation mode? For intel
> cpu's, that causes up to 60% performance loss (yes, it really is that bad).

I can only answer for my setup (which is almost identical, except that I 
have "only" 1700 service checks so far): my Xeon CPUs are pure 32-bit 
parts...

> I'm puzzled. Please let me know if you find the answer to this problem.
> I'll help you debug it as best I can, but please continue posting
> on-list. Thanks.

Me too; I am somewhat out of ideas...

Danny
-- 
Q: Gentoo is too hard to install  =http://www.cyberdelia.de
and I feel like whining.   = [EMAIL PROTECTED]
A: Please see /dev/null.  =
   (from the gentoo installer FAQ) = \o/



Re: [Nagios-users] Performance issues, too

2006-12-19 Thread Andreas Ericsson
Thanks for an excellently detailed problem report, missing only the 
Nagios version and system type/version info. I've got some comments and 
followup questions. See below.

Tobias Klausmann wrote:
> Hi! 
> 
> Recently I have run into the very same performance issues 
> as Daniel Meyer (or so it seems). However, I'm not quite sure
> about it. Here's the gist of it.
> 
> Currently, service check latency slowly creeps up. As it is now,
> it starts out at a little over 1s and after about 12 hours it's
> in the area of about 90s. It keeps climbing after that. 
> 
> Here's the output of nagios -s:
> 
> HOST SCHEDULING INFORMATION
> ---
> Total hosts: 330
> Total scheduled hosts:   0

No scheduled host checks. That's good, because regularly scheduled host 
checks interfere with normal operations in Nagios.
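
For anyone following along: in Nagios 2.x a host stays out of the regular
scheduling queue if its definition simply omits check_interval, so it is
only checked on demand when one of its services goes sour. A minimal
sketch, with a made-up host name and other required directives left out:

  define host{
          host_name               some-server     ; hypothetical
          check_command           check-host-alive
          max_check_attempts      3
          ; no check_interval, hence no regularly scheduled host checks
          }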

> Host inter-check delay method:   SMART
> Average host check interval: 0.00 sec
> Host inter-check delay:  0.00 sec
> Max host check spread:   10 min
> First scheduled check:   N/A
> Last scheduled check:N/A
> 
> 
> SERVICE SCHEDULING INFORMATION
> ---
> Total services: 2836
> Total scheduled services:   2836
> Service inter-check delay method:   SMART
> Average service check interval: 2225.56 sec


This is, as you point out below, quite odd. What's your _longest_ 
normal_check_interval for services?


> Inter-check delay:  0.21 sec
> Interleave factor method:   SMART
> Average services per host:  8.59
> Service interleave factor:  9
> Max service check spread:   10 min
> First scheduled check:  Tue Dec 19 11:21:45 2006
> Last scheduled check:   Tue Dec 19 11:31:47 2006
> 
> 
> CHECK PROCESSING INFORMATION
> 
> Service check reaper interval:  5 sec

You could lower this to 2 seconds. I've done so on any number of 
installations and it has no negative impact whatsoever, but it seems to 
make Nagios a bit more responsive.

> Max concurrent service checks:  Unlimited
> 

I assume you aren't running into hardware limits on this machine. 
What's the normal load when you're running Nagios? If it's > NUM_CPUS 
then you most likely don't have beefy enough hardware. That's hardly 
ever the case, though, so don't bother looking into it unless all else fails.

Never mind, question answered below. Hardware resources should be no problem 
whatsoever.
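
Rough back-of-the-envelope math, from the quoted scheduling output above
(illustrative only; the exact numbers depend on how long the slowest
plugins run):

  checks launched per second  ~ 1 / 0.21 s inter-check delay  ~ 4.8
  results collected every       5 s (reaper interval)
  checks in flight at once    ~ 4.8 * 5                       ~ 24

So even an explicit max_concurrent_checks of, say, 50 or 100 instead of
Unlimited would act purely as a safety valve against runaway forking
rather than as a throttle.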

> 
> This all looks peachy - I think. What I don't get is this line:
> 
> Average service check interval: 2225.56 sec
> 
> It seems to me that this is either a skewed value, stemming from
> my history of looong latencies (at one point we were beyond
> 9000 seconds).

Nopes. Nagios doesn't bother reading logfiles when it calculates the 
scheduling numbers.

> *Or* it is indicative of a misconfiguration on my
> part. If the latter is the case, I'd be eager, nay ecstatic to
> hear what I did wrong. Here are a few of the config vars that
> might influence this:
> 

There has been a slight thinko in Nagios. I don't know if it's still 
there in recent CVS versions. The thinko is that it (used to?) calculate 
average service check interval by adding up all normal_check_interval 
values and dividing it by the number of services configured (or 
something along those lines), which leads to long latencies. This 
normally didn't make those latencies increase though. Humm...
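
A purely made-up example of how a sum-and-divide average like that gets
skewed: nine services checked every 300 seconds plus one service checked
once a day gives

  (9 * 300 + 1 * 86400) / 10 = 8910 sec

as the "average", even though 90% of the checks run every five minutes.
One long straggler interval is enough to make the reported number look
alarming without anything actually being mis-scheduled.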


> sleep_time=0.25
> service_reaper_frequency=5
> max_concurrent_checks=0
> max_host_check_spread=10
> host_inter_check_delay_method=s
> service_interleave_factor=s
> command_check_interval=1
> obsess_over_services=0
> aggregate_status_updates=1
> status_update_interval=20
> 
> Also, here's the output from nagiostats:
> Nagios Stats 2.6
> Copyright (c) 2003-2005 Ethan Galstad (www.nagios.org)
> Last Modified: 11-27-2006
> License: GPL
> 
> CURRENT STATUS DATA
> 
> Status File:  /var/nagios/status.dat
> Status File Age:  0d 0h 0m 3s
> Status File Version:  2.6
> 
> Program Running Time: 0d 1h 59m 5s
> 
> Total Services:   2836
> Services Checked: 2836
> Services Scheduled:   2758
> Active Service Checks:2836
> Passive Service Checks:   0


Not all services are being scheduled, yet you have no passive service 
checks. Have you disabled checks of 78 services?
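
One quick way to count them, assuming the service definitions live under
/etc/nagios/ and the checks were switched off in the object config rather
than through the web interface (both assumptions):

  grep -r 'active_checks_enabled[[:space:]]*0' /etc/nagios/ | wc -l

which should come out at 78 if that is where the gap between 2836 total
and 2758 scheduled services comes from.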


> Total Service State Change:   0.000 / 12.370 / 0.007 %
> Active Service Latency:   0.006 / 10.237 / 0.906 sec
> Active Service Execution Time:0.047 / 10.159 / 0.180 sec
> Active Service State Change:  0.000 / 12.370 / 0.007 %
> Active Services Last 1/5/15/60 min:   477 / 2678 / 2745 / 2754
> Passive Service State Change: 0.000 / 0.000 / 0.000 %
> Passive Services Last 1/5/15/60 min:  0 / 0 / 0 / 0

[Nagios-users] Performance issues, too

2006-12-19 Thread Tobias Klausmann
Hi! 

Recently I have run into the very same performance issues 
as Daniel Meyer (or so it seems). However, I'm not quite sure
about it. Here's the gist of it.

Currently, service check latency slowly creeps up. As it is now,
it starts out at a little over 1s and after about 12 hours it's
in the area of about 90s. It keeps climbing after that. 

Here's the output of nagios -s:
Nagios 2.6
Copyright (c) 1999-2006 Ethan Galstad (http://www.nagios.org)
Last Modified: 11-27-2006
License: GPL

Warning: Contact group 'Singles-Truppe' is not used in any
host/service definitions or host/service escalations!
Projected scheduling information for host and service
checks is listed below.  This information assumes that
you are going to start running Nagios with your current
config files.

HOST SCHEDULING INFORMATION
---
Total hosts: 330
Total scheduled hosts:   0
Host inter-check delay method:   SMART
Average host check interval: 0.00 sec
Host inter-check delay:  0.00 sec
Max host check spread:   10 min
First scheduled check:   N/A
Last scheduled check:N/A


SERVICE SCHEDULING INFORMATION
---
Total services: 2836
Total scheduled services:   2836
Service inter-check delay method:   SMART
Average service check interval: 2225.56 sec
Inter-check delay:  0.21 sec
Interleave factor method:   SMART
Average services per host:  8.59
Service interleave factor:  9
Max service check spread:   10 min
First scheduled check:  Tue Dec 19 11:21:45 2006
Last scheduled check:   Tue Dec 19 11:31:47 2006


CHECK PROCESSING INFORMATION

Service check reaper interval:  5 sec
Max concurrent service checks:  Unlimited


PERFORMANCE SUGGESTIONS
---
I have no suggestions - things look okay.

This all looks peachy - I think. What I don't get is this line:

Average service check interval: 2225.56 sec

It seems to me that this is either a skewed value, stemming from
my history of looong latencies (at one point we were beyond
9000 seconds). *Or* it is indicative of a misconfiguration on my
part. If the latter is the case, I'd be eager, nay ecstatic to
hear what I did wrong. Here are a few of the config vars that
might influence this:

sleep_time=0.25
service_reaper_frequency=5
max_concurrent_checks=0
max_host_check_spread=10
host_inter_check_delay_method=s
service_interleave_factor=s
command_check_interval=1
obsess_over_services=0
aggregate_status_updates=1
status_update_interval=20

Also, here's the output from nagiostats:
Nagios Stats 2.6
Copyright (c) 2003-2005 Ethan Galstad (www.nagios.org)
Last Modified: 11-27-2006
License: GPL

CURRENT STATUS DATA

Status File:  /var/nagios/status.dat
Status File Age:  0d 0h 0m 3s
Status File Version:  2.6

Program Running Time: 0d 1h 59m 5s

Total Services:   2836
Services Checked: 2836
Services Scheduled:   2758
Active Service Checks:2836
Passive Service Checks:   0
Total Service State Change:   0.000 / 12.370 / 0.007 %
Active Service Latency:   0.006 / 10.237 / 0.906 sec
Active Service Execution Time:0.047 / 10.159 / 0.180 sec
Active Service State Change:  0.000 / 12.370 / 0.007 %
Active Services Last 1/5/15/60 min:   477 / 2678 / 2745 / 2754
Passive Service State Change: 0.000 / 0.000 / 0.000 %
Passive Services Last 1/5/15/60 min:  0 / 0 / 0 / 0
Services Ok/Warn/Unk/Crit:2814 / 6 / 0 / 16
Services Flapping:0
Services In Downtime: 0

Total Hosts:  330
Hosts Checked:330
Hosts Scheduled:  0
Active Host Checks:   330
Passive Host Checks:  0
Total Host State Change:  0.000 / 0.000 / 0.000 %
Active Host Latency:  0.000 / 1.000 / 0.888 sec
Active Host Execution Time:   0.030 / 4.059 / 0.112 sec
Active Host State Change: 0.000 / 0.000 / 0.000 %
Active Hosts Last 1/5/15/60 min:  0 / 12 / 12 / 12
Passive Host State Change:0.000 / 0.000 / 0.000 %
Passive Hosts Last 1/5/15/60 min: 0 / 0 / 0 / 0
Hosts Up/Down/Unreach:329 / 1 / 0
Hosts Flapping:   0
Hosts In Downtime:0

Hardware is a dual-2.8GHz Xeon, 2G RAM and a 100 FDX interface.
LoadAvg is around 1.6, sometimes gets to 1.9. CPUs are both
around 40% idle most of the time. I see about 300 context
switches and 500 interrupts per second. The network load is
negligible, ditto the packet rate.

The way these figures look, I don't see a performance problem per
se, but maybe I have overlooked a metric that describes the
"usual" bottleneck of installations.