Re: [Nagios-users] bizarre Nagios 2.12 memory leak

2010-04-16 Thread Rick Mangus
Have you checked where your memory is being used?  I had a similar
problem, and found I had 30k+ processes of nsca eating swap and PIDs.
The system would die one of two ways:  OOM or unable to spawn new
processes due to lack of PIDs.

In my case, it turned out that processing perfdata could block due to
database problems, causing nagios to bog down and nsca processes to
back up in a major way.  Finding and fixing that was ... special.

Anyway, running out of swap is good to know, but is nagios using 16GB
of RAM?  Or is it disappearing elsewhere?
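One rough way to answer that question is to total resident and swapped memory per process name, which also shows whether something like nsca has piled up in the thousands. The sketch below is only an illustration and assumes a Linux /proc filesystem; the function names are made up for this example, and the VmSwap field is missing on older kernels, in which case it is counted as 0.

```python
import os
from collections import defaultdict


def parse_status(text):
    """Extract (name, rss_kb, swap_kb) from the text of /proc/<pid>/status."""
    fields = {}
    for line in text.splitlines():
        key, _, value = line.partition(":")
        fields[key] = value.strip()

    def kb(key):
        # Kernel threads lack VmRSS and older kernels lack VmSwap: count as 0.
        parts = fields.get(key, "").split()
        return int(parts[0]) if parts else 0

    return fields.get("Name", "?"), kb("VmRSS"), kb("VmSwap")


def summarize(entries):
    """Group (name, rss_kb, swap_kb) tuples by name, largest consumers first."""
    totals = defaultdict(lambda: [0, 0, 0])  # count, rss_kb, swap_kb
    for name, rss, swap in entries:
        row = totals[name]
        row[0] += 1
        row[1] += rss
        row[2] += swap
    rows = [(name, c, r, s) for name, (c, r, s) in totals.items()]
    return sorted(rows, key=lambda row: row[2] + row[3], reverse=True)


def scan(proc_root="/proc"):
    """Yield parsed status entries for every process currently visible."""
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            path = os.path.join(proc_root, entry, "status")
            with open(path, errors="replace") as fh:
                yield parse_status(fh.read())
        except OSError:
            continue  # the process exited while we were scanning


if __name__ == "__main__" and os.path.isdir("/proc"):
    print(f"{'name':<20}{'procs':>8}{'rss_kb':>12}{'swap_kb':>12}")
    for name, count, rss, swap in summarize(scan())[:15]:
        print(f"{name:<20}{count:>8}{rss:>12}{swap:>12}")
```

If nagios itself is not near the top of that list, the memory is going somewhere else, and the process count column shows whether a helper daemon is backing up.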

Good Luck

--Rick

On Thu, Apr 15, 2010 at 4:13 PM, Andreas Ericsson  wrote:
> On 04/15/2010 05:24 PM, Jeremy wrote:
>>
>> I know I really should get around to upgrading to Nagios 3.x but no time for
>> that yet and it's going to be a pain to upgrade them all at once without
>> being blind for a little bit, so pretend Nagios 3.x isn't an option just
>> yet.
>>
>
> The truth of the matter though is that no one really cares about fixing a
> problem in 2.12 unless it's also a problem in 3.2.1, and especially if
> it's a bug as hard to debug as this one. Insofar as I know, configuration
> files are compatible between those two revisions, so you could just use
> 3.2.1 as a drop-in replacement for 2.12.
>
> Some minor things have to be changed in nagios.cfg (and possibly cgi.cfg),
> but the bulk of the configuration should be ok the way it is.
>
> Since this is an otherwise intermittent error which may well depend on other
> variables (such as pthread library version, glibc library version or any
> other system library), it's well nigh impossible to debug without having
> you run Nagios through valgrind until it crashes due to lack of memory.
>
> Rest assured that that will keep your Nagios running crippled longer than
> an upgrade would.
>
> If the problem persists with Nagios 3.2.1, you should look into upgrading
> the rest of your system. If that doesn't help either, it's time to report
> it as a bug.
>
> --
> Andreas Ericsson                   andreas.erics...@op5.se
> OP5 AB                             www.op5.se
> Tel: +46 8-230225                  Fax: +46 8-230231
>
> Considering the successes of the wars on alcohol, poverty, drugs and
> terror, I think we should give some serious thought to declaring war
> on peace.
>
> --
> Download Intel® Parallel Studio Eval
> Try the new software tools for yourself. Speed compiling, find bugs
> proactively, and fine-tune applications for parallel performance.
> See why Intel Parallel Studio got high marks during beta.
> http://p.sf.net/sfu/intel-sw-dev
> ___
> Nagios-users mailing list
> Nagios-users@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nagios-users
> ::: Please include Nagios version, plugin version (-v) and OS when reporting 
> any issue.
> ::: Messages without supporting info will risk being sent to /dev/null
>



Re: [Nagios-users] bizarre Nagios 2.12 memory leak

2010-04-15 Thread Andreas Ericsson
On 04/15/2010 05:24 PM, Jeremy wrote:
> 
> I know I really should get around to upgrading to Nagios 3.x but no time for
> that yet and it's going to be a pain to upgrade them all at once without
> being blind for a little bit, so pretend Nagios 3.x isn't an option just
> yet.
> 

The truth of the matter though is that no one really cares about fixing a
problem in 2.12 unless it's also a problem in 3.2.1, and especially if
it's a bug as hard to debug as this one. Insofar as I know, configuration
files are compatible between those two revisions, so you could just use
3.2.1 as a drop-in replacement for 2.12.

Some minor things have to be changed in nagios.cfg (and possibly cgi.cfg),
but the bulk of the configuration should be ok the way it is.

Since this is an otherwise intermittent error which may well depend on other
variables (such as pthread library version, glibc library version or any
other system library), it's well nigh impossible to debug without having
you run Nagios through valgrind until it crashes due to lack of memory.

Rest assured that that will keep your Nagios running crippled longer than
an upgrade would.
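For anyone who does want to try the valgrind route, the command line could be assembled roughly as below. This is a sketch under assumptions: the binary, config and log paths are placeholders and not taken from this thread, and the helper name is invented. The only valgrind options used are `--leak-check=full` and `--log-file`, and Nagios is started without `-d` so that it stays in the foreground under valgrind.

```python
import shlex


def valgrind_command(nagios_bin, nagios_cfg, log_file):
    """Build an argv list that runs Nagios in the foreground under valgrind."""
    return [
        "valgrind",
        "--leak-check=full",        # report each leaked block with a stack trace
        "--log-file=" + log_file,   # %p in the name expands to the process ID
        nagios_bin,
        nagios_cfg,                 # no -d flag: do not daemonize
    ]


# Placeholder paths (assumptions); adjust them to the local installation.
cmd = valgrind_command(
    "/usr/local/nagios/bin/nagios",
    "/usr/local/nagios/etc/nagios.cfg",
    "/tmp/nagios-valgrind.%p.log",
)
print(" ".join(shlex.quote(part) for part in cmd))
```

The printed line can be pasted into a shell, or the list can be passed to `subprocess.run` directly.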

If the problem persists with Nagios 3.2.1, you should look into upgrading
the rest of your system. If that doesn't help either, it's time to report
it as a bug.

-- 
Andreas Ericsson   andreas.erics...@op5.se
OP5 AB www.op5.se
Tel: +46 8-230225  Fax: +46 8-230231

Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.



Re: [Nagios-users] bizarre Nagios 2.12 memory leak

2010-04-15 Thread Giorgio Zarrelli
Did you check for growth in zombie processes and for iowait on the CPU?
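Both numbers can be read straight from /proc on Linux. The following is a minimal sketch, not a polished tool: the function names are invented for this example, and it assumes the usual field layout of /proc/<pid>/stat and of the aggregate "cpu" line in /proc/stat.

```python
import os
import time


def state_from_stat(stat_line):
    """Return the one-letter process state from a /proc/<pid>/stat line."""
    # The command name may contain spaces or ')', so split on the last ')'.
    return stat_line.rsplit(")", 1)[1].split()[0]


def count_zombies(stat_lines):
    """Count the lines whose process state is Z (zombie)."""
    return sum(1 for line in stat_lines if state_from_stat(line) == "Z")


def cpu_times(stat_text):
    """Return the counters from the aggregate 'cpu' line of /proc/stat."""
    for line in stat_text.splitlines():
        if line.startswith("cpu "):
            return [int(v) for v in line.split()[1:]]
    raise ValueError("no aggregate cpu line found")


def iowait_percent(before, after):
    """Share of CPU time spent in iowait between two cpu_times() samples."""
    # Fields: user nice system idle iowait irq softirq steal. Guest time is
    # already included in user, so only the first eight fields are summed.
    deltas = [b - a for a, b in zip(before[:8], after[:8])]
    total = sum(deltas)
    return 100.0 * deltas[4] / total if total else 0.0


def read_stat_lines(proc_root="/proc"):
    """Collect the stat line of every process currently visible."""
    lines = []
    for entry in os.listdir(proc_root):
        if entry.isdigit():
            try:
                path = os.path.join(proc_root, entry, "stat")
                with open(path, errors="replace") as fh:
                    lines.append(fh.read())
            except OSError:
                pass  # the process exited while we were scanning
    return lines


if __name__ == "__main__" and os.path.exists("/proc/stat"):
    with open("/proc/stat") as fh:
        first = cpu_times(fh.read())
    time.sleep(1)
    with open("/proc/stat") as fh:
        second = cpu_times(fh.read())
    print("zombies:", count_zombies(read_stat_lines()))
    print("iowait: %.1f%%" % iowait_percent(first, second))
```

Run it every minute or so and keep the output; a zombie count that climbs steadily before a crash would point at child processes that are never reaped.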

Ciao,

Giorgio

On 15 Apr 2010, at 17:24, Jeremy wrote:

> We have a large distributed setup running Nagios 2.12 with 20  
> distributed servers sharing about 2 checks against 2500 hosts.  
> They are reporting into multiple master Nagios servers using a  
> modified OCP_daemon that handles multiple master servers. Recently  
> we nearly doubled our number of distributed servers. Our number of  
> checks had grown so we only were doing about 20-30% per minute on  
> some of our most busy distributed servers. Now we are doing 90% per  
> minute.
>
> Ever since we increased the frequency of all the checks, our oldest  
> Master server has started crashing randomly every so often. Nothing  
> else has changed. Memory use goes through the roof until eventually  
> there is 0 swap left and the server finally crashes and has to be  
> rebooted. If we restart the Nagios service while the memory usage is  
> going crazy, it drops back down to normal for quite a while, but  
> days later it will happen again. I started restarting Nagios on that  
> server once an hour but it hasn't helped. We tried upgrading to 16  
> GB of RAM which has made this happen a bit less often, but it  
> continues to happen sometimes.
>
> We are using NPCD to graph the performance data from all of our  
> checks, but all the graph .RRD files are on a dedicated partition,  
> and the crashing happens even when we disable graphing completely  
> and disk I/O is near 0% on both the system partitions and the graph  
> partition.
>
> So I was wondering how I could go about figuring out why Nagios is  
> freaking out on our older server (Dell PowerEdge 1950). Our other  
> Master server (a Dell PowerEdge R710) gets all the same checks  
> reported to it, and handles it just fine, but it is using much newer  
> Xeon CPUs, faster memory, etc. The old crashing server handles  
> things just fine for days at a time until it randomly runs itself  
> out of swap space and crashes.
>
> I know I really should get around to upgrading to Nagios 3.x but no  
> time for that yet and it's going to be a pain to upgrade them all at  
> once without being blind for a little bit, so pretend Nagios 3.x  
> isn't an option just yet.
>
> Thanks for any insight!
> Jeremy



[Nagios-users] bizarre Nagios 2.12 memory leak

2010-04-15 Thread Jeremy
We have a large distributed setup running Nagios 2.12 with 20 distributed
servers sharing about 2 checks against 2500 hosts. They are reporting
into multiple master Nagios servers using a modified OCP_daemon that handles
multiple master servers. Recently we nearly doubled our number of
distributed servers. Our number of checks had grown so we only were doing
about 20-30% per minute on some of our most busy distributed servers. Now we
are doing 90% per minute.

Ever since we increased the frequency of all the checks, our oldest Master
server has started crashing randomly every so often. Nothing else has
changed. Memory use goes through the roof until eventually there is 0 swap
left and the server finally crashes and has to be rebooted. If we restart
the Nagios service while the memory usage is going crazy, it drops back down
to normal for quite a while, but days later it will happen again. I started
restarting Nagios on that server once an hour but it hasn't helped. We tried
upgrading to 16 GB of RAM which has made this happen a bit less often, but
it continues to happen sometimes.
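To see how fast swap fills up in the run-up to a crash, one option is to log swap usage at regular intervals and line it up with the Nagios logs afterwards. The sketch below is an assumption-laden illustration, not something from our setup: it reads the Linux /proc/meminfo file, and the function names and output format are invented for this example.

```python
import os
import time


def parse_meminfo(text):
    """Map each /proc/meminfo key to its numeric value (kB for most keys)."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            values[key] = int(parts[0])
    return values


def swap_used_percent(meminfo):
    """Percentage of swap in use, or 0.0 when no swap is configured."""
    total = meminfo.get("SwapTotal", 0)
    if total == 0:
        return 0.0
    free = meminfo.get("SwapFree", total)
    return 100.0 * (total - free) / total


def log_line(meminfo, now=None):
    """Format one timestamped line; now=None means the current time."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    return "%s swap_used=%.1f%% mem_free_kb=%d" % (
        stamp,
        swap_used_percent(meminfo),
        meminfo.get("MemFree", 0),
    )


if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as fh:
        print(log_line(parse_meminfo(fh.read())))
```

Running this from cron once a minute and appending the output to a file gives a timeline that shows whether the growth is gradual or sudden.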

We are using NPCD to graph the performance data from all of our checks, but
all the graph .RRD files are on a dedicated partition, and the crashing
happens even when we disable graphing completely and disk I/O is near 0% on
both the system partitions and the graph partition.

So I was wondering how I could go about figuring out why Nagios is freaking
out on our older server (Dell PowerEdge 1950). Our other Master server (a
Dell PowerEdge R710) gets all the same checks reported to it, and handles it
just fine, but it is using much newer Xeon CPUs, faster memory, etc. The old
crashing server handles things just fine for days at a time until it
randomly runs itself out of swap space and crashes.

I know I really should get around to upgrading to Nagios 3.x but no time for
that yet and it's going to be a pain to upgrade them all at once without
being blind for a little bit, so pretend Nagios 3.x isn't an option just
yet.

Thanks for any insight!
Jeremy