Re: [Nagios-users] bizarre Nagios 2.12 memory leak
Have you checked where your memory is being used?

I had a similar problem, and found I had 30k+ nsca processes eating swap
and PIDs. The system would die one of two ways: OOM, or unable to spawn
new processes due to lack of PIDs. In my case, it turned out that
processing perfdata could block due to database problems, causing nagios
to bog down and nsca processes to back up in a major way. Finding and
fixing that was ... special.

Anyway, running out of swap is good to know, but is nagios using 16GB of
RAM? Or is it disappearing elsewhere?

Good Luck
--Rick

On Thu, Apr 15, 2010 at 4:13 PM, Andreas Ericsson wrote:
> On 04/15/2010 05:24 PM, Jeremy wrote:
>>
>> I know I really should get around to upgrading to Nagios 3.x but no time for
>> that yet and it's going to be a pain to upgrade them all at once without
>> being blind for a little bit, so pretend Nagios 3.x isn't an option just
>> yet.
>>
>
> The truth of the matter though is that no one really cares about fixing a
> problem in 2.12 unless it's also a problem in 3.2.1, and especially if
> it's a bug as hard to debug as this one. Insofar as I know, configuration
> files are compatible between those two revisions, so you could just use
> 3.2.1 as a drop-in replacement for 2.12.
>
> Some minor things have to be changed in nagios.cfg (and possibly cgi.cfg),
> but the bulk of the configuration should be ok the way it is.
>
> Since this is an otherwise intermittent error which may well depend on other
> variables (such as pthread library version, glibc library version or any
> other system library), it's well nigh impossible to debug without having
> you run Nagios through valgrind until it crashes due to lack of memory.
>
> Rest assured that that will keep your Nagios running crippled longer than
> an upgrade would.
>
> If the problem persists with Nagios 3.2.1, you should look into upgrading
> the rest of your system. If that doesn't help either, it's time to report
> it as a bug.

_______________________________________________
Nagios-users mailing list
Nagios-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting any issue.
::: Messages without supporting info will risk being sent to /dev/null
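One way to answer the "where is the memory going" question above is to total resident and swapped memory per process name, so that a pile-up of many small processes (such as the nsca backlog described) shows up next to one large nagios process. The following is only a rough sketch, assuming a Linux /proc filesystem; the helper names (parse_status, summarize, read_all_processes) are made up for this illustration, and older kernels may not report VmSwap at all, in which case swap is counted as 0.

```python
#!/usr/bin/env python3
"""Sum resident and swapped memory per process name from Linux /proc."""
import glob
from collections import defaultdict


def parse_status(text):
    """Return (name, rss_kb, swap_kb) from the text of /proc/<pid>/status."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()

    def kb(key):
        # Values look like "1234 kB"; kernel threads and older kernels
        # may omit VmRSS or VmSwap, so a missing field counts as 0.
        parts = fields.get(key, "0 kB").split()
        return int(parts[0]) if parts else 0

    return fields.get("Name", "?"), kb("VmRSS"), kb("VmSwap")


def summarize(records):
    """Aggregate (name, rss_kb, swap_kb) tuples into per-name totals."""
    totals = defaultdict(lambda: [0, 0, 0])  # count, rss_kb, swap_kb
    for name, rss_kb, swap_kb in records:
        entry = totals[name]
        entry[0] += 1
        entry[1] += rss_kb
        entry[2] += swap_kb
    # Largest combined footprint (RSS + swap) first.
    return sorted(
        ((name, c, r, s) for name, (c, r, s) in totals.items()),
        key=lambda row: row[2] + row[3],
        reverse=True,
    )


def read_all_processes():
    """Yield parsed records for every readable /proc/<pid>/status file."""
    for path in glob.glob("/proc/[0-9]*/status"):
        try:
            with open(path, errors="replace") as handle:
                yield parse_status(handle.read())
        except OSError:
            continue  # the process exited while we were scanning


if __name__ == "__main__":
    print(f"{'name':<20}{'count':>8}{'rss_MB':>10}{'swap_MB':>10}")
    for name, count, rss_kb, swap_kb in summarize(read_all_processes())[:15]:
        print(f"{name:<20}{count:>8}{rss_kb / 1024:>10.1f}{swap_kb / 1024:>10.1f}")
```

Run from cron every few minutes with the output appended to a file, this would show whether the growth is in nagios itself or in a growing count of some helper process. Run it as root to see all processes.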
Re: [Nagios-users] bizarre Nagios 2.12 memory leak
On 04/15/2010 05:24 PM, Jeremy wrote:
>
> I know I really should get around to upgrading to Nagios 3.x but no time for
> that yet and it's going to be a pain to upgrade them all at once without
> being blind for a little bit, so pretend Nagios 3.x isn't an option just
> yet.
>

The truth of the matter though is that no one really cares about fixing a
problem in 2.12 unless it's also a problem in 3.2.1, and especially if
it's a bug as hard to debug as this one. Insofar as I know, configuration
files are compatible between those two revisions, so you could just use
3.2.1 as a drop-in replacement for 2.12.

Some minor things have to be changed in nagios.cfg (and possibly cgi.cfg),
but the bulk of the configuration should be ok the way it is.

Since this is an otherwise intermittent error which may well depend on other
variables (such as pthread library version, glibc library version or any
other system library), it's well nigh impossible to debug without having
you run Nagios through valgrind until it crashes due to lack of memory.

Rest assured that that will keep your Nagios running crippled longer than
an upgrade would.

If the problem persists with Nagios 3.2.1, you should look into upgrading
the rest of your system. If that doesn't help either, it's time to report
it as a bug.

-- 
Andreas Ericsson                   andreas.erics...@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.
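For anyone who does want to try the valgrind route mentioned above, the sketch below only assembles and prints a command line; it does not start anything. The install paths and log file name are assumptions for illustration (adjust them to your own layout), and build_valgrind_command is a made-up helper name. Nagios is started without -d here so that it stays in the foreground under valgrind.

```python
#!/usr/bin/env python3
"""Build (but do not run) a valgrind memcheck command line for Nagios."""
import shlex


def build_valgrind_command(nagios_bin, nagios_cfg, log_file):
    """Return an argv list that runs Nagios in the foreground under memcheck."""
    return [
        "valgrind",
        "--leak-check=full",        # report each leaked block with a stack trace
        "--num-callers=30",         # deeper stack traces than the default
        "--log-file=" + log_file,   # keep valgrind output out of the terminal
        nagios_bin,
        nagios_cfg,                 # no -d: do not daemonize under valgrind
    ]


if __name__ == "__main__":
    # Hypothetical paths; these are assumptions, not taken from the thread.
    cmd = build_valgrind_command(
        "/usr/local/nagios/bin/nagios",
        "/usr/local/nagios/etc/nagios.cfg",
        "/var/tmp/nagios-valgrind.log",
    )
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Expect Nagios to run many times slower this way, which is the "running crippled" cost described above.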
Re: [Nagios-users] bizarre Nagios 2.12 memory leak
Did you check for zombie process growth and iowait on the CPU?

Ciao,
Giorgio

On 15 Apr 2010, at 17:24, Jeremy wrote:
> We have a large distributed setup running Nagios 2.12 with 20
> distributed servers sharing about 2 checks against 2500 hosts.
> They are reporting into multiple master Nagios servers using a
> modified OCP_daemon that handles multiple master servers. Recently
> we nearly doubled our number of distributed servers. Our number of
> checks had grown so we only were doing about 20-30% per minute on
> some of our most busy distributed servers. Now we are doing 90% per
> minute.
>
> Ever since we increased the frequency of all the checks, our oldest
> Master server has started crashing randomly every so often. Nothing
> else has changed. Memory use goes through the roof until eventually
> there is 0 swap left and the server finally crashes and has to be
> rebooted. If we restart the Nagios service while the memory usage is
> going crazy, it drops back down to normal for quite a while, but
> days later it will happen again. I started restarting Nagios on that
> server once an hour but it hasn't helped. We tried upgrading to 16
> GB of RAM which has made this happen a bit less often, but it
> continues to happen sometimes.
>
> We are using NPCD to graph the performance data from all of our
> checks, but all the graph .RRD files are on a dedicated partition,
> and the crashing happens even when we disable graphing completely
> and disk I/O is near 0% on both the system partitions and the graph
> partition.
>
> So I was wondering how I could go about figuring out why Nagios is
> freaking out on our older server (Dell PowerEdge 1950). Our other
> Master server (a Dell PowerEdge R710) gets all the same checks
> reported to it, and handles it just fine, but it is using much newer
> Xeon CPUs, faster memory, etc. The old crashing server handles
> things just fine for days at a time until it randomly runs itself
> out of swap space and crashes.
>
> I know I really should get around to upgrading to Nagios 3.x but no
> time for that yet and it's going to be a pain to upgrade them all at
> once without being blind for a little bit, so pretend Nagios 3.x
> isn't an option just yet.
>
> Thanks for any insight!
> Jeremy
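Both things asked about above can be read straight from /proc on Linux. This is a minimal sketch under that assumption: it counts processes in state Z and estimates the iowait share of CPU time over a one-second window from the aggregate "cpu" line of /proc/stat. The function names are invented for this example and the one-second interval is arbitrary.

```python
#!/usr/bin/env python3
"""Count zombie processes and estimate CPU iowait from Linux /proc."""
import glob
import os
import time


def parse_cpu_times(stat_text):
    """Return (iowait, total) jiffies from the aggregate 'cpu' line."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            values = [int(v) for v in parts[1:]]
            # Field order: user nice system idle iowait irq softirq steal ...
            # Guest time is already included in user, so sum only the first 8.
            return values[4], sum(values[:8])
    raise ValueError("no aggregate cpu line found")


def iowait_percent(before, after):
    """Percentage of CPU time spent in iowait between two samples."""
    d_io = after[0] - before[0]
    d_total = after[1] - before[1]
    return 100.0 * d_io / d_total if d_total > 0 else 0.0


def parse_state(pid_stat_text):
    """Return the one-letter process state from /proc/<pid>/stat text."""
    # The command name sits in parentheses and may itself contain spaces
    # or ')', so split on the last ')' and take the next field.
    return pid_stat_text.rsplit(")", 1)[1].split()[0]


def count_zombies():
    """Count processes currently in state Z (zombie)."""
    zombies = 0
    for path in glob.glob("/proc/[0-9]*/stat"):
        try:
            with open(path, errors="replace") as handle:
                if parse_state(handle.read()) == "Z":
                    zombies += 1
        except (OSError, IndexError):
            continue  # process vanished or the line was unreadable
    return zombies


def read_cpu_sample():
    with open("/proc/stat") as handle:
        return parse_cpu_times(handle.read())


if __name__ == "__main__" and os.path.exists("/proc/stat"):
    first = read_cpu_sample()
    time.sleep(1)
    second = read_cpu_sample()
    print(f"zombies: {count_zombies()}")
    print(f"iowait:  {iowait_percent(first, second):.1f}%")
```

Logging these two numbers periodically alongside memory use would show whether zombie counts or iowait climb before the box runs out of swap.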
[Nagios-users] bizarre Nagios 2.12 memory leak
We have a large distributed setup running Nagios 2.12 with 20 distributed
servers sharing about 2 checks against 2500 hosts. They are reporting into
multiple master Nagios servers using a modified OCP_daemon that handles
multiple master servers. Recently we nearly doubled our number of
distributed servers. Our number of checks had grown so we were only doing
about 20-30% per minute on some of our busiest distributed servers. Now we
are doing 90% per minute.

Ever since we increased the frequency of all the checks, our oldest Master
server has started crashing randomly every so often. Nothing else has
changed. Memory use goes through the roof until eventually there is 0 swap
left and the server finally crashes and has to be rebooted. If we restart
the Nagios service while the memory usage is going crazy, it drops back
down to normal for quite a while, but days later it will happen again. I
started restarting Nagios on that server once an hour but it hasn't helped.
We tried upgrading to 16 GB of RAM which has made this happen a bit less
often, but it continues to happen sometimes.

We are using NPCD to graph the performance data from all of our checks, but
all the graph .RRD files are on a dedicated partition, and the crashing
happens even when we disable graphing completely and disk I/O is near 0% on
both the system partitions and the graph partition.

So I was wondering how I could go about figuring out why Nagios is freaking
out on our older server (Dell PowerEdge 1950). Our other Master server (a
Dell PowerEdge R710) gets all the same checks reported to it, and handles
it just fine, but it is using much newer Xeon CPUs, faster memory, etc. The
old crashing server handles things just fine for days at a time until it
randomly runs itself out of swap space and crashes.

I know I really should get around to upgrading to Nagios 3.x but no time
for that yet and it's going to be a pain to upgrade them all at once
without being blind for a little bit, so pretend Nagios 3.x isn't an option
just yet.

Thanks for any insight!
Jeremy