Re: [Nagios-users] Nagios2 process overwhelmed by NSCA daemon?
Are you running the full nagios on the slaves? Do the checks seem to be working on those hosts?

Greg Pangrazio pangr...@gmail.com

On Wed, Dec 9, 2009 at 5:06 PM, Jonathan Call jc...@verio.net wrote:

I recently added two new slaves to a distributed Nagios system. The central server now passively processes 17,000+ service checks on 3000+ servers. It's been over an hour and a half since I brought those new slaves online and I have about 150 hosts still stuck in 'Pending' and about 1300 services in the same state.

In addition to that it seems that the service check results from the other slaves that were working normally are now arbitrarily disappearing. For example, on one host three of the service checks have been updated relatively recently (i.e. 5-30 minutes ago) but three other service checks haven't been updated for almost an hour. The slaves all appear operational and the hosts are being checked on time.

Is it possible I've overwhelmed Nagios' ability to process data from the NSCA daemon or struck some internal Nagios bottleneck? Any suggestions would be appreciated.

Jonathan

This email message is intended for the use of the person to whom it has been sent, and may contain information that is confidential or legally protected. If you are not the intended recipient or have received this message in error, you are not authorized to copy, distribute, or otherwise use this message or its attachments. Please notify the sender immediately by return e-mail and permanently delete this message and any attachments. Verio, Inc. makes no warranty that this email is error or virus free. Thank you.

-- Return on Information: Google Enterprise Search pays you back Get the facts. http://p.sf.net/sfu/google-dev2dev
___
Nagios-users mailing list Nagios-users@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting any issue.
::: Messages without supporting info will risk being sent to /dev/null
Re: [Nagios-users] Nagios2 process overwhelmed by NSCA daemon?
In my last job, I was dealing with a Nagios install a little bigger than yours: 4K servers and just over 24K service checks, with 12 or 13 distributed servers.

I ran into many kinds of problems because of Nagios' poor design for distributed monitoring. It appears that the distributed setup was added almost as a patch, just to overcome some limitation.

We ended up writing some custom passive plugins. They were built to send status updates only on a state change. That way the load on the NSCA side was greatly reduced (it was load balanced behind a virtual IP, with batch updates, but problems would still occur). This set of plugins was a little hard to maintain, because the configuration for each server had to live on the monitored server; puppet ftw. All checks were logged and later synchronized with NDO to keep a history of the last checks.

NDO and the database schema had to be modified too. The volume of inserts was far too high to be handled in a timely manner: recurrent restarts of the database caused stale results, and there was every sort of problem in managing those systems, even after a thorough tuning of the database. After adding logic to update only when a state change occurred, plus another batch update for the last check time and the fields that depend on it, the database load normalized and scalability could be proven.

So what I'd suggest is to first apply the large-installation tweaks: tmpfs for status.dat, objects.cache and retention.dat, batch jobs that send send_nsca output to the central/master Nagios instance, and so on. You can also do some Nagios setup magic, having the distributed nodes check at a frequency (normal_check_interval) different from what the central Nagios expects: say, set up the central Nagios to wait for status information every 30 minutes, but have the distributed nodes send it every 15 minutes, something like that.

As far as I know, it's a really cumbersome job to get an enterprise-scale Nagios configuration. For tiny and trivial installs it's like using Zennoss or Zabbixx, but with a lot of extra configuration-file pain. I don't think any competing tool (Z*bbnn*ssxx) would scale either when you need huge enterprise installs, so Nagios is a little ahead and gives flexibility, but with an associated cost that scares anyone (who then ends up buying another tool that does much less for much more).

That's why I like Gabès Jean's Shinken approach to scalability and to easing interoperability with puppet. That would be the über-super-mega-ultra tool. Also, with nginx and an asynchronous front end, back end, and checks, it would end up as the most robust, easy, enterprise NMS.

So, Jean, continue on that path to keep Shinken backward compatible with Nagios setups, but also think ahead in the design to have puppet integrated to handle configuration convergence (maybe event handlers too?).

Cheers, M
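The "send only on state change, in batches" idea from the message above can be sketched roughly as follows. This is a minimal illustration, not the plugins described in the post: the function names, the in-memory cache, and the sample hosts are all assumptions. The only external detail relied on is that send_nsca reads tab-separated lines of host, service description, return code, and plugin output on standard input. The separate batch update of last-check fields that the post mentions is left out.

```python
from typing import Dict, Iterable, List, Tuple

# (host, service) -> last return code that was forwarded (hypothetical cache)
StateCache = Dict[Tuple[str, str], int]

# One check result: (host, service, return_code, plugin_output)
CheckResult = Tuple[str, str, int, str]


def changed_results(results: Iterable[CheckResult], cache: StateCache) -> List[CheckResult]:
    """Keep only results whose return code differs from the last one forwarded."""
    changed = []
    for host, service, code, output in results:
        key = (host, service)
        if cache.get(key) != code:
            # Remember the new state so the next identical result is skipped.
            cache[key] = code
            changed.append((host, service, code, output))
    return changed


def to_send_nsca_batch(results: Iterable[CheckResult]) -> str:
    """Render results as tab-separated lines, one per result, for send_nsca's stdin."""
    return "".join(
        f"{host}\t{service}\t{code}\t{output}\n"
        for host, service, code, output in results
    )


if __name__ == "__main__":
    cache: StateCache = {}
    # Hypothetical sample data: two polling rounds for the same host.
    first = [("web01", "HTTP", 0, "OK"), ("web01", "DISK", 1, "WARNING - 85% used")]
    second = [("web01", "HTTP", 0, "OK"), ("web01", "DISK", 2, "CRITICAL - 97% used")]
    print(to_send_nsca_batch(changed_results(first, cache)), end="")
    print(to_send_nsca_batch(changed_results(second, cache)), end="")
```

In the second round only the DISK result is emitted, because HTTP stayed in the same state. If the central server uses freshness checking, suppressing unchanged results like this would need to be paired with some periodic refresh, which is presumably what the second batch update in the post was for.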
[Nagios-users] check_snmp with regular expression
List,

I am trying to use the check_snmp plugin with the following regular expression and I am getting an error. Can someone point out what I am doing wrong? Thanks.

/usr/lib64/nagios/plugins/check_snmp -H hostname -C community -o .1.3.6.1.2.1.1.6.0 -r ^*.some string*$

Could Not Compile Regular Expression
check_snmp: Could not parse arguments

-- Cordially, Shadhin Rahman
Re: [Nagios-users] check_snmp with regular expression
Did you mean ^*.some string.*$? Notice the period before the second *.

Greg Pangrazio pangr...@gmail.com
Re: [Nagios-users] check_snmp with regular expression
It looks like you're trying to match some string, no matter where it appears in the document. In that case, anchoring to line beginning and end is just extra work. Simply match on some string, and you're good to go.

The asterisk is a modifier to the dot, so it needs to come after that. So the regex you pasted should probably be ^.*some string.*$, but this is functionally equivalent to some string.

Regards, Martin Melin
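The point about the asterisk can be checked with a short script. This is only an analogy: it uses Python's re module, which is not the regex engine check_snmp uses, so the exact error behaviour of the plugin may differ. The helper names and sample strings are made up for illustration.

```python
import re


def compiles(pattern: str) -> bool:
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


def same_matches(pattern_a: str, pattern_b: str, samples) -> bool:
    """Return True if both patterns match exactly the same sample strings."""
    return all(
        bool(re.search(pattern_a, s)) == bool(re.search(pattern_b, s))
        for s in samples
    )


if __name__ == "__main__":
    # Hypothetical single-line values, as an SNMP string OID might return.
    samples = ["some string", "rack 4, some string, row 2", "something else", ""]
    print(compiles("^*.some string*$"))    # pattern from the original post
    print(compiles("^.*some string.*$"))   # form with the asterisk after the dot
    print(same_matches("^.*some string.*$", "some string", samples))
```

In Python, an asterisk directly after ^ has nothing to repeat, so any pattern that starts with ^* is rejected, while ^.*some string.*$ compiles. On single-line input the anchored form and the bare some string select the same values. When running the plugin from a shell, quoting the pattern is also worth considering, since an unquoted * or a space can be interpreted by the shell before check_snmp sees it.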
Re: [Nagios-users] Nagios2 process overwhelmed by NSCA daemon?
Yes, full Nagios is running on the slaves. They use OCP_daemon to pass data on to the central server, since the NSCA client can't hack the load. They seem to be sending data properly to the NSCA daemon.

I've tracked part of the issue down to status.cgi. The central server appears to be underpowered when Nagios has to process data AND several people are pounding out host/service status queries from the web interface. I will be adding another CPU to see if this helps. However, I'm dismayed that Nagios on the central server doesn't seem to be reporting any errors, or indicating that there is any problem processing passive results. Nagios just starts to lose the data at a certain point.

Jonathan
[Nagios-users] Nagios as a Service Resiliency Manager
Hi all,

I have a need to control an Active / Passive pair of components and was wondering if anyone had tackled this problem with Nagios?

The scenario is as follows: Host A has SERVICE_1 installed and running. Host B has SERVICE_2 installed, but not running. The desired functionality is to detect when SERVICE_1 is not running (or that Host A is down / unreachable), and then to start SERVICE_2 on Host B.

I believe I can do this with Nagios by defining an event handler on SERVICE_1 which will make the appropriate call to start SERVICE_2 on Host B. Would it make sense to store the relationship between SERVICE_1 and Host B / SERVICE_2 as a service macro, e.g. $_SERVICE_PASSIVE_HOSTNAME, $_SERVICE_PASSIVE_SERVICENAME?

I believe there are too many scenarios in which SERVICE_1 might come back up to try to automate the switching off of SERVICE_2. For example, someone accidentally pulls a network cable on Host A, then plugs it back in 15 minutes later, during which time Nagios detects that it is down and starts up SERVICE_2. Once the cable is plugged back in we have two Active instances running, which is what we specifically wanted to avoid. Even if Nagios detects that the primary component is up, it's still too late, because any Active / Active overlap will cause problems for this particular application.

I can't think of any way to automate that side of things, but does the general concept of having Nagios start up a Passive partner make sense?

Thanks for any insight you have,
Chris
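The decision logic of such an event handler might look roughly like the sketch below. It assumes the handler is given the values of the standard $SERVICESTATE$ and $SERVICESTATETYPE$ macros (or $HOSTSTATE$ and $HOSTSTATETYPE$ for a host handler) plus the standby host and service names, however those end up being stored. The function names are hypothetical, and the actual start mechanism (ssh, NRPE, or something else) is deliberately left as a placeholder string.

```python
from typing import Optional

# States treated as "primary has failed". CRITICAL is a service state;
# DOWN and UNREACHABLE are host states, included so the same check could
# back a host event handler (an assumption of this sketch).
FAILED_STATES = {"CRITICAL", "DOWN", "UNREACHABLE"}


def should_start_standby(state: str, state_type: str) -> bool:
    """Start the standby only once the failure is confirmed (HARD state)."""
    return state in FAILED_STATES and state_type == "HARD"


def handle(state: str, state_type: str,
           standby_host: str, standby_service: str) -> Optional[str]:
    """Return a description of the action to take, or None for no action.

    Recovery never triggers an action: stopping either instance is left
    to an operator, to avoid automating the Active / Active overlap case.
    """
    if should_start_standby(state, state_type):
        # Placeholder only; the real start command is site-specific.
        return f"start {standby_service} on {standby_host}"
    return None


if __name__ == "__main__":
    print(handle("CRITICAL", "SOFT", "HostB", "SERVICE_2"))
    print(handle("CRITICAL", "HARD", "HostB", "SERVICE_2"))
    print(handle("OK", "HARD", "HostB", "SERVICE_2"))
```

Waiting for a HARD state means a single failed check does not start the standby. It does not remove the overlap risk described in the message: if the primary was only cut off from the network, it is still running when the standby starts.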
Re: [Nagios-users] Nagios as a Service Resiliency Manager
Maybe this would help: http://onlamp.com/onlamp/2006/05/25/self-healing-networks.html
[Nagios-users] obsessive acknowledgment processing
Hi,

We are currently forwarding checks from multiple Nagios sites into a central location to create a consolidated view for our operations team. Some sites have their own operations teams as well, who acknowledge issues from time to time.

I set up a contact attached to all services and created a simple notification command that fires an external command on the central server. This works great for checks with notifications enabled, but if notifications are disabled for the service, it obviously does not forward the acknowledgement.

I looked for an obvious way to work around this but did not find one. Is there anything that works similarly to ocsp but includes acknowledgements?

Thanks,
Cris
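For reference, the external command that such a notification command would need to deliver to the central server can be built as in the sketch below. It does not solve the trigger problem raised above (nothing fires when notifications are disabled); it only illustrates the ACKNOWLEDGE_SVC_PROBLEM line format. The function name is hypothetical, and the default values for sticky, notify, and persistent are illustrative assumptions that should be checked against the external command documentation for the Nagios version in use.

```python
import time
from typing import Optional


def ack_service_command(host: str, service: str, author: str, comment: str,
                        sticky: int = 1, notify: int = 0, persistent: int = 1,
                        now: Optional[int] = None) -> str:
    """Build one ACKNOWLEDGE_SVC_PROBLEM line for the external command file."""
    timestamp = int(time.time()) if now is None else now
    # Fields are separated by semicolons, so this sketch conservatively
    # replaces them (and newlines) in the free-text values.
    safe_author = author.replace(";", ",")
    safe_comment = comment.replace(";", ",").replace("\n", " ")
    return (f"[{timestamp}] ACKNOWLEDGE_SVC_PROBLEM;{host};{service};"
            f"{sticky};{notify};{persistent};{safe_author};{safe_comment}\n")


if __name__ == "__main__":
    # Hypothetical host, service, and author names.
    print(ack_service_command("web01", "HTTP", "cris", "site team is on it"), end="")
```

How the resulting line reaches the central server's command file (for example over ssh, or through a small relay script) is left open here, since the post does not say which transport is used.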