Re: [Nagios-users] Hostgroup tricks?
From: Tim AtLee [mailto:t.at...@cfertech.com]
Sent: Tuesday, November 08, 2011 9:46 AM
To: nagios-users@lists.sourceforge.net
Subject: [Nagios-users] Hostgroup tricks?

Hello

I have a hostgroup defined as:

    define hostgroup {
        hostgroup_name  ping-servers
        alias           Pingable hosts
        members         *
    }

I have recently added a host outside our firewall that has ping disabled. I have changed the host's check_command to be 'check_tcp!80' so that the host won't be offline permanently, but I am wondering if there is a way to exclude this host from the 'ping-servers' hostgroup in the host definition? Ideally, something like:

    define host {
        host_name   outsidefirewallhost
        alias       Host outside firewall
        address     some.ip.address
        use         generic-host
        hostgroup!ping-servers
    }

This generates an error when I test the configuration.

The only way I have been able to achieve this is to change the ping-servers hostgroup definition to exclude this individual host (*,!outsidefirewallhost), but I'd rather keep the exclusion defined in the host, not in the "blanket rule".

Maybe it's just me being OCD... but is this possible?

Thanks,
Tim

Tim,

I'm a little unclear about your question. Are you trying to alter the "Host Check Command" for a single host definition? That is, the method used by Nagios to determine if a host is up or not? If that's the case, you can just override the definition for that one host:

    define host {
        host_name      outsidefirewallhost
        alias          Host outside firewall
        address        some.ip.address
        check_command  check-tcp-port-80
        use            generic-host
    }

Check the docs for information on precedence: your "generic-host" template will specify a check_command (usually ping or, better yet, fping), but defining a different value in the host definition itself overrides it for that specific host.
If that's not what you mean, and you want to change a specific service to check everything in that hostgroup except that one host, that would look something like:

    define service {
        hostgroup_name       ping-servers
        host_name            !outsidefirewallhost
        service_description  My Service
        check_command        run-a-ping
        use                  generic-service
    }

Hopefully I've understood your question...

Mark

___
Nagios-users mailing list
Nagios-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting any issue.
::: Messages without supporting info will risk being sent to /dev/null
Re: [Nagios-users] Suggestions for event correlation managers?
Splunk perhaps?

Mark

From: Furnish, Trever G [tgfurn...@herffjones.com]
Sent: Tuesday, August 09, 2011 12:30 AM
To: Nagios Users List
Subject: Re: [Nagios-users] Suggestions for event correlation managers?

Anyone? C'mon, don't be shy! :-)

--
Trever

From: Furnish, Trever G [tgfurn...@herffjones.com]
Sent: Friday, August 05, 2011 4:45 PM
To: nagios-users@lists.sourceforge.net
Cc: Boeglin, Adam R
Subject: [Nagios-users] Suggestions for event correlation managers?

Hello,

I'm looking for suggestions for applying Nagios' style of event handling (escalations, recoveries, acknowledgements), hopefully with some improvements (aggregation), to events coming from many different (non-Nagios) sources.

I know of a few Nagios-specific notification aggregators, but can anyone recommend a good (preferably inexpensive / OSS) way of expanding that to include many other tools? I know about SNARE and RiverMuse, but they're relatively expensive.

We make heavy use of Nagios as well as several other tools (MSFT SCOM, HP SIM, Oracle Grid Control, AlertSite.net, etc). They're all sending alerts in various forms to a small group of admins and engineers, so many of us get alerts from all of the tools, sometimes from more than one tool regarding a single event.

Nagios does a great job of flexibly managing alerts from its own events, but I don't see how I'd hook in the other tools. Several of the tools (e.g. SCOM and SIM) don't even have any concept of event correlation -- breakage and recovery are two separate events.

I see tools like SNARE, RiverMuse ECM, and a few others filling this gap, at least partially, but I don't yet have experience with them and they're relatively expensive.

Anyone doing this effectively with OSS tools or low-cost tools or a good home-grown approach you wouldn't mind sharing (and possibly collaborating on)?

--
Trever Furnish, tgfurn...@herffjones.com
Herff Jones, Inc. Solutions Architect
Phone: 317.612.3519
Re: [Nagios-users] Services are dependent on the host they run on?
Maybe I'm missing something, but I thought that suppressing notifications for services on the same host when the host goes down is the default behavior. It's only when you have to suppress notifications from different hosts that you need host/service dependencies.

Mark

-----Original Message-----
From: Assaf Flatto [mailto:nag...@flatto.net]
Sent: Thursday, May 26, 2011 1:39 PM
To: Nagios Users List
Subject: Re: [Nagios-users] Services are dependent on the host they run on?

Martin Hugo wrote:
> Hi Robi,
>
> I have never done it but I know you can make hosts/services children that
> will not report if the parent is down.
>
> Hope this puts you on the right track.
>
> Marty
>
> -----Original Message-----
> From: Roberto Nunnari [mailto:roberto.nunn...@supsi.ch]
> Sent: Thursday, May 26, 2011 12:47 PM
> To: nagios-users@lists.sourceforge.net
> Subject: [Nagios-users] Services are dependent on the host they run on?
>
> Hi all.
>
> Some time ago, I've installed and configured nagios to monitor our IT
> infrastructure.
>
> It works very well and we're happy with it.
>
> There's still one problem though:
> When a host goes down, nagios sends notifications not only for host
> down, but also for all services running on that host. When a host goes
> down, I would like nagios to only send notifications about the host
> down, and not for all the services running on that host.
>
> How can I achieve that?
> May it be a configuration error from my side? I thought that to nagios,
> services would be dependent on the host running them..
>
> Any hint/advice/guidance is very welcome.
>
> Thank you and best regards.
>
> Robi
>

check out service dependencies
http://nagios.sourceforge.net/docs/3_0/dependencies.html
Re: [Nagios-users] Nagios Core & Remedy Ticketing Integration
We don't use Remedy, but another ticketing system. In our case, the app (or someone who worked with the app) created a command-line script that you can use to create the actual ticket. I then created an event handler for a failing service to call that command-line utility to create the ticket.

You could really do this via a notification as well, depending on what you want, as long as you rewrite the notification command to call your "make me a ticket with these parameters" program instead of mail.

I believe the key piece here, either way, is whatever method Remedy gives you for opening a ticket from a command line or, worst case, via some web interface that you have Nagios log in to in some automated fashion.

Mark

From: steve f [mailto:a31mod...@hotmail.com]
Sent: Wednesday, February 23, 2011 1:23 PM
To: nagios-users@lists.sourceforge.net
Subject: [Nagios-users] Nagios Core & Remedy Ticketing Integration

Hello All,

Has anyone integrated Nagios with BMC Remedy for ticket creation? I am looking at ARCPerl for this since our Remedy infrastructure is not set up to receive e-mails. Has anyone tried this? Any info / horror stories would be appreciated.

Thanks,
Steve
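As a rough illustration of the event-handler approach described above, here is a minimal sketch in Python. It assumes the handler is passed the standard Nagios macros ($HOSTNAME$, $SERVICEDESC$, $SERVICESTATE$, $SERVICESTATETYPE$, $SERVICEOUTPUT$) as arguments. The ticket program path and its --summary/--details options are hypothetical placeholders, not part of Remedy or any real tool; substitute whatever command line your ticketing system provides.

```python
import subprocess

# Hypothetical ticket CLI and options; replace with your ticketing system's own tool.
TICKET_CMD = "/usr/local/bin/create-ticket"


def should_open_ticket(state, state_type):
    """Open a ticket only once the service has reached a HARD CRITICAL state."""
    return state == "CRITICAL" and state_type == "HARD"


def build_ticket_command(host, service, output):
    """Build the argument list for the (hypothetical) ticket CLI."""
    summary = "%s/%s is CRITICAL" % (host, service)
    return [TICKET_CMD, "--summary", summary, "--details", output]


def handle_event(host, service, state, state_type, output, run=subprocess.call):
    """Invoke the ticket CLI when appropriate; return True if it was invoked.

    Event handlers run on every state change (including recoveries and SOFT
    states), so the handler itself has to decide when to act.
    """
    if not should_open_ticket(state, state_type):
        return False
    run(build_ticket_command(host, service, output))
    return True
```

A wrapper script would call handle_event() with the values Nagios passes on the command line. The `run` parameter exists only so the ticket call can be swapped out when trying the logic without a real ticket system.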
Re: [Nagios-users] check if SAP login is possible
Werner,

I can't say that I'm an expert at any of these methods, but there are a few possibilities you might explore.

- WebInject. It allows you to write these kinds of request/response scripts that walk through interaction with a website, including a login. There's even some stuff about using it directly with Nagios on their site.

- Perl and the WWW::Mechanize module. This allows you to do something similar to WebInject by writing your own Perl script that interacts with a website, including "pressing" buttons, etc. I would also recommend Ton Voon's spiffy Nagios::Plugin module to handle the Nagios plugin duties.

In either case, you would probably want to create some dummy/test user to attempt the login with.

Mark

-----Original Message-----
From: Werner Flamme [mailto:werner.fla...@ufz.de]
Sent: Wednesday, February 23, 2011 8:15 AM
To: nagios-users@lists.sourceforge.net
Subject: [Nagios-users] check if SAP login is possible

Hi everyone,

last week I had a new problem - all Nagios checks of the SAP systems succeeded, but no one was able to log in or to work inside SAP. The users got a timeout message, but remained logged in. The usual checks via check_sap_cons still delivered their standard output.

How can I check if a SAP login is possible or not? As a first step, I check the https:// login screen (with check_http), but how can I check that a user may log in after seeing that screen?

BTW, the reason was a shared filesystem full to the brim...

Regards,
Werner
Re: [Nagios-users] CPU monitor for a single Linux user space process ?
> -----Original Message-----
> From: Bruce Edge [mailto:bruce.e...@gmail.com]
> Sent: Wednesday, December 15, 2010 8:06 PM
>
> Rookie question here. Trying to determine nagios suitability for an
> embedded app.
>
> Can I monitor the CPU utilization for a single user space process on a
> Linux box with nagios?
> And, can I define an action if it exceeds a threshold?
>
> Thanks
>
> -Bruce

Bruce,

I'm not sure that there's an existing check plugin that would do this (there might be). I can say that "yes", you can do this; it's just a question of what you're willing to do.

If I were to do this for our environment, I'd write a perl script that used the 'ps' command to look at the process and pull the 'pcpu' field (% cpu -- see the 'ps' man page) for that process. I'd also use the Nagios::Plugin perl module to make the Nagios side easier and probably report the actual pcpu value as performance data suitable for graphing.

You could then configure an event handler on that service check. That's essentially another script that gets called when the state changes on the check. This means it gets called any time the state changes, including when it goes to an "OK" state, so the script needs to detect why it was called and potentially exit if the check hasn't gone into a hard critical state (depending on what you want, actually). You can read up on event handlers in the Nagios documentation.

Mark
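The plugin idea above can be sketched roughly as follows. This is written in Python rather than the Perl and Nagios::Plugin combination suggested in the message, purely as an illustration. The script name, the PROC_CPU label and the PID/WARN/CRIT argument order are assumptions made for this example. Note also that the 'pcpu' value from ps is typically CPU time divided by the process's running time, not an instantaneous reading.

```python
import subprocess

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def read_pcpu(pid):
    """Return the 'pcpu' value that ps reports for the given PID, as a float."""
    out = subprocess.run(
        ["ps", "-o", "pcpu=", "-p", str(pid)],  # 'pcpu=' suppresses the header line
        capture_output=True, text=True, check=True,
    ).stdout
    return float(out.strip())


def evaluate(pcpu, warn, crit):
    """Map a pcpu value to (exit_code, status_line), including performance data."""
    if pcpu >= crit:
        code, label = CRITICAL, "CRITICAL"
    elif pcpu >= warn:
        code, label = WARNING, "WARNING"
    else:
        code, label = OK, "OK"
    line = "PROC_CPU %s - %.1f%% cpu | pcpu=%.1f%%;%.1f;%.1f" % (
        label, pcpu, pcpu, warn, crit)
    return code, line


def main(argv):
    """Hypothetical command line: check_proc_cpu.py PID WARN CRIT."""
    if len(argv) != 4:
        print("PROC_CPU UNKNOWN - usage: %s PID WARN CRIT" % argv[0])
        return UNKNOWN
    try:
        pcpu = read_pcpu(int(argv[1]))
        code, line = evaluate(pcpu, float(argv[2]), float(argv[3]))
    except (ValueError, subprocess.CalledProcessError, OSError) as exc:
        print("PROC_CPU UNKNOWN - %s" % exc)
        return UNKNOWN
    print(line)
    return code
```

To use this as a plugin, the script would end by calling sys.exit(main(sys.argv)) so that Nagios sees the exit code. The text after the "|" is the performance data that a grapher could pick up.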
[Nagios-users] converting distributed Nagios setup to Nagios+Merlin
Our site currently uses a somewhat traditional distributed Nagios setup. I'm setting up Merlin on some new Nagios servers and am looking at what configurations I'm going to want to change. As part of that, I realize that there are some Nagios config directives that I wanted some clarification on before I started changing things. I haven't seen these documented elsewhere (at least not that I could find).

I was looking for clarification on the following:

1) Obsessive (ocsp/ochp) configuration directives get turned off. Merlin does all that. Plus ocsp/ochp is deemed detrimental to performance, making that another reason to turn it off.

2) Freshness checking. Nagios would probably still try to do this if I left it in, but there's no point since Merlin will also do this.

3) Passive/Active checks. If I understand things correctly, under Merlin everything is an active check. Or rather, anything that Nagios is supposed to run on some host or another is an active check. Things that are truly sent via NSCA from some monitored host out there would still be passive, but otherwise everything's configured to run actively and Merlin takes care of where it runs.

4) In a load balanced/redundant configuration (such as 'yoda' and 'obi1' in the HOWTO doc), which of 'yoda' or 'obi1' sends out notifications? Or do they both send them out but Merlin somehow only has one of them send it? I'm guessing that this is handled in the more traditional way where notifications are enabled on, say, 'yoda' but disabled on 'obi1'. If 'yoda' crashes, you manually enable the alerts via the command file on 'obi1'? It would of course be super-cool if Merlin handled all that :-).

5) Other parameters such as:
   process_perf_data - still probably only on the master(s), but that's really up to how crazy we'd want to be.
   event handler settings - unchanged by this configuration
   retain status information - unchanged by this configuration

Thanks

Mark
Re: [Nagios-users] distributed nagios ?
> -----Original Message-----
> From: Andreas Ericsson [mailto:a...@op5.se]
> Sent: Wednesday, December 15, 2010 4:46 AM
>
> On 12/14/2010 08:39 PM, Frost, Mark {PBC} wrote:
>>
>> Hooray!
>>
>> Actually, I wanted to point out a few things I found when building the
>> most recent version of merlin recently. At the heart of my issues
>> is that our team is not allowed root access on these servers (long boring
>> corporate story...) so I'm installing everything in an alternate tree.
>>
>> 1) There are a couple of hard-coded paths in ipc.c and node.c for
>> the socket and the binlogs. I'm assuming that's intentional, but it
>> does mean one has to manually edit the source files to point to different
>> paths rather than specifying anything like that during the build process.
>>
>
> The socket location can be configured. Binlogs cannot. I'll amend that in
> the next release though. The core functionality is there, but there's no
> option to set it in the config files, which is kinda stupid.

"Binlogs cannot" meaning it can't be moved without modifying the code directly, right? Because that's what I did :-).

>> 2) Because we're trying to put all the files into an alternate tree, the
>> installation of 'mon' from install-merlin.sh didn't really work right.
>
> Yes. The install-merlin.sh script is designed to be usable from the
> rpm spec file, and it's meant to aid people who want to install
> everything in its default location. Would $root_path/$bindir/mon
> work for you? Since you can set $root_path to whatever you want,
> I suppose it should.

Yes, I believe that would work for me. I'm not setting $root_path at all.

Thanks, Andreas.

Mark
Re: [Nagios-users] distributed nagios ?
> -----Original Message-----
> From: Andreas Ericsson [mailto:a...@op5.se]
> Sent: Tuesday, December 14, 2010 4:49 AM
> To: nagios List; doc...@yahoo.com
> Subject: Re: [Nagios-users] distributed nagios ?
>
>> Any pointers to docs on how to set it up?
>>
>
> http://git.op5.org/git/?p=nagios/merlin.git;a=blob;f=HOWTO;hb=master
> http://git.op5.org/git/?p=nagios/merlin.git;a=blob;f=README;hb=master
> https://wiki.op5.org/merlin:start#guides
>
> If I were you, I'd wait til tomorrow with installing it though, when 1.0.0
> is released as stable. Reading up on the docs and whatnot beforehand is
> still a good idea though.
>
> --
> Andreas Ericsson andreas.erics...@op5.se
> OP5 AB www.op5.se
> Tel: +46 8-230225 Fax: +46 8-230231

Hooray!

Actually, I wanted to point out a few things I found when building the most recent version of merlin recently. At the heart of my issues is that our team is not allowed root access on these servers (long boring corporate story...) so I'm installing everything in an alternate tree.

1) There are a couple of hard-coded paths in ipc.c and node.c for the socket and the binlogs. I'm assuming that's intentional, but it does mean one has to manually edit the source files to point to different paths rather than specifying anything like that during the build process.

2) Because we're trying to put all the files into an alternate tree, the installation of 'mon' from install-merlin.sh didn't really work right. In our case, it made a lot more sense to change

    cp apps/mon.py $root_path/usr/bin/mon

to

    cp apps/mon.py $bindir/mon

otherwise it would put 'mon' in a really weird spot.

I'm guessing these are design decisions on your part, but in case they're not, I thought I'd point them out.

Thanks

Mark
Re: [Nagios-users] high latency
> -----Original Message-----
> From: Andreas Ericsson [mailto:a...@op5.se]
> Sent: Tuesday, December 07, 2010 5:57 PM
> To: Frost, Mark {PBC}
> Cc: Nagios Users List
> Subject: Re: [Nagios-users] high latency
>
> > Any chance that the OP5 site will eventually be
> > configured to allow git through a proxy? It's of course less convenient to
> > use snapshot tarballs, but still workable, of course.
> >
>
> You mean through http? Doesn't it already? I think it's supposed to. I can
> check up on that later. The gitweb page has links for grabbing latest master
> as a tarball though. That might work as an interim solution.
>
> --
> Andreas Ericsson andreas.erics...@op5.se
> OP5 AB www.op5.se
> Tel: +46 8-230225 Fax: +46 8-230231

Andreas,

It's just never worked for me and I thought you'd mentioned some time ago that OP5's git site just didn't support it. I've validated that my version of git (1.7.1) will grab code from a public site via our corporate proxy using other public code (the proxy is set up via the $http_proxy environment variable):

    $ git clone http://github.com/schacon/grack.git
    Initialized empty Git repository in /home/mfrost0/src/grack/.git/
    remote: Counting objects: 85, done.
    remote: Compressing objects: 100% (45/45), done.
    remote: Total 85 (delta 32), reused 80 (delta 31)
    Unpacking objects: 100% (85/85), done.

but...

    $ git clone http://git.op5.org/nagios/merlin.git merlin-src
    Initialized empty Git repository in /home/mfrost0/src/merlin-src/.git/
    fatal: http://git.op5.org/nagios/merlin.git/info/refs not found: did you run git update-server-info on the server?

    $ git clone http://git.op5.org/nagios.git nagios-src
    Initialized empty Git repository in /home/mfrost0/src/nagios-src/.git/
    fatal: http://git.op5.org/nagios.git/info/refs not found: did you run git update-server-info on the server?

so, you know :-(

Thanks

Mark
Re: [Nagios-users] high latency
> -----Original Message-----
> From: Andreas Ericsson [mailto:a...@op5.se]
> Sent: Tuesday, December 07, 2010 9:44 AM
>
> > Hmm. So then I'd be so curious why the 2 distservers which are both using
> > oc[sh]p commands the same way have such radically different latencies.
> >
>
> Agreed. There must be other differences too. Perhaps there's trouble resolving
> from one of the nodes? That usually makes checks run a helluva lot longer than
> they normally have to.

I had another look. While I found a test host that I'd made that was deliberately unreachable, removing it made no difference. Execution times are significantly lower (min/max/avg) on the host with the high latencies than on the one with low latencies. I don't see any unresolvable hosts or, now, any unreachable hosts. Puzzling.

I've always wished there was an easy way to see which processes had high latencies from the web interface without having to view the status.dat file...

> > Either way, you're suggesting that having a NEB module handle the
> > post-check work will eliminate the serialization.
>
> Yes. Sneaking a peek at what's needed in order for an event to get sent to
> master via an eventbroker compared to running an oc[sh]p command renders
> this, more or less:
>

[ good stuff snipped...]

Wow.

> In terms of effort, the difference is sort of like either hopping on one
> leg along the entire great wall of china or walking to the kitchen and grab
> a beer.
>
> > parallelize_check is set to 1 everywhere.
>
> Does one server have a lot of random service failures? On-demand hostchecks
> are still run in parallel.

I don't think so. Intermittent you mean? Not as far as I know or can see.

> > > What version of Nagios are you running?
> >
> > 3.2.1
>
> I take it upgrading makes no difference?

To 3.2.3? I'll probably try that on the new servers, but if things work out I may just move to Merlin + 3.2.4. I wasn't sure I saw anything in the 3.2.3 release that I found compelling for us at the time. As I say, this system now has fairly high visibility so just trying something like that would involve a rather painful internal change process. It's like piloting the QE2 -- I can't change course very quickly :-)

> > Thanks, Andreas. I'm hoping to allocate sufficient resources on the new
> > servers to be able to play with Merlin more there.
>
> It's quite resource-friendly actually. Well, compared to what you're running
> now it's positively feather-light.

I meant more like installing MySQL everywhere, building filesystems to hold the MySQL data, etc. Not so much like I need more memory or more CPUs. I don't remember seeing anything in the Merlin docs (maybe I missed it), but how large would the MySQL database need to be? Pretty small on each box, right? Like 500MB or less?

> > Will I be able to have the performance
> > data from a poller be sent up to a NOC for digestion by pnp4nagios?
>
> Yes, but you'll need the threadsafe version of Nagios you can obtain from
> either CVS or git://git.op5.org/nagios.git for performance-data to work.
> Actually, you need that for Merlin to work.

That's part of the plan. Any chance that the OP5 site will eventually be configured to allow git through a proxy? It's of course less convenient to use snapshot tarballs, but still workable, of course.

> > It may have
> > been a long time ago, but I thought I remember seeing that performance data
> > was not yet implemented.
> >
>
> That was then. This is now :)

Spifftacular!

> > No we'd be using some flavor of SLES.
> >
>
> Should work marvellously then.

Thanks as always for your help, Andreas.

Mark
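On the wish above for an easier way to spot high-latency checks than reading status.dat by eye, a small script can do the reading. The sketch below is not from the thread. It assumes status.dat is made of "servicestatus { ... }" blocks containing key=value lines named host_name, service_description and check_latency, so verify the block and field names against your own file before relying on it.

```python
import re

# Matches a block opener such as "servicestatus {" (assumed status.dat layout).
BLOCK_RE = re.compile(r"^(\w+)\s*\{\s*$")


def parse_status(text):
    """Yield (block_type, fields) for each 'type { key=value ... }' block."""
    block_type, fields = None, {}
    for raw in text.splitlines():
        line = raw.strip()
        if block_type is None:
            match = BLOCK_RE.match(line)
            if match:
                block_type, fields = match.group(1), {}
        elif line == "}":
            yield block_type, fields
            block_type = None
        elif "=" in line:
            key, _, value = line.partition("=")
            fields[key] = value


def worst_latencies(text, limit=10):
    """Return up to 'limit' (latency, host, service) tuples, highest latency first."""
    rows = []
    for block_type, fields in parse_status(text):
        if block_type != "servicestatus" or "check_latency" not in fields:
            continue
        rows.append((float(fields["check_latency"]),
                     fields.get("host_name", "?"),
                     fields.get("service_description", "?")))
    rows.sort(reverse=True)
    return rows[:limit]
```

Calling worst_latencies(open(path).read()) on each distributed server, where path points at that server's status.dat, would give a quick list to compare between the low-latency and high-latency nodes.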
Re: [Nagios-users] high latency
> -----Original Message-----
> From: Andreas Ericsson [mailto:a...@op5.se]
> Sent: Monday, December 06, 2010 6:06 AM
> To: Nagios Users List
> Cc: Frost, Mark {PBC}
> Subject: Re: [Nagios-users] high latency
>
> On 12/03/2010 08:14 PM, Frost, Mark {PBC} wrote:
> >
> > I too struggle with them and I'm running on lightly-loaded physical
> > hardware.
> > We have 2 servers doing the checks sending back to a central server. Both
> > distributed nodes use ocsp/ochp, but they do nothing more than append
> > results to a file (i.e. it exits quickly). Results are handled outside of
> > Nagios.
> >
>
> Try getting rid of the oc[sh]p commands and use Merlin or google for "pnsca"
> or "persistent nsca". There's one available from op5's repositories that may
> or may not work, and there's one from somewhere else that they're apparently
> using to great effect.
>
> Even if it exits quickly, it's still executed serially, so checking halts a
> small period of time for each and every check that runs.

Hmm. So then I'd be so curious why the 2 distservers which are both using oc[sh]p commands the same way have such radically different latencies.

Either way, you're suggesting that having a NEB module handle the post-check work will eliminate the serialization.

> > What's odd is that distserver 1 and distserver 2 are configured the same
> >
> > distserver1:
> > Hosts Checked: 675
> > Services Checked: 4179
> > Active Service Latency: 0.000 / 3.155 / 0.382 sec
> > Active Service Execution Time: 0.000 / 60.038 / 0.145 sec
> >
> > distserver2:
> > Hosts Checked: 261
> > Services Checked: 4289
> > Active Service Latency: 0.000 / 169.977 / 81.300 sec
> > Active Service Execution Time: 0.000 / 15.270 / 0.211 sec
> >
> > yet as you can see, distserver2's latency is much higher and always has
> > been.
> > I tried turning off EPN yesterday on distserver2 and it had no discernible
> > effect.
> > We added 400 new service checks yesterday on distserver2 (just more of the
> > same checks we already do but on 26 new hosts) and the latency went from 35
> > to over 80.
> >
>
> What kind of checks are you running? Some plugins draw a lot of cpu.
> Are any of the checks set to run in serial (grep for parallelize_check in your
> objects.cache file).

parallelize_check is set to 1 everywhere. Most things are NRPE checks (also NRPE to NSClient++). Some are locally running perl scripts and others are locally running things like check_http.

> What version of Nagios are you running?
>

3.2.1

> > The checks we do are very different (Windows, Linux, Unix, many are
> > app-centric) so it's difficult to compare exactly what runs on distserver1
> > and distserver2, but given the jump that was taken yesterday, I'm wondering
> > whether the fact that the checks on these new hosts are all built on
> > dependencies has something to do with it. These hosts (Windows) have a
> > basic check for NRPE and all other checks on the host are dependent on the
> > NRPE check succeeding.
> >
> > I have to move to all new Nagios servers very soon. I'm interested in
> > Merlin, but given its non-production nature just yet, I'm hesitant to
> > commit and I'm not sure if it will help me here.
> >
>
> It's been running at our 400+ customers with very few problems for the past
> month. 0.9.1, released just yesterday, solves the known issues our customers
> have encountered. You might want to take a look at it again. There are some
> issues on FreeBSD though (was that you reporting them?). I just recently got
> a new laptop with better support for running virtual systems, so I'm
> downloading a FreeBSD 8.1 install dvd as we speak. Hopefully I'll have those
> issues sorted out before the end of the week.
>
> --
> Andreas Ericsson andreas.erics...@op5.se

Thanks, Andreas. I'm hoping to allocate sufficient resources on the new servers to be able to play with Merlin more there. Will I be able to have the performance data from a poller be sent up to a NOC for digestion by pnp4nagios? It may have been a long time ago, but I thought I remember seeing that performance data was not yet implemented.

No, we'd be using some flavor of SLES.

Thanks

Mark
Re: [Nagios-users] high latency
Can the use of dependencies also be the cause of increased latencies?

I too struggle with them and I'm running on lightly-loaded physical hardware. We have 2 servers doing the checks sending back to a central server. Both distributed nodes use ocsp/ochp, but they do nothing more than append results to a file (i.e. it exits quickly). Results are handled outside of Nagios.

What's odd is that distserver 1 and distserver 2 are configured the same

distserver1:
Hosts Checked: 675
Services Checked: 4179
Active Service Latency: 0.000 / 3.155 / 0.382 sec
Active Service Execution Time: 0.000 / 60.038 / 0.145 sec

distserver2:
Hosts Checked: 261
Services Checked: 4289
Active Service Latency: 0.000 / 169.977 / 81.300 sec
Active Service Execution Time: 0.000 / 15.270 / 0.211 sec

yet as you can see, distserver2's latency is much higher and always has been. I tried turning off EPN yesterday on distserver2 and it had no discernible effect.

We added 400 new service checks yesterday on distserver2 (just more of the same checks we already do but on 26 new hosts) and the latency went from 35 to over 80.

The checks we do are very different (Windows, Linux, Unix, many are app-centric) so it's difficult to compare exactly what runs on distserver1 and distserver2, but given the jump yesterday, I'm wondering whether the fact that the checks on these new hosts are all built on dependencies has something to do with it. These hosts (Windows) have a basic check for NRPE and all other checks on the host are dependent on the NRPE check succeeding.

I have to move to all new Nagios servers very soon. I'm interested in Merlin, but given its non-production nature just yet, I'm hesitant to commit and I'm not sure if it will help me here.

Thanks

Mark
Re: [Nagios-users] different notification_intervals by contact
> From: Duncan Berriman [mailto:dun...@dcl.co.uk]
> Sent: Wednesday, November 10, 2010 1:00 PM
> To: 'Nagios Users List'
> Subject: Re: [Nagios-users] different notification_intervals by contact
>
> Escalations are a little pesky to get working correctly. Here is an
> example.
> ...

Thanks, Duncan. I've decided to take a somewhat different approach. Ultimately, what they want is for the pager alert to occur at 4x the frequency of the e-mail (every 15 minutes versus every hour).

So this doesn't wind up being all that hard if I make a contact that calls a simple shell script. That shell script then looks at the NOTIFICATIONNUMBER to (in this case) determine if it's a multiple of 4 and if so, sends the alert. In fact, I'm going to make the script take an argument that sets the number to perform the 'modulo' on. So in theory this could be reused if someone wanted to have something run on every other notification number, every 6th, and so on.

The downside as I see it is that Nagios won't quite have an accurate representation of who got what notifications. From Nagios' perspective, it sent an alert to the mailing list, but really, the script acts as a gateway that determines whether a message was actually sent. So the "Notifications" for the host/service as shown in the UI will not be quite correct. But I think they can live with that.

Mark
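The gateway idea described above can be sketched roughly as follows. This is only an illustration in Python rather than the shell script mentioned in the message; the function name `should_send`, the argument order, and the `print` stand-in for the real mail command are all assumptions, not the actual script.

```python
#!/usr/bin/env python3
"""Hypothetical notification gateway: forward only every Nth notification."""
import sys


def should_send(notification_number: int, modulus: int) -> bool:
    """Return True when the notification number is a multiple of `modulus`."""
    if modulus < 1:
        raise ValueError("modulus must be a positive integer")
    return notification_number % modulus == 0


def main(argv: list) -> int:
    # Assumed calling convention: gateway.py <notification number> <modulus>
    # The notification number would be passed in by the Nagios notification
    # command definition via its notification-number macro.
    if len(argv) != 3:
        print("usage: gateway.py NOTIFICATION_NUMBER MODULUS", file=sys.stderr)
        return 2
    number, modulus = int(argv[1]), int(argv[2])
    if should_send(number, modulus):
        # Placeholder: the real script would hand off to the mail command here.
        print("sending alert for notification %d" % number)
    else:
        print("suppressing notification %d" % number)
    return 0


if __name__ == "__main__":
    sys.exit(main(sys.argv))
```

With a 15-minute notification_interval and a modulus of 4, this lets notifications 4, 8, 12 and so on through, which gives the hourly e-mail. One consequence of a plain "multiple of N" test is that the very first notification (number 1) is suppressed; testing `(notification_number - 1) % modulus == 0` instead would send the first one and then every Nth after it.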
[Nagios-users] different notification_intervals by contact
So we're setting up some Nagios checks for a new team and they're asking for something new that I'm not really sure we can do with Nagios.

For any production alerts they want to receive pager alerts every 15 minutes and e-mail alerts every 60 minutes. Since each host/service definition has only a single notification_interval setting and contact definitions don't allow a notification_interval setting, I don't see how this can be done within that context.

We don't currently use escalations for anything, but I've been staring at them and trying to figure out how that might work for us. In terms of using escalations to solve this problem I'm struck by several issues:

- I'd be trying to use escalations to set up an indefinite pattern, not a system where there's a last_notification after which everyone gets the notifications.
- I have to do this for a lot of hosts/services and it doesn't look like I can wildcard service_descriptions (tried it and it failed).

My other thought is to just have 2 checks for the same service, where check A has the 15-minute notification_interval and goes to pagers and check B has a 1-hour notification_interval and goes to e-mail. And that's for a lot of services. I can't really do the duplicate checks on hosts. But either way, you know, "yuck".

I keep thinking there's some easier, more obvious solution to this that's eluding me. Is this something that anyone else has solved? I'm inclined to tell them that we can't do this and get them to unify on one notification_interval like everyone else, but before I do, I thought I'd ask.

Thanks

Mark
Re: [Nagios-users] Scheduled checks falling far behind
Benny,

OK, well, I hope I'm not embarrassing myself with this. It's a perl script and uses Ton Voon's nifty Nagios::Plugin module. I run checks against things I want to know about.

Thinking about it, I guess it would be nice to have the failed hosts/services check alert on percentage of failures. Maybe someday.

Mark

-----Original Message-----
From: C. Bensend [mailto:be...@bennyvision.com]
Sent: Saturday, October 23, 2010 8:44 PM
To: nagios-users@lists.sourceforge.net
Subject: Re: [Nagios-users] Scheduled checks falling far behind

> You can also run, if memory serves, the "nagiostats" command located in
> your Nagios "bin" directory to see this information as well. I actually
> use that nagiostats data in a custom check and graph a lot of those
> latencies and other Nagios performance related info.

Boy, would I *love* to see your method for that! I personally hacked the source of nagiostats to create a custom plugin, but it's a horrible, horrible hack and I'd like to see a cleaner, more scalable method. Can you share?

Benny

--
"No matter how many shorts we have in the system, my guards will be instructed to treat every surveillance camera malfunction as a full-scale emergency."
-- Peter Anspach's Evil Overlord List, #67

#!/usr/bin/perl -w
# Nagios Plugin script to check several Nagios statistics

use strict;
use warnings;
use Nagios::Plugin;

use vars qw(
    $VERSION
    $PROGNAME
    $NAGIOS_BASE
    $NAGIOSSTATS
    $warning
    $critical
    %nagios_stats_data
);

$VERSION = '0.32';

# the location of the Nagios installation root
$NAGIOS_BASE = '/usr/local/nagios';

# the command to run to get 'nagiostats'
$NAGIOSSTATS = $NAGIOS_BASE . '/bin/nagiostats';

# get the base name of this script for use in the examples
use File::Basename;
$PROGNAME = basename($0);

# forward declarations, one handler per command line option
sub load_nagiostats;
sub do_cached_host_checks;
sub do_cached_service_checks;
sub do_command_buffers;
sub do_execution_time;
sub do_failed_hosts;
sub do_failed_services;
sub do_host_latency;
sub do_hosts_checked;
sub do_service_latency;
sub do_services_checked;

# Instantiate Nagios::Plugin object (the 'usage' parameter is mandatory)
my $p = Nagios::Plugin->new(
    usage => "Usage: %s [ --verbose | -v ] [ --debug | -d ]
    [ --cached-host-checks] |
    [ --cached-service-checks ] |
    [ --command-buffers ] |
    [ --execution-time] |
    [ --failed-hosts ] |
    [ --failed-services ] |
    [ --hosts-checked ] |
    [ --host-latency ] |
    [ --services-checked ] |
    [ --services-latency ]",
    version => $VERSION,
    blurb   => 'Nagios plugin to check various Nagios statistics.',
    extra   => "
THRESHOLDs are specified 'min:max' or 'min:' or ':max' (or 'max').
If specified '\@min:max', a warning status will be generated if the
count *is* inside the specified range."
);

# Define and document the valid command line options
# usage, help, version, timeout and verbose are defined by default.
$p->add_arg(
    spec => 'cached-host-checks',
    help => 'Check that number of cached host checks matches the threshold.'
);

$p->add_arg(
    spec => 'cached-service-checks',
    help => 'Check that number of cached service checks matches the threshold.'
);

$p->add_arg(
    spec => 'command-buffers',
    help => 'Check that number of used command buffers matches the threshold.'
);

$p->add_arg(
    spec => 'execution-time',
    help => 'Check the host and service execution times.'
);

$p->add_arg(
    spec => 'failed-hosts',
    help => 'Check that number of failed hosts matches the threshold.'
);

$p->add_arg(
    spec => 'failed-services',
    help => 'Check that number of failed services matches the threshold.'
);

$p->add_arg(
    spec => 'host-latency',
    help => 'Check that average host latency is within the threshold.'
);

$p->add_arg(
    spec => 'hosts-checked',
    help => 'Check that number of hosts checked matches the threshold.'
);

$p->add_arg(
    spec => 'services-checked',
    help => 'Check that number of servi
Re: [Nagios-users] Scheduled checks falling far behind
Matthew,

You don't say, but my guess would be that you have high latencies. That is, for one of several reasons, Nagios is not able to run checks when it thinks it should.

You can see this information and other stats by looking at the Performance item near the bottom of the Nav pane in the Nagios web interface. You can also run, if memory serves, the "nagiostats" command located in your Nagios "bin" directory to see this information as well. I actually use that nagiostats data in a custom check and graph a lot of those latencies and other Nagios performance related info.

From my own experience, I did not pay attention to this information when I started using Nagios, then read about it, made a few tweaks to make it better, then forgot about it. Then as our installation grew and grew, I found that some things got worse again and I had to consider different tuning options.

I would recommend that you first read the "Tuning Nagios For Maximum Performance" section of the docs:

http://nagios.sourceforge.net/docs/3_0/tuning.html

If nothing else, this will give you an idea of some things that can affect latencies.

Additionally, you may find that your average latencies look fine, but then see something with a whopping huge max latency. It can be hard to track down what that is in the UI. I've just looked up that max latency and then quickly looked in the status.dat file to find the service that had that same matching latency and dug into that. You could, for example, have a few checks that aren't really timing out, so a check may take 10 minutes or more to complete, which would really screw up your overall latencies. That is, the checks wouldn't have finished before the next time they were supposed to run.
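The status.dat lookup described above can be automated with something like the sketch below. It assumes the Nagios 3 status file layout of `servicestatus { ... }` blocks containing `key=value` lines such as `host_name`, `service_description` and `check_latency`; the default path and the function names `parse_service_blocks` and `worst_latency` are made up for illustration.

```python
#!/usr/bin/env python3
"""Sketch: list the services with the highest check_latency in status.dat."""
import sys
from typing import Dict, List


def parse_service_blocks(text: str) -> List[Dict[str, str]]:
    """Collect the key=value pairs of every 'servicestatus {' block."""
    services: List[Dict[str, str]] = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if current is None:
            # Only service blocks are of interest; other block types are skipped.
            if line.startswith("servicestatus") and line.endswith("{"):
                current = {}
        elif line == "}":
            services.append(current)
            current = None
        elif "=" in line:
            key, _, value = line.partition("=")
            current[key] = value
    return services


def worst_latency(services: List[Dict[str, str]], top: int = 5) -> List[Dict[str, str]]:
    """Return the `top` services sorted by check_latency, highest first."""
    def latency(svc: Dict[str, str]) -> float:
        try:
            return float(svc.get("check_latency", "0"))
        except ValueError:
            return 0.0
    return sorted(services, key=latency, reverse=True)[:top]


if __name__ == "__main__":
    # Assumed location; adjust to wherever status_file points in nagios.cfg.
    path = sys.argv[1] if len(sys.argv) > 1 else "/usr/local/nagios/var/status.dat"
    with open(path, encoding="utf-8", errors="replace") as handle:
        blocks = parse_service_blocks(handle.read())
    for svc in worst_latency(blocks):
        print(svc.get("check_latency"), svc.get("host_name"),
              svc.get("service_description"), sep="\t")
```

Running it against the live status file prints the handful of services worth digging into first; their execution times in the same blocks would show whether a slow check is what is holding the scheduler up.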
Mark

From: Litwin, Matthew [mlit...@stubhub.com]
Sent: Friday, October 22, 2010 8:29 PM
To: nagios-users@lists.sourceforge.net
Subject: [Nagios-users] Scheduled checks falling far behind

I have been chasing my tail trying to figure out why my RRD files were very sparsely populated, and I am realizing that my checks are falling behind their scheduled times by up to 3 times their set check interval. For example, take a service that should be checked every 5 minutes. In the example below, the time is 00:19:02, the last check was 00:10:30 and the next scheduled check time is 00:13:28. This means it is almost 6 minutes behind schedule and almost 9 minutes since the last check! I find even if I shorten the check interval to say 3 minutes it still behaves about the same. The server has very low load and nagios is hardly working at all (usually below 4% cpu). I haven't touched any of the tuning on this and from what I have read the default settings appear unthrottled. Is there any way to make it "work harder"?

--Service information--
Last Updated: Sat Oct 23 00:19:02 UTC 2010

--Service State Information--
Current Status: OK (for 7d 16h 14m 46s)
Status Information: CPU STATISTICS OK : user=0.12% system=0.00% iowait=0.00% idle=99.88%
Performance Data: 0.12;0.00;0.00;99.88;80;90
Current Attempt: 1/3 (HARD state)
>>> Last Check Time: 10-23-2010 00:10:30
Check Type: ACTIVE
Check Latency / Duration: 612.633 / 2.052 seconds
>>> Next Scheduled Check: 10-23-2010 00:13:28 <<<
Last State Change: 10-15-2010 08:04:16
Last Notification: N/A (notification 0)
Is This Service Flapping? NO (0.00% state change)
In Scheduled Downtime? NO
Last Update: 10-23-2010 00:18:33 ( 0d 0h 0m 29s ago)
[Nagios-users] question about slow startup and retained data
After adding a fair number of hosts/services based on templates -- all with a number of dependent services -- we're seeing Nagios taking a fair amount of time to start up now. We're using Nagios 3.2.1. Startup times seemed to be in the vicinity of 4 minutes. During that time Nagios chews up 100% of one CPU core and eventually 2 CPU cores, then settles down.

I assumed it was time for me to investigate the fast-startup options and deal with at least the dependency checking. Note that the host in question is the "central" node in a distributed setup, so virtually everything it gets is a passive check result.

When I tried starting with '-s', I found that the first block (Object Config Processing Times) went very quickly and then it hung on the second block (Retention Data Times), which ran for a while as indicated. Everything else after that seemed to go fairly quickly, to my surprise.

So apparently, my problem is with retained data. My relevant nagios.cfg entries are as follows:

retain_state_information=1
retention_update_interval=60
use_retained_program_state=1
use_retained_scheduling_info=1
retained_host_attribute_mask=0
retained_service_attribute_mask=0
retained_process_host_attribute_mask=0
retained_process_service_attribute_mask=0
retained_contact_host_attribute_mask=0
retained_contact_service_attribute_mask=0

So we definitely want to make use of historical data. I see in the config file comment that using retained state may come at the cost of increased startup times. None of the speedup options I see claim to help startup time with retained state.

Am I stuck? Do I either need to live with the 194-second processing time (and that will go up as we add more hosts/services over time) or do without retained data?

nagios -s output:

Object Config Source: Config files (uncached)

OBJECT CONFIG PROCESSING TIMES (* = Potential for precache savings with -u option)
----------------------------------
Read:                  0.042305 sec
Resolve:               0.008222 sec *
Recomb Contactgroups:  0.002196 sec *
Recomb Hostgroups:     0.006549 sec *
Dup Services:          0.026616 sec *
Recomb Servicegroups:  0.289033 sec *
Duplicate:             0.012538 sec *
Inherit:               0.005349 sec *
Recomb Contacts:       0.00 sec *
Sort:                  0.01 sec *
Register:              0.076975 sec
Free:                  0.008261 sec
TOTAL:                 0.478046 sec * = 0.350505 sec (73.32%) estimated savings

RETENTION DATA TIMES
----------------------------------
Read and Process:      194.016362 sec
TOTAL:                 194.016362 sec

Timing information on configuration verification is listed below.

CONFIG VERIFICATION TIMES (* = Potential for speedup with -x option)
----------------------------------
Object Relationships:  0.054524 sec
Circular Paths:        0.822133 sec *
Misc:                  0.005159 sec
TOTAL:                 0.881816 sec * = 0.822133 sec (93.2%) estimated savings

EVENT SCHEDULING TIMES
----------------------------------
Get service info:        0.014405 sec
Get host info info:      0.001732 sec
Get service params:      0.13 sec
Schedule service times:  0.000801 sec
Schedule service events: 0.000461 sec
Get host params:         0.01 sec
Schedule host times:     0.000144 sec
Schedule host events:    0.38 sec
TOTAL:                   0.017595 sec

Projected scheduling information for host and service checks is listed below. This information assumes that you are going to start running Nagios with your current config files.

HOST SCHEDULING INFORMATION
----------------------------------
Total hosts:                     870
Total scheduled hosts:           19
Host inter-check delay method:   SMART
Average host check interval:     300.00 sec
Host inter-check delay:          15.79 sec
Max host check spread:           30 min
First scheduled check:           Sat Oct 16 23:26:24 2010
Last scheduled check:            Sat Oct 16 23:28:46 2010

SERVICE SCHEDULING INFORMATION
----------------------------------
Total services:                     7569
Total scheduled services:           34
Service inter-check delay method:   SMART
Average service check interval:     292.94 sec
Inter-check delay:                  8.62 sec
Interleave factor method:           SMART
Average services per host:          8.70
Service interleave factor:          1
Max service check spread:           30 min
First scheduled check:              Sat Oct 16 23:31:16 2010
Last scheduled check:               Sat Oct 16 23:33:26 2010

CHECK PROCESSING INFORMATION
----------------------------------
Check result reaper interval:       2 sec
Max concurrent service checks:      Unlimited

PERFORMANCE SUGGESTIONS
----------------------------------
I have no suggestions - things look okay.

Thanks

Mark
Re: [Nagios-users] Alleviating Nagios i/o contention problem
> -----Original Message-----
> From: Marc Powell [mailto:li...@xodus.org]
> Sent: Sunday, September 26, 2010 11:27 AM
> To: Nagios Users List
> Subject: Re: [Nagios-users] Alleviating Nagios i/o contention problem
>
> On Sep 25, 2010, at 10:53 AM, Max wrote:
>
> > I like the suggestions Matthias makes; those suggestions have worked
> > well for us.
> >
> > RRD updates are very expensive - I am pretty sure without knowing
> > anything more about your system that the RRD writes are causing most
> > of the I/O load.
>
> I no longer have access to this system but my experience has been
> otherwise. We were running a nagios install with nearly 10,000 services
> received by external pollers every 5 minutes, and a cricket install on
> the same machine polling/updating 100,000+ rrd files during the same
> interval. This was on a Poweredge 6850, 5 disk RAID-5.
>
> RRDtool itself writes very little data to disk. I think it's 8 Bytes
> per DS per RRA updated. Linux, though, wants to write 4KB chunks at a
> time so it performs a read-modify-write of 4KB just to update those 8
> Bytes.
>
> The OP can reduce his IO load particularly for RRD updates and help
> Linux better organize it's writes to disk by ensuring that he has
> enough RAM to keep key information for each RRD file in the filesystem
> cache. The OP will need at least 8K * number of rrd files available to
> be used as filesystem buffer cache.
>
> --
> Marc

Thanks very much to all who replied (Breandan, Marc, Max and Matthias, this means you! :-) ).

- I can't say exactly how many checks create perfdata (we have a very heterogeneous set of check types). I can see 9K files in the graph data filesystem, so that would be about 4,500 checks.

- I'm not running updates through syslog. I don't have root on these machines so that would not be helpful to me. I will have to double-check, but I don't believe that I have writing to the pnp4nagios log turned on, except maybe at the lowest level. I don't recall it logging much of anything at that level, but as I say, I'll check.

- According to our performance analysis team, these servers have way more RAM than they're actually using, so I wouldn't think I'm limited by the Linux disk cache here. Perhaps it's just the hardware we have (the i/o rates on a 3-year-old Dell 2950 with a single RAID 5 set) that makes this particularly bad for us. Perhaps on faster hardware we'd not even notice.

- I would assume that rrdcached was built for a reason (i.e. this i/o issue was observed at least somewhere), so it's definitely an avenue I want to try out.

- The ramdisk idea is also interesting. I'm curious, though, about why one would want to rsync it back to the local disk periodically. It's just a run-time status file, right? Unless I misread the docs, it goes away when Nagios is shut down. How would having a local disk copy of status.dat benefit me? Also, nagios.log isn't written to that often in our case (we don't log passive check results, for example). I'm not sure I'd see the benefit for us in putting that on ramdisk. Although... we do have Splunk watch that file, so that would be some additional read overhead I guess.

Thanks!

Mark
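For a sense of scale, Marc's rule of thumb (at least 8K of filesystem cache per RRD file) can be worked through with a few lines of arithmetic. The sketch below reads "8K" as 8 KiB and plugs in the file counts mentioned in the reply; the function names are illustrative only.

```python
"""Back-of-the-envelope cache estimate for the '8K per RRD file' rule."""

PER_RRD_BYTES = 8 * 1024  # assumption: "8K" taken to mean 8 KiB per RRD file


def rrd_cache_bytes(rrd_files: int, per_file_bytes: int = PER_RRD_BYTES) -> int:
    """Minimum filesystem cache, in bytes, suggested by the rule of thumb."""
    if rrd_files < 0:
        raise ValueError("rrd_files must not be negative")
    return rrd_files * per_file_bytes


def to_mib(num_bytes: int) -> float:
    """Convert bytes to MiB for readability."""
    return num_bytes / (1024 * 1024)


if __name__ == "__main__":
    # 4,500 is the estimated number of checks with graphs; 9,000 is the raw
    # file count, used here as a generous upper bound.
    for count in (4500, 9000):
        print("%d files: %.1f MiB" % (count, to_mib(rrd_cache_bytes(count))))
```

That works out to roughly 35 MiB for 4,500 files and about 70 MiB for 9,000, which is small next to the 4GB of RAM on these servers and fits the observation that the disk cache is unlikely to be the limiting factor.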
[Nagios-users] Alleviating Nagios i/o contention problem
Greetings, listers,

We've got an on-going issue with i/o contention. There's the obvious problem that we've got a whole lot of things all writing to the same partition. In this case, there's just one big chunk of RAID 5 disk on a single controller so I don't believe that making more partitions is going to help.

On this same partition we have:

1) Nagios 3.2.1 running as the central/reporting server for a couple of other Nagios nodes that are sending check results via NSCA. Approximately 6-7K checks.

2) pnp4nagios 0.6.2 (with rrd 1.4.2) writing graph data.

There's a 2nd server configured identically to the first that's acting as a "hot spare" so it also receives check data from the 2 distributed nodes and writes its own copy of the graph data locally as well.

At the moment I'm concerned about the graph data, but because I can only see i/o utilization as an aggregate, I can't tell what is the worst component on that filesystem -- status.dat updates? graph data? writes to the var/spool directory? We also expect continued growth so this is only going to get worse.

These systems are quite lightly loaded from a CPU (2 dual-core CPUs) and memory (4GB) perspective, but the i/o to the nagios filesystem is queuing now.

We're about to order new hardware for these servers and I want to make a reasonable choice. I'd like to make some reasonable changes without requiring too exotic of a setup. I believe these servers are currently Dell 2950s and they're all running Suse Linux 10.3 SP2.

My first thought was to potentially move the graphs to a NAS share which would shift that i/o to the network. I don't know how that would work though and it would ultimately be an experiment.

What experiences do people out there have handling this kind of i/o and what have you done to ease it?

Thanks very much!

Mark
Re: [Nagios-users] Running Nagios on Vmware
In my experience, there are weird things that happen with timing. That is, the time on a VM should be sync'd with a time source so no time is lost. However, the VM has what I like to think of as "seconds of variable length". So when we tested with a VM a few years ago, the latency and execution timings and calculations were really screwy. There were checks that Nagios thought ran in "-0.15" seconds, for example.

Considering that this was information that we cared about, we chose to stick with a physical box. And yes, I/O is now an increasing concern for us so a VM would be even less likely.

That said, I know another team who has much lighter requirements (they just want alerts, don't care about latencies (yet)) and they've been on a VM for years now with Nagios.

Mark
Re: [Nagios-users] Merlin/Ninja perfdata status?
> -----Original Message-----
> From: Andreas Ericsson [mailto:a...@op5.se]
> Sent: Friday, June 11, 2010 9:29 AM
> To: Nagios Users List
> Subject: Re: [Nagios-users] Large Installation
>
> Unless you desperately need performance data from satellite systems
> handled properly, I'd invite you to give Merlin and Ninja a try.

Andreas,

We're planning on a Nagios refresh/rearchitecture near the end of this year and I'm really hopeful that we might be able to move to Ninja/Merlin as they do a lot of things we'd really like to have. They also solve some issues we have with our current distributed system. I've been trying to pay attention to the latest developments in this area, but I may have missed something as changes are happening quickly.

We do, however, rely pretty heavily on performance data. I think I saw someone had a hack to do it with Merlin, but it's not really part of Merlin right now, which makes me not want to adopt it for a production Nagios installation. I recall a sort of Merlin roadmap for the rest of the year indicating that upcoming work was to better support distributed setups, if I remember correctly. Is there also work afoot to get perfdata into Merlin, perhaps with the next release?

I'm trying to build some test systems to try the current version of Merlin/Ninja to assess how "production ready" it might be for us by the end of the year when we need to make a decision.

Thanks very much for all the hard work you and others at Op5 have put in to these tools.

Mark
Re: [Nagios-users] Overly persistent contact group
Mordur,

Two thoughts on this.

First, I find that I've been burned many times by contact/contactgroup inheritance. That is, where you define a contactgroup for a host and that gets inherited by the service (when I don't want it to).

Second, I rely a lot on looking at the "Configuration" link at the bottom of the Nagios web interface. That lets you look and see what's really defined for all the objects (hosts, services, contacts, contactgroups, timeperiods, etc). Essentially it allows me to go in and compare what I intended to say in the configuration with what Nagios really is using.

Mark

-----Original Message-----
From: mli...@1984.is [mailto:mli...@1984.is]
Sent: Wednesday, May 26, 2010 2:50 PM
To: nagios-users@lists.sourceforge.net
Subject: [Nagios-users] Overly persistent contact group

Dear list,

I have nagios3 on Debian Lenny. I created a service template and host template for a customer as well as a couple of contacts and a contact group. I specified the contact group in the host and service template and created some host and service definitions based on the aforementioned templates. So I hoped that notifications would be sent to these new contacts as per the setup described above.

This hope failed, and notifications were only sent to an 'admins' contactgroup, which is not specified anywhere in the setup of those hosts, services, contacts, group or template.

When I remove the 'admins' contact group from the config files and run a test of the config, I get this:

Error: Contact group 'admins' specified in service 'SYSTEM STATUS' for host 'host.domain.tld' is not defined anywhere!

Even though this contact group is mentioned nowhere in connection with these hosts or services.

It seems that all contact groups except the one named 'admins' fail to register with the Nagios system and that the 'admins' contact group is somehow automatically associated with all host definitions, regardless of which contact group is actually specified in configuration.

Mordur Ingolfsson
Re: [Nagios-users] trying to fix problem with excessive latency
> -Original Message-
> From: Corey Hickey [mailto:bugfood...@fatooh.org]
> Sent: Tuesday, May 18, 2010 9:30 PM
> To: nagios-users@lists.sourceforge.net
> Subject: [Nagios-users] trying to fix problem with excessive latency
>
> Hello,
>
> I have inherited maintenance of a medium-sized Nagios installation. We currently have 649 hosts and 5415 services. Our setup works nicely, with one exception: Nagios falls behind on host/service checks. Our usual latency once Nagios has been running for a while is about 190-200 seconds. Our Nagios host is reasonably powerful and isn't struggling; it seems that Nagios itself is limited somehow.
>
> Active Service Execution Time: 0.020 / 120.007 / 0.847 sec
> Active Host Execution Time: 0.020 / 11.019 / 0.069 sec
>
> I have a feeling I'm missing something. I would appreciate any suggestions.
>
> Thanks,
> Corey

Corey,

I'm not an expert, but I'll relay some of my own experiences here. I did find that switching on the large installation tweaks (use_large_installation_tweaks in nagios.cfg) made a big difference with our latencies.

We were also following the pre-Nagios 3.2 practice of not doing active host checks. As the tuning guide recommends, it's actually more efficient to do active checks and then enable the cached check results. When we did that, the latencies on the host where we were seeing issues leveled out. (It's good to graph those values, by the way.) They were still high-ish, but the active host checks caused them to stop increasing over time.

Additionally, we found that long-running checks were messing up latencies. As I understand it, if Nagios schedules a check and it then takes a lot longer to return than Nagios expects, that can mess up the scheduling of the other checks. I see you've got some check(s) that ran for a maximum of 120 seconds. When I started seeing some latency problems, I also saw that I had a service check or two that was running for several minutes.
I tracked that down and changed the check so that it completed (or timed out, really) more quickly returning status back to Nagios in a matter of seconds rather than minutes. The latency plummeted after that. In general, our policy is that most checks should complete in under 30 seconds, preferably under 10. In the same vein, I'm not quite sure how you could have any host checks that would take 11 seconds to execute. Are you doing multiple pings/fpings to check that a host is up? Typically you can get away with just a single fping rather than a series of 10 to tell you that a host is not reachable. Hope that helps. Mark
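The tuning described above maps onto a handful of nagios.cfg directives plus a lighter host check command. A minimal sketch follows. The directive names and the check_ping options are standard, but every value and the command name check-host-alive-single are illustrative assumptions, not settings taken from this thread, and should be tuned against your own latency graphs.

```cfg
# nagios.cfg (fragment): all values are assumptions for illustration

# The "large installation tweaks" mentioned above.
use_large_installation_tweaks=1

# Reuse a recent check result instead of running a fresh on-demand check.
# Values are in seconds; 0 disables caching.
cached_host_check_horizon=15
cached_service_check_horizon=15

# Upper bound on how long a single check may run before Nagios gives up on it.
service_check_timeout=30
host_check_timeout=10

# Object configuration (fragment): host check that sends one packet
# instead of a series. The command name is hypothetical.
define command {
    command_name    check-host-alive-single
    command_line    $USER1$/check_ping -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 1
}
```

Lowering the timeouts only caps the damage from a slow check; the checks themselves still need fixing, as in the case described above.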
[Nagios-users] turning off service inheritance of host settings?
I don't suppose there's any way (short of changing the source and recompiling) to turn off the "feature" of inheriting host settings to services? This is one thing I've found *really* annoying about 3.2.0 and would like to have a way to turn it off. I didn't see anything in the docs or in the nagios.cfg file that let me turn this behavior on or off or something I could put in a host or service setting that would let me disable it. Thanks Mark
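As far as the Nagios 3 object inheritance documentation describes it, this "implied inheritance" from host to service covers contact_groups, notification_interval and notification_period, and it only applies when the service neither sets the value itself nor gets it from a template. A common workaround is therefore to set those directives explicitly in a service template. A sketch, assuming hypothetical template, contact group and command names (the host name and service description are borrowed from the contact group thread above):

```cfg
# Service template that sets the implied-inheritance directives explicitly,
# so the values are not picked up from the host definition.
# customer-service, customer-admins and check_system_status are hypothetical.
define service {
    name                    customer-service    ; template name
    use                     generic-service
    contact_groups          customer-admins
    notification_interval   60
    notification_period     24x7
    register                0                   ; template only, not a real service
}

define service {
    use                     customer-service
    host_name               host.domain.tld
    service_description     SYSTEM STATUS
    check_command           check_system_status
}
```

The "Configuration" view in the web interface is a quick way to confirm which values each service actually ended up with.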
Re: [Nagios-users] Does anyone have event log monitors that *work*?
>-Original Message-
>From: C. Bensend [mailto:be...@bennyvision.com]
>Sent: Friday, March 19, 2010 10:32 AM
>To: nagios-users@lists.sourceforge.net
>Subject: [Nagios-users] Does anyone have event log monitors that *work*?
>
>Hey folks,
>
>I have been beating my head against various and sundry walls, tables, and desks for quite some time now, and my brain is starting to get very, VERY mushy.
>
>I need to monitor Windows event logs. You'd think this would be easy, but either the tools available out there don't work (which I doubt, I KNOW you monitor event logs), or I'm man enough to admit that I'm a hopeless idiot.
>
>I've tried to get help on the 3rd-party sites (Steve Shipway's site for Nagios EventLog Service and NSClient++), but they're either away from their desks for an extended period of time or I've just plain worn them out and they're no longer answering my questions.
>
>I beg of you; if you use either of these tools and *successfully* monitor Windows event logs, please give me a hand. I apologize for the length of this email, but this is my last stand - if I cannot get event log monitoring working, this entire project may get scrapped.

Benny,

This is probably overkill for your situation, but you could use Splunk to watch event logs (and other logs) via saved searches and then have it notify Nagios when it spots something. We do this here, as Splunk just has more smarts about dealing with events/logs/matches within certain time windows. But as I say, it IS more overhead than the other solutions you cite.

Mark
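The reply above doesn't say how Splunk hands its findings to Nagios. One common route is a passive service that an alert script updates through the external command file. The sketch below assumes that approach; the host name, service description and the check_dummy command are hypothetical.

```cfg
# Passive service that only changes state when a result is submitted to it.
define service {
    use                     generic-service
    host_name               winserver01         ; hypothetical host
    service_description     EventLog Alerts     ; hypothetical name
    active_checks_enabled   0
    passive_checks_enabled  1
    check_freshness         0
    max_check_attempts      1                   ; act on the first submitted result
    check_command           check_dummy!0       ; required field; assumes such a command is defined
}

# A result is submitted by appending a line of this form to the
# external command file (command_file in nagios.cfg):
# [<timestamp>] PROCESS_SERVICE_CHECK_RESULT;<host_name>;<service_description>;<return_code>;<plugin_output>
```

External commands have to be enabled (check_external_commands=1 in nagios.cfg) for submitted results to be processed.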
Re: [Nagios-users] check_http and proxy
I had this same issue (trying to check sites with SSL through a proxy). Unfortunately, it appears that check_http does not support the 'CONNECT' tunneling method that our proxy servers require for that service. I'm not really sure what other options exist to do this, for instance whether WWW::Mechanize would allow it either. I wish check_http did, though.

Mark

-Original Message-
From: Leo Stolk [mailto:leo.st...@enovation.nl]
Sent: Wednesday, March 17, 2010 10:34 AM
To: Nagios Users List
Subject: Re: [Nagios-users] check_http and proxy

Hi,

You could try to use --ssl in the check.

check_http --ssl -H my.proxy -p my_proxy_port -u http://my.website

Greetings, Leo

-Original Message-
From: Marc-André Doll [mailto:m...@b-care.net]
Sent: Wednesday, March 17, 2010 14:17
To: Nagios-Users
Subject: [Nagios-users] check_http and proxy

Hi list,

I'm trying to check some web applications through a proxy with check_http (version 1.4.13). I googled it and found that, with version 1.4.8, it might be possible to try this ( http://www.mail-archive.com/nagios-users@lists.sourceforge.net/msg11186.html ):

check_http -H my.proxy -p my_proxy_port -u http://my.website

Unfortunately, I have to access my application through HTTPS. So I tried with

check_http -H my.proxy -p my_proxy_port -u https://my.website -vvv

And I obtained this message:

> GET https://my.website HTTP/1.0
> User-Agent: check_http/v2053 (nagios-plugins 1.4.13)
> Connection: close
> Host: my.proxy:my_proxy_port
>
>
> http://my.proxy:my_proxy_porthttps://my.website
> STATUS: HTTP/1.1 400 Bad Request
> []

Does someone know how I should deal with this?

Thank you,
Marc-André
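For the plain-HTTP case, the invocation quoted in this thread can be wrapped in a command definition along these lines. This is a sketch only: the command name, host name and port 3128 are hypothetical, and as discussed above this form did not work for HTTPS targets with check_http 1.4.13.

```cfg
# Connect to the proxy ($ARG1$, port $ARG2$) and request the full target URL ($ARG3$).
define command {
    command_name    check_http_via_proxy
    command_line    $USER1$/check_http -H $ARG1$ -p $ARG2$ -u $ARG3$
}

define service {
    use                     generic-service
    host_name               webapp01            ; hypothetical host
    service_description     HTTP via proxy
    check_command           check_http_via_proxy!my.proxy!3128!http://my.website
}
```

Passing the proxy, port and URL as arguments keeps one command definition reusable across several proxied sites.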
Re: [Nagios-users] Nagios DST bug and upcoming DST-time switch in Europe
>-Original Message-
>From: Ton Voon [mailto:ton.v...@opsera.com]
>Sent: Tuesday, March 09, 2010 3:33 AM
>To: Nagios Users Mailinglist
>Subject: Re: [Nagios-users] Nagios DST bug and upcoming DST-time switch in Europe
>
>On 9 Mar 2010, at 08:19, Mark Elsen wrote:
>
>> Nagios 3.2.0
>>
>> - By the end of the month, Europe will switch to DST.
>> Will I be affected by the Nagios DST bug, which results in Nagios becoming dysfunctional?
>>
>> Which countermeasures can I take to prevent being struck by this problem?
>
>The bug, which stops all Nagios monitoring for 24 hours, occurs when "time moves backwards". This does not happen when "time moves forward". However, any other timeperiods (such as 09:00-17:00) may be incorrect by an hour, which is obviously not as serious. This is true for Nagios 3.2.0.
>
>Ton

Unless I'm mistaken, we had this timeperiod problem occur here. Some alerts were sent during a timeperiod for which notifications are not enabled. I did restart Nagios after DST went into effect just for fun, but that was before these alerts went out.

I know that the fix when this bug showed up last fall was to run a script against one of the .dat files (retention.dat?). However, that was for the monitoring problem. Is there something similar that we need to do to correct timeperiods?

Thanks
Mark