Re: [Nagios-users] Have we reached some kind of Nagios limit?

2012-02-20 Thread Daniel Wittenberg
Have you tried running in debug mode?

Dan

From: Frost, Mark {BIS} [mailto:mark.fro...@pepsico.com]
Sent: Saturday, February 18, 2012 11:48 AM
To: Nagios Users List
Subject: [Nagios-users] Have we reached some kind of Nagios limit?

A couple of days ago, I ran into a problem I've never seen before.  We run a 
single large instance with mostly very heterogeneous checks and host types.  
One particular group of Windows hosts, however, are all quite similar and they, 
like most of our other checks, rely on the use of templates.  I needed to add 10 
more hosts of this particular type and typically all I have to do is just 
define the hosts and the service checks happen automatically as the host 
templates include them in a group that includes all the relevant checks.

I added maybe 5 of these new hosts, ran the pre-flight check and restarted.  
After the restart I started noticing that our failing service checks (for all 
services) went from around 260 to over 4K.  All of those new failing checks 
were only on hosts of this same type (that particular application on Windows 
servers I mentioned above which is also what these new hosts were part of) and 
they were all reporting the same failure condition:

(Return code of 127 is out of bounds - plugin may be missing)

Now ordinarily this would indicate a client-side issue, but there isn't one.  I 
can validate that by running check_nrpe manually against any of these hosts.   
I could imagine a typo that would cause this, particularly against other 
existing hosts that had not been touched, but I double-checked and did not find 
one (I was just adding host definitions to this group - nothing else).
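
For context on the number itself: Nagios expects a plugin to exit with 0 to 3, and 127 is the status a POSIX shell reports when it cannot find or start the command it was asked to run. The following is a minimal sketch of that behaviour only; the plugin name is made up, and the thread does not establish which code path inside Nagios produced the 127 here.

```python
import subprocess


def run_via_shell(command: str) -> int:
    """Run a command line through /bin/sh and return its exit status."""
    result = subprocess.run(
        ["/bin/sh", "-c", command],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode


if __name__ == "__main__":
    # Hypothetical plugin name chosen so that it does not exist anywhere.
    print(run_via_shell("no_such_plugin_for_demo --help"))  # shell cannot find it
    print(run_via_shell("exit 2"))  # a normal CRITICAL-style plugin exit status
```

A status outside 0 to 3 therefore says more about how the command was launched on the Nagios server than about the remote host.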

I cloned this environment and went to play with it in a non-production instance 
that was identical to the production Nagios instance except for a slightly newer 
version of Merlin in the backend (1.1.14 for the non-prod instance, 1.1.13 
something for the production one), but both used the same Nagios 3.3.1 + 
downtime locking patches.   I was able to reproduce the situation and after a 
couple of days of trial and error I've still not been able to completely 
isolate the issue, but I've determined that


  *   it's not got anything to do with the mk-livestatus module (turned it off, 
turned it back on), but it's been very helpful in figuring out which of the 
13K+ services and 1200+ hosts are impacted
  *   it doesn't seem to be about adding random hosts and services.   I can add 
others and this doesn't happen
  *   the host definition uses a template that puts the host in a hostgroup.  
Those hostgroups are then used in service definitions (12-15 services, 
depending on which group).   I had thought that perhaps if the hostgroup_name 
line of the service definition expanded to too many hosts that could be the 
problem.  I broke the service definitions down into 2 definitions, one for each 
production hostgroup rather than combining them, and that didn't matter.
  *   the service templates that the service definitions use for these hosts 
all add them to a common servicegroup.  My current line of thinking leads me to 
believe it's got something to do with this.   With a particular test scenario I 
created where I create a new host, but exclude it from the hostgroup 
definitions and instead manually create service definitions for this host (I 
know this one more host is right on the cusp of this problem), I find that 
when I add it so the 4,331st service gets added to the servicegroup, the 
problem starts.  If I remove that from that host's service definition all the 
other hosts' services recover.   However, based on this thinking, if I just 
comment out the servicegroup add from the service template these hosts use, the 
problem should stop - it doesn't.
  *   the only affected services are those on hosts in the hostgroups I'm 
changing.   Other unrelated hosts and services are unaffected.   There are 3 
hostgroups: Production Appname Hosts 1, Production Appname Hosts 2, and All 
Appname Hosts, which is obviously a combination of the two.   All Appname Hosts 
is around 324 hosts.

I'm not really sure what to try at this point.  It does seem like I've hit some 
kind of internal limitation with Nagios, but I don't know how to determine 
anything else about it beyond this.  If I were able to completely isolate this 
to say, not adding anything to a single servicegroup, I could avoid that and 
continue adding hosts as we need them, but I have so far not been able to find 
such a workaround.   If there is a limitation like this, it would, of course, be 
nice for the pre-flight check to tell me that I can't have more than X members 
of a servicegroup or something.

Other info:

Nagios version: Nagios 3.3.1 with locking patches
Merlin backend: 1.1.13+ (production), 1.1.14 (test)
MK-Livestatus module 1.1.12p6 installed (uninstalling it makes no difference)
OS: SLES 11.1 Linux, 64-bit
Memory: 12GB
CPU: 2x 2.4GHz quad-core Xeon

What can I do?

Thanks

Mark

Re: [Nagios-users] Have we reached some kind of Nagios limit?

2012-02-20 Thread Frost, Mark {BIS}
Thanks, Sven.  I'm almost certain you're correct.  For some reason I had 
thought that when I turned on large_installation_tweaks some time ago, 
environment variables were turned off.   However, now I see that it only turns 
off summary macros.   Not sure how I misinterpreted that.
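
A quick way to check which of the two settings is actually in effect is to read them out of the main config file. This is a rough sketch, assuming the directives are spelled `use_large_installation_tweaks` and `enable_environment_macros` in nagios.cfg and that environment macros count as enabled when the second directive is absent (an assumption about the Nagios 3.x default, worth verifying against your version's documentation). The helper names are made up for illustration.

```python
def read_main_config(text: str) -> dict:
    """Parse key=value lines from nagios.cfg-style text, skipping comments.

    Repeated keys (such as cfg_file) keep only their last value, which is
    acceptable here because the two directives of interest appear once.
    """
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def environment_macros_enabled(settings: dict) -> bool:
    # Assumption: a missing directive behaves like enable_environment_macros=1.
    return settings.get("enable_environment_macros", "1") == "1"


if __name__ == "__main__":
    sample = """
    # sample fragment, not a complete nagios.cfg
    use_large_installation_tweaks=1
    enable_environment_macros=0
    """
    cfg = read_main_config(sample)
    print("large installation tweaks:", cfg.get("use_large_installation_tweaks"))
    print("environment macros enabled:", environment_macros_enabled(cfg))
```

The point of the sketch is that the two directives are independent: setting the first says nothing about the second.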

So it would make sense that in the case of this particular collection of hosts 
and services, Nagios was probably creating such a large set of environment 
variables that it was exceeding a shell limit and preventing the exec'd check 
from properly executing.  I did some preliminary tests and turning environment 
macros off cleared things up.  And of course, in addition to fixing this issue, 
I believe I'm going to get a performance boost (or at least a resource usage 
drop) as an added bonus.
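
To get a feel for how a growing group could cross such a limit, the sketch below adds up the size of one comma-joined environment variable as members are appended and reports the first member count that no longer fits. Everything specific in it is an assumption for illustration: the variable name, the "host,service" member format, the synthetic host and service names, and the 128 KiB per-string limit (32 pages of 4 KiB, which is what Linux typically applies to a single argument or environment string). It does not claim to reproduce what Nagios actually exported in this case.

```python
PER_STRING_LIMIT = 32 * 4096  # assumption: Linux with 4 KiB pages (131072 bytes)


def first_oversized_count(members, name, limit=PER_STRING_LIMIT):
    """Return the smallest member count whose NAME=value string exceeds limit.

    members is a sequence of (host, service) pairs joined as
    "host,service,host,service,...". Returns None if everything fits.
    """
    size = len(name.encode()) + 2  # the "=" separator plus the trailing NUL
    for count, (host, service) in enumerate(members, start=1):
        if count > 1:
            size += 1  # comma between consecutive members
        size += len(f"{host},{service}".encode())
        if size > limit:
            return count
    return None


if __name__ == "__main__":
    # Synthetic data: 5000 members with made-up host and service names.
    members = [(f"apphost{n:04d}", "Appname Service Check") for n in range(5000)]
    # Hypothetical variable name used only for this estimate.
    print(first_oversized_count(members, "NAGIOS_SERVICEGROUPMEMBERS"))
```

With names of realistic length, the crossover lands at a few thousand members, which is at least the same order of magnitude as the 4,331st service mentioned in the original report.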

Unfortunately the only place I do currently use environment variables is with 
several event scripts.   Changing those scripts to use command-line variables 
is proving to be rather a pain in the butt given how many variables I have them 
check.   But I'm getting there.
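
One way to structure that conversion is to have the event script take named command-line options and have the command definition pass the macros explicitly. The following is a sketch only: the option names, the script path in the comment, and the `should_act` rule are invented for illustration, while `$HOSTNAME$`, `$SERVICEDESC$`, `$SERVICESTATE$`, `$SERVICESTATETYPE$` and `$SERVICEATTEMPT$` are standard Nagios macros.

```python
import argparse

# Hypothetical command definition that would feed this script:
#   command_line /path/to/handler.py --host "$HOSTNAME$" --service "$SERVICEDESC$" \
#       --state $SERVICESTATE$ --state-type $SERVICESTATETYPE$ --attempt $SERVICEATTEMPT$


def parse_event_args(argv):
    """Parse the values that were previously read from NAGIOS_* variables."""
    parser = argparse.ArgumentParser(description="Example service event handler")
    parser.add_argument("--host", required=True)
    parser.add_argument("--service", required=True)
    parser.add_argument(
        "--state", required=True, choices=["OK", "WARNING", "CRITICAL", "UNKNOWN"]
    )
    parser.add_argument("--state-type", required=True, choices=["SOFT", "HARD"])
    parser.add_argument("--attempt", required=True, type=int)
    return parser.parse_args(argv)


def should_act(args) -> bool:
    # Illustrative rule: only react once a CRITICAL state has become HARD.
    return args.state == "CRITICAL" and args.state_type == "HARD"


if __name__ == "__main__":
    # A real handler would pass sys.argv[1:]; a fixed sample is used here.
    sample = ["--host", "web01", "--service", "HTTP", "--state", "CRITICAL",
              "--state-type", "HARD", "--attempt", "3"]
    args = parse_event_args(sample)
    print(args.host, args.service, should_act(args))
```

Named options keep the command definition readable when many macros are passed, and a missing value fails loudly at parse time instead of silently reading an unset environment variable.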

Thanks very much for your help!

Mark

-----Original Message-----
From: Sven Nierlein [mailto:sven.nierl...@consol.de] 
Sent: Saturday, February 18, 2012 2:05 PM
To: Nagios Users List
Subject: Re: [Nagios-users] Have we reached some kind of Nagios limit?

On 2/18/12 18:48, Frost, Mark {BIS} wrote:
> ...
> I added maybe 5 of these new hosts, ran the pre-flight check and restarted.  
> After the restart I started noticing that our failing service checks (for all 
> services) went from around 260 to over 4K.  All of those new failing checks 
> were only on hosts of this same type (that particular application on Windows 
> servers I mentioned above which is also what these new hosts were part of) 
> and they were all reporting the same failure condition:
> (Return code of 127 is out of bounds - plugin may be missing)
> ...
> What can I do?

Disable environment macros. You hit the limit on the maximum length of a new 
shell command, which can be pretty huge when using env macros.
It's strongly advised to turn them off when using mklivestatus anyway.

  Sven

___
Nagios-users mailing list
Nagios-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting 
any issue. 
::: Messages without supporting info will risk being sent to /dev/null


