Re: [Nagios-users] Passive-only master still pinging

2012-10-24 Thread Mike Lindsey
In about a month I'll be getting an official request to have freshness 
checking turned on, with all the commands being something like:

echo "Stale check!";exit 2

So, unfortunately ensuring that the master never ever accidentally runs 
a check is important, even beyond the current issue of hosts without routes.

On 10/23/12 7:34 PM, booleanena...@gmail.com wrote:
> You can always set the check command for the host to execute the check_dummy 
> plugin so that if it doesn't get the result and decides to run an active 
> check check_dummy will force it to be in an up state.
>
> -----Original Message-
> From: Mike Lindsey 
> Date: Tue, 23 Oct 2012 14:00:36
> To: Nagios Users List
> Reply-To: Nagios Users List 
> Subject: [Nagios-users] Passive-only master still pinging
>
> I've got a passive-only master that is configured to never execute
> checks.  Yet it's still performing ping checks for some hosts at some
> times.  This is mostly just annoying, but when it decides to ping hosts
> that it doesn't have a route to, pagers go off.
>
> I've got 30k services in this config, so debug isn't really an easy option.
>
> Seeing this on 3.3.1.  Any ideas?
>
> # excerpt from nagios.cfg
> accept_passive_host_checks=1
> cached_host_check_horizon=15
> check_for_orphaned_hosts=0
> check_host_freshness=0
> enable_predictive_host_dependency_checks=0
> execute_host_checks=0
> host_inter_check_delay_method=s
> max_host_check_spread=30
> obsess_over_hosts=0
> passive_host_checks_are_soft=1
> translate_passive_host_checks=0
> use_aggressive_host_checking=0
>


-- 
Mike Lindsey


--
Everyone hates slow websites. So do we.
Make your web apps faster with AppDynamics
Download AppDynamics Lite for free today:
http://p.sf.net/sfu/appdyn_sfd2d_oct
___
Nagios-users mailing list
Nagios-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting 
any issue. 
::: Messages without supporting info will risk being sent to /dev/null


[Nagios-users] Passive-only master still pinging

2012-10-23 Thread Mike Lindsey
I've got a passive-only master that is configured to never execute 
checks.  Yet it's still performing ping checks for some hosts at some 
times.  This is mostly just annoying, but when it decides to ping hosts 
that it doesn't have a route to, pagers go off.

I've got 30k services in this config, so debug isn't really an easy option.

Seeing this on 3.3.1.  Any ideas?

# excerpt from nagios.cfg
accept_passive_host_checks=1
cached_host_check_horizon=15
check_for_orphaned_hosts=0
check_host_freshness=0
enable_predictive_host_dependency_checks=0
execute_host_checks=0
host_inter_check_delay_method=s
max_host_check_spread=30
obsess_over_hosts=0
passive_host_checks_are_soft=1
translate_passive_host_checks=0
use_aggressive_host_checking=0

-- 
Mike Lindsey




Re: [Nagios-users] Segmentation Fault on config verification

2012-10-22 Thread Mike Lindsey
Looks like I had a hostgroup that listed itself as a hostgroup member.  There 
were 11 other hostgroup members, and 4220 char temp_hostgroup->members 
and newmembers strings.

In xdata/xodtemplate.c, in 
xodtemplate_recombobulate_hostgroup_subgroups() the error was occurring 
in the while loop at:
"""
strcat(temp_hostgroup->members, newmembers);
"""
Not entirely sure what the root cause of the segmentation fault 
(fragmented memory?) might be, but updating my configuration to not 
include self-referential hostgroups has resolved the issue.
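
For anyone who wants to catch this before Nagios does, here is a minimal sketch in Python. It is not part of Nagios: the parser, the function names and the SAMPLE config are made up for illustration, and it only reads hostgroup_name and hostgroup_members from "define hostgroup" blocks (no templates, no cfg_dir walking). It reports any hostgroup that can reach itself through hostgroup_members, directly or through other groups.

```python
import re

# Matches one "define hostgroup { ... }" block in a Nagios object file.
HOSTGROUP_BLOCK = re.compile(r"define\s+hostgroup\s*\{(.*?)\}", re.DOTALL)


def parse_hostgroups(text):
    """Return {hostgroup_name: [hostgroup_members...]} for every hostgroup block."""
    groups = {}
    for block in HOSTGROUP_BLOCK.findall(text):
        directives = {}
        for line in block.splitlines():
            line = line.split(";", 1)[0].strip()  # drop inline ';' comments
            if not line or line.startswith("#"):
                continue
            parts = line.split(None, 1)
            if len(parts) == 2:
                directives[parts[0]] = parts[1].strip()
        name = directives.get("hostgroup_name")
        if name:
            members = directives.get("hostgroup_members", "")
            groups[name] = [m.strip() for m in members.split(",") if m.strip()]
    return groups


def find_cycles(groups):
    """Return the sorted names of hostgroups that can reach themselves."""
    cyclic = []
    for start in groups:
        seen, stack = set(), list(groups[start])
        while stack:
            current = stack.pop()
            if current == start:
                cyclic.append(start)
                break
            if current not in seen:
                seen.add(current)
                stack.extend(groups.get(current, []))
    return sorted(cyclic)


# Hypothetical config: "www" lists itself as a hostgroup member.
SAMPLE = """
define hostgroup {
    hostgroup_name      www
    hostgroup_members   www,db
}
define hostgroup {
    hostgroup_name      db
    members             db1,db2
}
"""

print(find_cycles(parse_hostgroups(SAMPLE)))
```

Run against the concatenated object files before verification, an empty list means no hostgroup refers back to itself.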

On 10/22/12 12:17 PM, Mike Lindsey wrote:
> Seeing this on 3.3.1, and 3.4.1.  Tried to reproduce with 4, but can't
> build from the current git repository.
>
> Migrating from obj_file to obj_dir style nagios.cfg, and on validation
> of my Master configuration I get a Segmentation fault, that looks to be
> coming right after Nagios closes nagios.cfg.
>
> The same format of configuration, generated from the same script works
> fine for poller nodes.  The main differences in the poller node
> configuration are size, no escalations, and no dependencies.
>
> The end of the truss output:
> mmap(0x0,783,PROT_READ,MAP_PRIVATE,5,0x0)= 34365812736 (0x8005cb000)
> munmap(0x8005cb000,783)  = 0 (0x0)
> close(5) = 0 (0x0)
> stat("/usr/local/ironport/akeos/bin/tmp/ops-mon-nagios1.vega/timeperiods/workhours.cfg",{
> mode=-rw-r--r-- ,inode=829664,size=389,blksize=4096 }) = 0 (0x0)
> open("/usr/local/ironport/akeos/bin/tmp/ops-mon-nagios1.vega/timeperiods/workhours.cfg",O_RDONLY,00)
> = 5 (0x5)
> fstat(5,{ mode=-rw-r--r-- ,inode=829664,size=389,blksize=4096 }) = 0 (0x0)
> mmap(0x0,389,PROT_READ,MAP_PRIVATE,5,0x0)= 34365812736 (0x8005cb000)
> munmap(0x8005cb000,389)  = 0 (0x0)
> close(5) = 0 (0x0)
> getdirentries(0x4,0x800d27000,0x1000,0x800d15668,0x80aece00,0x7fffe410)
> = 0 (0x0)
> lseek(4,0x0,SEEK_SET)= 0 (0x0)
> close(4) = 0 (0x0)
> munmap(0x8005c9000,) = 0 (0x0)
> close(3)     = 0 (0x0)
> SIGNAL 11 (SIGSEGV)
> process exit, rval = 0
>
> I'm digging into the source, but if anyone has any ideas, I'm ears.
>


-- 
Mike Lindsey




[Nagios-users] Segmentation Fault on config verification

2012-10-22 Thread Mike Lindsey
Seeing this on 3.3.1, and 3.4.1.  Tried to reproduce with 4, but can't 
build from the current git repository.

Migrating from obj_file to obj_dir style nagios.cfg, and on validation 
of my Master configuration I get a Segmentation fault, that looks to be 
coming right after Nagios closes nagios.cfg.

The same format of configuration, generated from the same script works 
fine for poller nodes.  The main differences in the poller node 
configuration are size, no escalations, and no dependencies.

The end of the truss output:
mmap(0x0,783,PROT_READ,MAP_PRIVATE,5,0x0)= 34365812736 (0x8005cb000)
munmap(0x8005cb000,783)  = 0 (0x0)
close(5) = 0 (0x0)
stat("/usr/local/ironport/akeos/bin/tmp/ops-mon-nagios1.vega/timeperiods/workhours.cfg",{ mode=-rw-r--r-- ,inode=829664,size=389,blksize=4096 }) = 0 (0x0)
open("/usr/local/ironport/akeos/bin/tmp/ops-mon-nagios1.vega/timeperiods/workhours.cfg",O_RDONLY,00) = 5 (0x5)
fstat(5,{ mode=-rw-r--r-- ,inode=829664,size=389,blksize=4096 }) = 0 (0x0)
mmap(0x0,389,PROT_READ,MAP_PRIVATE,5,0x0)= 34365812736 (0x8005cb000)
munmap(0x8005cb000,389)  = 0 (0x0)
close(5) = 0 (0x0)
getdirentries(0x4,0x800d27000,0x1000,0x800d15668,0x80aece00,0x7fffe410) = 0 (0x0)
lseek(4,0x0,SEEK_SET)= 0 (0x0)
close(4) = 0 (0x0)
munmap(0x8005c9000,) = 0 (0x0)
close(3) = 0 (0x0)
SIGNAL 11 (SIGSEGV)
process exit, rval = 0

I'm digging into the source, but if anyone has any ideas, I'm ears.

-- 
Mike Lindsey




Re: [Nagios-users] check_http throwing 141 exit on ssl error

2012-09-14 Thread Mike Lindsey
On 9/14/12 11:25 AM, Justin T Pryzby wrote:
> This may be unrelated to the question of why it's exiting with a
> nonstandard, out of range exit status, but is port 83 really HTTP over
> SSL?  It seems as if the plugin sent an ssl initiation, and the remote
> side closed the connection (perhaps because it wasn't ssl?).
>
> Later, the plugin tried to gracefully end the ssl session, but the
> socket was already closed (ECONNRESET), resulting in EPIPE, which I
> think is expected.
When the remote device isn't in this current state that's causing it to 
close inbound connections immediately after the socket is opened, yes, 
that's an https port.

On 9/14/12 11:54 AM, Andreas Ericsson wrote:
> On 09/14/2012 08:09 PM, Mike Lindsey wrote:
>> I'm typically used to seeing this kind of error code for a missing
>> plugin, but I've got a device that is accepting tcp connections and then
>> due to a local misconfiguration, immediately closing them.
>>
>> But rather than a normal critical I'm getting:
>> """
>> (Return code of 141 is out of bounds)
>> """
>>
> SIGPIPE has sig id 13. When a program catches a signal, it returns
> the sigid as a negative number, but the field for the exit status
> is unsigned, so it gets translated to 128 + sigid instead.
>
> As I read it back, I realize that doesn't exactly make supersense
> to anyone not familiar with integer math as computers do it, but
> I can assure you that's the reason.
Yup, makes sense now, and if I'd bothered to hit the man page, I'd have 
grokked that.  As it was I just assumed 141 was SIGPIPE, so almost there 
but with an invalid (and irrelevant) assumption.
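
The arithmetic is easy to check. bash and most other shells report a process that was killed by signal N as exit status 128 + N, so 141 - 128 = 13, which is SIGPIPE on Linux and the BSDs. A small sketch in Python (describe_exit_status is a made-up helper, not anything from Nagios or the plugins) that decodes a value of $? the same way:

```python
import signal

# Nagios plugin return codes in their defined order.
PLUGIN_STATES = ("OK", "WARNING", "CRITICAL", "UNKNOWN")


def describe_exit_status(status):
    """Describe a shell-style exit status such as the value of $?."""
    if 0 <= status <= 3:
        return PLUGIN_STATES[status]
    if status > 128:
        # Shell convention: 128 + signal number for a signal death.
        signum = status - 128
        try:
            return "killed by " + signal.Signals(signum).name
        except ValueError:
            return "killed by signal %d" % signum
    return "out of bounds (%d)" % status


print(describe_exit_status(141))
```

A plugin could also exit with a value above 128 on its own, so treat the signal reading as the likely explanation rather than proof; strace settles it, as in the output quoted below.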

>> When run by hand I have:
>> """
>> root@ops-mon-nagios3 /usr/local/nagios/libexec $ ./check_http -H
>> device.domain.com -w "10" -c "20" -S -p "83" -f follow
>> CRITICAL - Cannot make SSL connection
>> root@ops-mon-nagios3 /usr/local/nagios/libexec $ echo $?
>> 141
>> """
>>
>> write(1, "CRITICAL - Cannot make SSL conne"..., 39) = 39
>> write(3, "\200w\1\3\1\0N\0\0\0
>> \0\0009\0\0008\0\0005\0\0\26\0\0\23\0\0\n\7\0\300"..., 121) = -1 EPIPE
>> (Broken pipe)
>> --- SIGPIPE (Broken pipe) @ 0 (0) ---
>> +++ killed by SIGPIPE +++
>>
> And there's the SIGPIPE. Case closed.
>

Would it be appropriate for the check (and potentially any other 
nagios-plugins check that opens a socket) to trap SIGPIPE and return a 
normal valid critical?

As is, any http or https (or smtp or ldap, etc) check that's hitting a 
device behaving in this manner, is going to display a non-useful message 
in the Nagios UI, instead of the actual critical output. This is an 
error condition on a remote host, covered by what is normally valid 
working monitoring.  If this should be more cleanly caught by lower 
level parent dependency monitoring, how?  check_tcp returns 'ok' because 
the port opens.

If this is expected and desired behavior, should the output be updated 
to not include the misleading 'CRITICAL' prefix?
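
Until something like that exists in the plugins, one workaround is a wrapper around the check rather than a change to check_http itself. The following is only a sketch under that assumption: normalize_result and run_plugin are hypothetical names, and it relies on Python's subprocess module reporting death by signal N as a return code of -N. Anything outside 0-3 is turned into a CRITICAL that keeps whatever text the plugin managed to print.

```python
import signal
import subprocess

CRITICAL = 2


def normalize_result(returncode, output):
    """Map an out-of-range plugin result to CRITICAL, keeping the plugin's text."""
    if 0 <= returncode <= 3:
        return returncode, output
    if returncode < 0:
        # subprocess reports "killed by signal N" as -N.
        try:
            reason = signal.Signals(-returncode).name
        except ValueError:
            reason = "signal %d" % -returncode
    else:
        reason = "exit status %d" % returncode
    text = output.strip() or "CRITICAL - plugin produced no output"
    return CRITICAL, "%s (plugin ended with %s)" % (text, reason)


def run_plugin(argv, timeout=30):
    """Run a plugin command line and return a normalized (code, output) pair."""
    proc = subprocess.run(
        argv, stdout=subprocess.PIPE, universal_newlines=True, timeout=timeout
    )
    return normalize_result(proc.returncode, proc.stdout)
```

With this in front of check_http, the case above would show "CRITICAL - Cannot make SSL connection (plugin ended with SIGPIPE)" in the UI instead of the out-of-bounds message. The wrapper script would print the returned text and exit with the returned code.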

-- 
Mike Lindsey




[Nagios-users] check_http throwing 141 exit on ssl error

2012-09-14 Thread Mike Lindsey
I'm typically used to seeing this kind of error code for a missing 
plugin, but I've got a device that is accepting tcp connections and then 
due to a local misconfiguration, immediately closing them.

But rather than a normal critical I'm getting:
"""
(Return code of 141 is out of bounds)
"""

When run by hand I have:
"""
root@ops-mon-nagios3 /usr/local/nagios/libexec $ ./check_http -H device.domain.com -w "10" -c "20" -S -p "83" -f follow
CRITICAL - Cannot make SSL connection
root@ops-mon-nagios3 /usr/local/nagios/libexec $ echo $?
141
"""

Anyone seen this before?  Is this resolved in nagios-plugins > 1.4.15?

Here's some potentially useful, lightly filtered strace output, showing 
it exiting on a SIGPIPE:
"""
socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 3
connect(3, {sa_family=AF_INET, sin_port=htons(83), sin_addr=inet_addr("68.232.133.59")}, 16) = 0
write(3, "\200w\1\3\1\0N\0\0\0 \0\0009\0\0008\0\0005\0\0\26\0\0\23\0\0\n\7\0\300"..., 121) = -1 ECONNRESET (Connection reset by peer)
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 5), ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2ab9e22dd000
write(1, "CRITICAL - Cannot make SSL conne"..., 39) = 39
write(3, "\200w\1\3\1\0N\0\0\0 \0\0009\0\0008\0\0005\0\0\26\0\0\23\0\0\n\7\0\300"..., 121) = -1 EPIPE (Broken pipe)
--- SIGPIPE (Broken pipe) @ 0 (0) ---
+++ killed by SIGPIPE +++
"""

-- 
Mike Lindsey




Re: [Nagios-users] configure receiving snmp traps

2012-09-07 Thread Mike Lindsey

On 9/5/12 1:00 AM, Marco Borsani wrote:


> I read many docs, but I still have problem to configure nagios 3.x to
> receive the traps.
>
> May someone explain the steps to follow to configure correctly this
> issue ?
>
> Is it necessary other SW ?

You'll need to ensure that snmptrapd is enabled on your Nagios poller, 
and the typical route from there to get snmp traps submitted into Nagios 
is to install SNMPTT.


http://snmptt.sourceforge.net/

I recommend reading the docs for these, but, a very basic snmptrapd.conf 
would be:

## snmptrapd.conf
snmpTrapdAddr udp:localhost,udp:YOUR_IP_HERE,tcp:YOUR_IP_HERE

authCommunity log,execute public
logOption f/var/log/snmptrapd.log
traphandle default /usr/sbin/snmptt -i /usr/local/share/snmp/snmptt.ini
##

And then in the TrapFiles section of snmptt.ini you might have:
##
[TrapFiles]
snmptt_conf_files = <
# All of these are stateless so the handler script needs to set and clear the service.
# The service entry must have 0 retries set and be volatile.
#
# .1.3.6.1.4.1.15497
#

# powerSupplyStatusChange
# Status: .1.3.6.1.4.1.15497.1.1.1.8.1.2
EVENT powerSupplyStatusChange .1.3.6.1.4.1.15497.1.1.2.0.2 "asyncos" Critical
FORMAT $N trap from $r
EXEC /usr/local/nagios/customplugins/submit_trap $r AsyncOS-Trap_Alert $s 0 "$N: $*"

#
#

Your submit_trap script takes that, and hands it off to Nagios.  You can 
submit through NSCA, or you can create a result file in the checkresult 
directory, or you can submit through the external command pipe.


I do it through NSCA:
# submit_trap
#!/usr/local/bin/bash

PATH=/bin:/usr/bin:/usr/local/bin:/usr/local/nagios/customplugins:/usr/local/nagios/bin
CONFIG=/usr/local/nagios/etc/send_nsca.cfg
NSCA=`hostname`

HOST=$1
SERVICE=$2
STATUS=$3
STATEFUL=$4
MESSAGE=$5

# Map the SNMPTT severity name to a Nagios return code.
case $STATUS in
    "Critical") CODE=2 ;;
    "Warning")  CODE=1 ;;
    "Normal")   CODE=0 ;;
    *)          CODE=3 ;;
esac

printf "%s\t%s\t%s\t%s\n" "$HOST" "$SERVICE" "$CODE" "$MESSAGE" |
    send_nsca -H "$NSCA" -c "$CONFIG"

# Compare the numeric code here; $STATUS holds the severity name, never "0".
if [[ "$STATEFUL" == "0" ]] && [[ "$CODE" != "0" ]]
then
    # Clear Nagios via delayed at now that the volatile ticket's gone through.
    echo "/usr/local/nagios/customplugins/clear.sh $HOST \"$SERVICE\" \"$MESSAGE\"" | at now + 15 minutes
fi
#

...  and clear.sh for clearing stateless alerts.

#
#!/usr/local/bin/bash

PATH=/bin:/usr/bin:/usr/local/bin:/usr/local/nagios/bin:/usr/local/ironport/nagios/bin
HOST=$1
SVC=$2
OUT=$3

if [[ "$HOST" == "" ]] || [[ "$SVC" == "" ]]
then
    echo "Need host, service, optional message."
    exit 3
fi

# Clear it
printf "%b" "$HOST\t$SVC\t0\tWas:$OUT\n" | send_nsca -H `hostname` -c /usr/local/nagios/etc/send_nsca.cfg
#

If you're using the auto-clear bits, your Nagios user will need to be 
able to add items to the at queue; look at your distribution's 
documentation for how that's managed.  This is just one way of getting 
snmp traps working.  Unfortunately, none of the ways that I know of are 
overly straightforward.
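
For reference, here is the line format the scripts above hand to send_nsca, next to the equivalent line for the external command pipe, as a small Python sketch. The function names and the sample host are invented for illustration. The tab-separated host/service/code/output layout is what the printf calls above produce, and PROCESS_SERVICE_CHECK_RESULT is the standard Nagios external command for passive service results.

```python
import time

# SNMPTT severity names mapped to Nagios return codes, as in submit_trap.
SEVERITY_TO_CODE = {"Critical": 2, "Warning": 1, "Normal": 0}
UNKNOWN = 3


def nsca_line(host, service, severity, message):
    """Build the tab-separated service check line that send_nsca reads on stdin."""
    code = SEVERITY_TO_CODE.get(severity, UNKNOWN)
    return "%s\t%s\t%d\t%s\n" % (host, service, code, message)


def external_command(host, service, severity, message, now=None):
    """Build the matching PROCESS_SERVICE_CHECK_RESULT line for the command pipe."""
    code = SEVERITY_TO_CODE.get(severity, UNKNOWN)
    timestamp = int(time.time() if now is None else now)
    return "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s\n" % (
        timestamp, host, service, code, message)


# Hypothetical trap from a host called mail1.
print(repr(nsca_line("mail1", "AsyncOS-Trap_Alert", "Critical",
                     "powerSupplyStatusChange: 2")))
```

Either string can be written as-is: the first to send_nsca's standard input, the second to the command file named by command_file in nagios.cfg.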


Even if this doesn't work for you, it should give enough of an insight 
so that you've got a better idea on what to google for. Good luck.


--
Mike Lindsey


Re: [Nagios-users] Nagios 3.3.1, event brokers, and debug.

2012-02-26 Thread Mike Lindsey
On 2/24/12 2:26 AM, Andreas Ericsson wrote:
> I'd have to send you the new Nagios code and get Sven to help me patch 
> mod_gearman to avoid using threads, but if you want to give it a shot, 
> I'm sure we could have you up and running in notime. 
We're going with something else in the short term, probably updating our 
obsess commands to send data to multiple local servers, and pushing 
freshness checking off the master, onto those local failover nodes.  We 
finally reached the point where, if there was a problem in Europe, 
freshness checking on the master in Nevada could make the master fall 
far enough behind in processing passive checks to cause cascading 
failures.

Since we have a corporate directive to "Go Linux" we're just taking this 
as one among *many* reasons to accelerate our migration plan.

-- 
Mike Lindsey




Re: [Nagios-users] Nagios 3.3.1, event brokers, and debug.

2012-02-23 Thread Mike Lindsey
On 2/23/12 10:50 AM, Sven Nierlein wrote:
> On 2/23/12 19:33, Mike Lindsey wrote:
>> Turns out that's the problem.  I've rebuilt from source and it loads, 
>> now to get our package maintainer to rebuild the package.  And to 
>> figure out why mod_gearman_worker's children keep segfaulting. 
>
> Seems to be freebsd related. A colleague could reproduce that with 
> freebsd 8.
Any advice short of rebuilding my entire infrastructure?

-- 
Mike Lindsey




Re: [Nagios-users] Nagios 3.3.1, event brokers, and debug.

2012-02-23 Thread Mike Lindsey
On 2/23/12 2:16 AM, Sven Nierlein wrote:
> Hi Mike,
>
> Please don't hijack other threads.
Apologies.  Unintentional thread header jacking.
> Make sure you have eventbroker handling compiled in. 
> (--enable-event-broker).
> Also consider using the latest stable 3.2.3 which has been 
> successfully tested with
> Mod-Gearman. I never tried the 3.3.1.

Turns out that's the problem.  I've rebuilt from source and it loads, 
now to get our package maintainer to rebuild the package.  And to figure 
out why mod_gearman_worker's children keep segfaulting.

It *looks* like gearman works fine with 3.3.1.  At the very least I see 
jobs going into the queue.

-- 
Mike Lindsey




[Nagios-users] Nagios 3.3.1, event brokers, and debug.

2012-02-22 Thread Mike Lindsey
I'm trying to test out mod_gearman, but I don't see any message about 
the event broker loading in the main logfile, and enabling debug logging 
just results in a blank debug log file.

From nagios.cfg:
debug_file=/usr/local/nagios/var/nagios.debug
debug_level=66
debug_verbosity=2
max_debug_file_size=1000
event_broker_options=-1
broker_module=/usr/local/nagios/lib/mod_gearman.o config=/usr/local/nagios/etc/gearman.cfg

From nagios.log:
[1329961651] Successfully shutdown... (PID=79938)
[1329961654] Nagios 3.3.1 starting... (PID=81413)
[1329961654] Local time is Wed Feb 22 17:47:34 PST 2012
[1329961654] LOG VERSION: 2.0
[1329961655] Finished daemonizing... (New PID=81414)

nagios.debug is empty.

Any advice?
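
For readers checking the numbers: debug_level is a bitmask, and 66 is 64 + 2. The sketch below decodes it in Python. The bit values are the ones listed in the comments of the sample nagios.cfg shipped with Nagios 3.x as I understand them, so compare them with your own copy; decode_debug_level is just an illustrative helper.

```python
# Debug level bits as documented in the sample nagios.cfg for Nagios 3.x
# (an assumption worth verifying against your installed version).
DEBUG_FLAGS = [
    (1, "functions"),
    (2, "configuration"),
    (4, "process information"),
    (8, "scheduled events"),
    (16, "host/service checks"),
    (32, "notifications"),
    (64, "event broker"),
    (128, "external commands"),
    (256, "commands"),
    (512, "scheduled downtime"),
    (1024, "comments"),
    (2048, "macros"),
]


def decode_debug_level(level):
    """Return the names of the debug categories enabled by a debug_level value."""
    if level == -1:
        return ["everything"]
    return [name for bit, name in DEBUG_FLAGS if level & bit]


print(decode_debug_level(66))
```

So the setting above asks for configuration and event broker debugging only, which is the right pair for chasing a module that does not load.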

-- 
Mike Lindsey




[Nagios-users] DNX dead?

2012-02-14 Thread Mike Lindsey
Is DNX officially a dead project?

Last post on the developer's list is from May of last year - and got no 
response.  Last thread is from two months before that, the last release 
is from two years ago, and the documentation is even older.

-- 
Mike Lindsey




Re: [Nagios-users] Dynamically add/remove hosts on Nagios

2012-02-10 Thread Mike Lindsey

On 2/9/12 5:32 PM, Felipe Cecagno wrote:
> The problem is that I want to add and remove instances dynamically, I
> don't want to manually modify hosts.cfg on the central each time I
> change my infrastructure. So my idea was that when a new instance gets
> up, it will send to Nagios something like (always using NSCA):
>
> "localhost Server UP 0 "

I believe XI has a feature that does some automatic adding of 
hosts/services from passive checks.


To get it working in Core you need to go about it in a different way.  
You could potentially drive it with an event handler.  Say you have an 
"add-host" service on your master Nagios host.  When a new host is added 
to a cluster you trigger a passive result for that check; that check's 
event handler kicks off and adds the configuration (using some generous 
templating) for that host to your master config, pre-caches your object 
cache and restarts.  You could also have a "del-host" service that does 
the reverse if you're feeling brave.


For our environment, I took yet another route.  Our CMDB provides an API 
where I can ask for all our "production updates www" hosts.


So to get auto-updating cluster level monitoring for my environment I 
have a host entry (trimmed for brevity, and such) like:


define host {
    host_name       cluster-updates-www
    alias           cluster-updates-www
    address         cluster-updates-www
    hostgroups      All,cluster,updates,updates-www,www
    check_command   cluster_ping
    _ENVIRONMENT    prod
    _PRODUCT        updates
    _PURPOSE        www
}

It doesn't matter for this, if that hostname is in DNS, as nothing 
actually queries it.  Don't need an ip address either, because nothing 
uses it.


Check command in this instance is:
define command {
    command_name    cluster_ping
    command_line    $USER5$/cluster_check.py --product $ARG1$ --purpose $ARG2$ --script '$USER1$/check_icmp' --args '-H %HOST% -c 1800.0,100% -n 2 -t 2'
}

So the cluster_check.py asks the CMDB for a list of hosts when it first 
runs, then caches that list for an hour.  It pings all the hosts in 
parallel, sums up the stats and does the right thing.  Unfortunately 
this kind of process requires some external query-able source of truth.  
Our CMDB (internally developed, not currently releasable) provides a 
JSON dump on an http port..  If you've got anything you can query or 
parse - even an rsync'd dns zone file you should be able to cobble 
something together that works similarly via bash, perl, or whatever 
works for you.  Auto-updating cloud/cluster monitoring and no 
configuration updates or restart needed.  Same script and methodology 
works for services.
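
For those who can't open the attachment, the general shape of such a check can be sketched in Python as below. This is not the attached cluster_check.py: every name here is hypothetical, fetch stands in for whatever queries your source of truth, the argument list would come from splitting the --args string (for example with shlex.split), and the warning/critical fractions are arbitrary assumptions, since "the right thing" depends on your cluster.

```python
import json
import os
import subprocess
import time
from concurrent.futures import ThreadPoolExecutor

CACHE_TTL = 3600  # seconds; the host list is cached for an hour


def load_hosts(cache_path, fetch, ttl=CACHE_TTL):
    """Return the cached host list, calling fetch() once the cache is older than ttl."""
    try:
        if time.time() - os.path.getmtime(cache_path) < ttl:
            with open(cache_path) as handle:
                return json.load(handle)
    except (OSError, ValueError):
        pass  # missing or unreadable cache: fall through and refetch
    hosts = fetch()
    with open(cache_path, "w") as handle:
        json.dump(hosts, handle)
    return hosts


def run_check(script, args, host):
    """Run one plugin with %HOST% substituted and return its exit code."""
    argv = [script] + [arg.replace("%HOST%", host) for arg in args]
    return subprocess.call(argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)


def summarize(results, warn_fraction=0.25, crit_fraction=0.5):
    """Turn {host: exit_code} into a single Nagios (code, message) pair."""
    total = len(results)
    if total == 0:
        return 3, "UNKNOWN - no hosts returned for this cluster"
    failed = sorted(host for host, code in results.items() if code != 0)
    message = "%d/%d hosts failing" % (len(failed), total)
    if failed:
        message += ": " + ",".join(failed)
    fraction = len(failed) / total
    if fraction >= crit_fraction:
        return 2, "CRITICAL - " + message
    if fraction >= warn_fraction:
        return 1, "WARNING - " + message
    return 0, "OK - " + message


def check_cluster(hosts, script, args, workers=32):
    """Run the plugin against every host in parallel and summarize the outcome."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        codes = list(pool.map(lambda host: run_check(script, args, host), hosts))
    return summarize(dict(zip(hosts, codes)))
```

The thread pool is enough here because each worker just waits on an external plugin process. The script would print the message from check_cluster and exit with the code.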


This doesn't help you, however, if you absolutely must have unique host and 
service entries.


I've attached my script if you want to rip it apart and use it for your 
environment.  You'll have to replace any of the bits that mention 'asdb' 
with code that queries and parses your CMDB, or whatever.  It's a good 
bit of effort, but potentially worth it.. good luck.  (Or, there's XI)


--
Mike Lindsey



cluster_check.py.gz
Description: GNU Zip compressed data

Re: [Nagios-users] nsca old server with new nsca client

2012-02-07 Thread Mike Lindsey
On 2/7/12 12:10 PM, Albert Shih wrote:
> Hi all,
> Is there any way I can use a new (2.9.1) client nsca (send_nsca) with old
> server (2.7.x) ?

2.9.1 shouldn't include any backwards incompatible code.  That said, the 
normal cross-version issues have been with newer server, and older 
client so I'm not sure "old server" has been sufficiently tested with 
"new client"...

Is there a particular reason why you can't upgrade your server side?

-- 
Mike Lindsey




[Nagios-users] Issue with distributed Host checks

2012-02-02 Thread Mike Lindsey
I'm seeing oddities with my host checks.  These are all on 3.2.1, and I 
do not have Host dependencies for the hosts in question.

A worker node will detect a host as being down and send back a soft 
passive result.

In many cases, the master will then immediately perform an active host 
check which is NOT logged.  That host check will result in a hard state 
change, even though host checks are set for 2 retries at 1 minute intervals.

Anyone know what's going on, or do I need to go read the source?

Here's the relevant entries from the master node's nagios.cfg:
$ grep host nagios.cfg
accept_passive_host_checks=1
cached_host_check_horizon=15
check_for_orphaned_hosts=0
check_host_freshness=0
enable_predictive_host_dependency_checks=1
execute_host_checks=0
global_host_event_handler=event_handler
high_host_flap_threshold=20.0
host_check_timeout=30
host_freshness_check_interval=60
host_inter_check_delay_method=s
host_perfdata_file=/usr/local/nagios/var/host-perfdata.dat
host_perfdata_file_mode=a
log_host_retries=1
low_host_flap_threshold=5.0
max_host_check_spread=30
obsess_over_hosts=0
passive_host_checks_are_soft=1
retained_contact_host_attribute_mask=0
retained_host_attribute_mask=0
retained_process_host_attribute_mask=0
translate_passive_host_checks=0
use_aggressive_host_checking=0

And here's an example host object:
define host {
    host_name
    address
    hostgroups                      All,cres,cres-dbss,cres-prod-dbss,cres-prod-dbss.soma,dbss,linux2,soma
    check_command                   check-host-alive
    max_check_attempts              2
    check_interval                  3
    retry_interval                  1
    active_checks_enabled           1
    passive_checks_enabled          1
    check_period                    24x7
    obsess_over_host                1
    check_freshness                 1
    flap_detection_enabled          1
    process_perf_data               0
    retain_status_information       1
    retain_nonstatus_information    0
    contact_groups                  sysops
    notifications_enabled           1
    notification_interval           60
    notification_period             24x7
    notification_options            d,u,r,f
    notes_url                       https:///cacti/graph_view.php?action=preview&host_id=0&graph_template_id=0&filter=
    action_url                      /nagios/cgi-bin/extui.py?host=.com
    _ENVIRONMENT                    prod
    _HARDWARE                       R710
    _LOCATION                       soma
    _OS                             Linux 2.6.18-164.el5 #1 SMP Tue Aug 18 15:51:48 EDT 2009 x86_64
    _PORTFOLIO                      Encryption
    _PRODUCT                        cres
    _PURPOSE                        dbss
    _RACK                           07--11
    _SERIAL                         536QNM1
    _SOURCE                         ASDB/Servers
    _SOURCE_URL                     https:///servers/admin/servers/server/3363/
    __SNMP_COMMUNITY
}

-- 
Mike Lindsey




[Nagios-users] Opinions on load balancing and failover mechanisms

2012-01-25 Thread Mike Lindsey
There are a lot of options: DNX, Merlin, and mod_gearman, to name a few.  
I could read the docs (and have read a good portion of some of them) and 
could implement test environments (and will eventually need to), but 
first I want opinions from people who've done this at large scale.

I need to improve on our load distribution and failover mechanisms.  
Right now worker node outages are handled through freshness checking, 
and master node outages are handled through a load-balanced VIP and some 
fancy cron jobs that kick up a cold spare.

What are the better options for local load distribution and geographic 
master failover?  Which options will better handle thousands of servers 
across a dozen colos, in half a dozen countries, when the goal is that 
no single host (or colo!) going offline can be allowed to have an effect 
on any other subset of the infrastructure?  Which options should I avoid?

Currently running Nagios Core 3.2.1 with NSCA 2.9 on mostly FreeBSD 
systems.  Soon that should be Core 3.3, with XI on top, plus whatever 
load distribution mechanism wins the dog fight.

-- 
Mike Lindsey




Re: [Nagios-users] how to avoid host dependency with check_cluster

2012-01-13 Thread Mike Lindsey
On 1/13/12 2:44 AM, Morty wrote:
> I've gotten check_cluster working to monitor a service cluster per the
> docs.  It works at a basic level.  Thanks!
>
> Problem: service definitions seem to require an associated host or
> hostgroup.  I don't want to tie the check_cluster to an individual
> host, because if that host goes down, a different host in the cluster
> could still be up.  But I also don't want it tied to every host in the
> cluster because then I could get duplicate notifications.
>
> What am I missing?
>
The easy solution: add a host entry called "WWW-Cluster1", or whatever 
you want it to be called.  Set the IP address to 127.0.0.1, and your 
host up/down check to some always-alive check ("echo OK" will do fine).

Then attach your cluster services to that.  You could even have your 
host availability check be a check_cluster command that pings all your 
hosts.

-- 
Mike Lindsey




Re: [Nagios-users] How to route data from multiple nagios core nodes to a nagiosxi node?

2011-11-10 Thread Mike Lindsey
On 11/8/11 6:06 PM, Benjamin wrote:
> I have about ten nagios core machines that I currently monitor 
> collectively using MNTOS. Is there a way to feed the data from my 
> nagios core machines to a nagiosXI machine that makes it possible to 
> use the nagiosXI features like the visualization/dashboards/reporting 
> for the services/hosts being monitored by the nagios core nodes? I 
> basically want to replace MNTOS w/ a nagiosXI machine — so that I can 
> utilize its features as a dashboard and reporting node for all the 
> data I receive at each nagios core server.
>
> Basically, I want to set up a hub and spoke w/ a nagiosXI machine at 
> the hub and all my nagios core boxes as spokes.
> I looked into DNX but this looks like it distributes the checks in a 
> different way. Any ideas?

I'm not as familiar with XI as I'd like, but based on the demos and the 
docs, I think NSCA will do just fine. I believe the configuration would 
be exactly the same as with a normal Nagios Core install.

-- 
Mike Lindsey




Re: [Nagios-users] nagios 2.9 doesn't send emails anymore.

2011-10-27 Thread Mike Lindsey
On 10/27/11 3:15 AM, Mario Garcia Ortiz wrote:
> Hello
> we have this strange issue on a nagios server running version 2.9.
>
> all of a sudden, we stopped receiving notifications
> the web interface shows that the notifications are sent to all the
> contacts but nothing is actually sent. there's nothing on the syslog
> of the server; if we send manually a mail via the command line that is
> sent but nothing that is sent by nagios process itself.
>
> what could be the problem here.
> thank you
>
Many things could be the problem here; only some of them would be "Nagios."

If you were running 3, I'd say turn on debug logging.  Since you're not, 
this gets hard (or you could just upgrade to 3, see if the problem 
disappears, and if it doesn't, turn on that debug log).

What does your notification command config look like?

What happens if you add:
"""
 >/tmp/notif_log.out 2>&1
"""
to the end of it?  That should trap the command's stdout and stderr, and 
save them in that file.

-- 
Mike Lindsey




Re: [Nagios-users] Question about Nagios Features

2011-10-27 Thread Mike Lindsey
nalysis - Be able to provide analysis on traffic 
> flow as based on NetFlow, SFlow and/or JFlow
> 29. Netflow, SFlow, JFlow Support - Be able to support 
> devices/elements running IP Flows and display statistics as based on 
> information
> 30. Netflow Reporting - Be able to provide Flow reports
> 31. Application Traffic Information - Be able to provide statistics as 
> based on application traffic information
> 32. Demographics - Be able to provide information on top users and top 
> applications
> 33. VoIP QoS Measurement - Be able to provide statistics as based on 
> VoIP QoS measured values
> 34. Alerts, Graphs and Reporting - Be able to provide alerts, graphs 
> and reports as based on statistics gathered
> 35. VoIP Infrastructure Monitoring - Be able to monitor VoIP network 
> infrastructure and provide information on network health for support 
> of VoIP service
>

-- 
Mike Lindsey




Re: [Nagios-users] escalations question

2011-10-26 Thread Mike Lindsey
On 10/26/11 10:27 AM, Paul M. Dubuc wrote:
> Michael Barrett wrote:
>> Is there any way to get that sort of setup working btw?
> You might re-think why you want to do this.  If there has been a problem at
> the warning level for 2 or more notification intervals without it being
> acknowledged (which stops notifications) or fixed, maybe your secondary
> contact should be notified anyway when the critical threshold is exceeded.
When you have multiple levels of management in your escalation trees, 
this particular kind of behavior is to be avoided at all cost.  :)
> If you really want it to work the way you describe then the best solution I
> can think of is to have 2 separate services with different contacts.  One that
> issues only warnings and the other only critical problems.  But then you've
> doubled the number of checks you are doing for the same problem.
>
There's a split-tier notification patch that seems to handle this pretty 
well.  Standard escalation configuration stanzas work the same, but a 
few new ones are added that allow discrete escalations based on 
notification number AND type.  Barring that, you'll have to handle it 
as some monkey-patched afterthought in your notification scripts.

You can search the forums (or Google) for the patch.  If that fails, I 
can probably find it later.  I believe it was compatible with Nagios 3.2.

-- 
Mike Lindsey




Re: [Nagios-users] Notifications for Services that are "Up"

2011-09-02 Thread Mike Lindsey

On 9/2/11 7:59 AM, Michael Loiselle wrote:
> I am currently running Nagios 3.3.1 on Ubuntu 10.04 and everything is 
> working great.  I am monitoring 30 Windows servers with NSClient++, 
> and that is working as well.  Is it possible to receive a notification 
> for a specific service that is in the "up" position?  In other words, 
> I would like to get a notification every 4 hours, confirming that a 
> service is actually running.  If it stops on Friday evening, I do not 
> want to wait until Monday morning to find out it is not running.  I 
> would personally rather receive a message every four hours.  
> Notifications are currently set up and working flawlessly, so I just 
> need to know what I need to change in the config files to get this to 
> work.  Any help is appreciated.
Not entirely sure what you're trying to solve here?  Generally it is 
good if you can get to a place where you trust your monitoring system.  
It should be telling you if something has broken, but if something 
continues to run well, the monitoring system should shut up and not 
bother you.


If it stops on Friday evening, you should get an email immediately (or 
as soon as Nagios notices it...)  If by "it stops" you mean "Nagios 
stops" then that's a separate problem - I have a secondary system that 
ONLY monitors my primary Nagios infrastructure.  If the primary system 
fails, the secondary system emails me - well, pages me, my secondary, 
all of ProdOps, etc.


TL;DR:  No easy way to have an OK service automatically email.  Nagios 
makes noise when things are broken, not when they're working.


--
Mike Lindsey


[Nagios-users] extraneous data in status file, for custom macro

2011-08-30 Thread Mike Lindsey
I'm seeing odd data in my status file, for custom macros:

 _ENVIRONMENT=0;prod
 _PORTFOLIO=0;Internal
 _PRODUCT=0;monitoring
 _PURPOSE=0;extnagios
 _SOURCE=0;ASDB/Rolemaps

Where are the 0; bits coming from, and what do they signify?
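
For anyone who needs to read these lines from a script in the meantime, here is a minimal sketch. It assumes only what the excerpt above shows: a name, an equals sign, a leading number, a semicolon, then the value. What the number signifies is not established in this post, so the helper (a hypothetical name) just hands it back as a separate field.

```python
def parse_custom_var(line):
    """Split a status-file custom variable line into (name, flag, value).

    Assumption: the layout is NAME=<number>;<value>, as in the excerpt.
    The meaning of <number> is not interpreted here.
    """
    name, _, rest = line.strip().partition("=")
    flag, sep, value = rest.partition(";")
    if not sep or not flag.isdigit():
        # No "N;" prefix: treat the whole right-hand side as the value.
        return name, None, rest
    return name, int(flag), value
```

Calling parse_custom_var(" _ENVIRONMENT=0;prod") gives the name, the number and the value as three separate items, so a report can use the value without the prefix.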

-- 
Mike Lindsey




Re: [Nagios-users] Having users view only their hosts/services

2011-08-18 Thread Mike Lindsey

On 8/17/11 10:55 AM, Edwin Zoeller wrote:


> I know that this question has been posted many times before but what I 
> am looking for is where I went wrong and if someone has an easy method.
>
> I have set up various users to view their information in Nagios and all 
> is good. But for some reason, when I set up users now, it lets them 
> log in but displays the error message "not authorized to view..."
>
> I have no clue what I have done wrong. Any help would be great.


Your users can only see services and hosts for which they are contacts.  
This means that your login names must be the same as your contact names.


http://nagios.sourceforge.net/docs/3_0/cgiauth.html

--
Mike Lindsey


Re: [Nagios-users] Nagios authentication thru LDAP.

2011-08-10 Thread Mike Lindsey
On 8/10/11 9:23 AM, Robert J Molerio wrote:
> Can anyone indicate how this can be done?
> We would like users to log on to Nagios via LDAP.
> I think we need to configure the Apache server within Nagios to be 
> able to do this but we're not sure.

Depending on your version of Apache, this ranges from a pain in the rear 
to nigh impossible.  It's doable, but I've often found it easier and 
more stable to have a cron job that exports the LDAP users to an 
htpasswd file.  That requires fewer changes to your Apache installation, and 
doesn't lock your users out of your Nagios install if LDAP fails.
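
A rough sketch of the export half of such a cron job, in Python. It assumes you already have (login, userPassword) pairs from your directory by whatever means you prefer; the LDAP query itself is left out, and all function names here are made up for illustration. It also assumes the hashes are stored in the {SHA} scheme, which Apache basic auth accepts unchanged. Anything else is skipped rather than guessed at.

```python
import base64
import hashlib
import os


def sha_entry(password):
    """Return an Apache-style {SHA} hash (demo helper for test data only)."""
    digest = hashlib.sha1(password.encode("utf-8")).digest()
    return "{SHA}" + base64.b64encode(digest).decode("ascii")


def htpasswd_lines(users):
    """Build htpasswd lines from (login, hashed_password) pairs."""
    lines = []
    for login, hashed in sorted(users):
        # A colon in the login would corrupt the file; non-{SHA} hashes
        # are not usable by Apache as-is, so both cases are dropped.
        if ":" in login or not hashed.startswith("{SHA}"):
            continue
        lines.append(f"{login}:{hashed}")
    return lines


def write_htpasswd(path, lines):
    """Write via a temp file and rename, so Apache never reads a partial file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as handle:
        handle.write("\n".join(lines) + "\n")
    os.replace(tmp, path)
```

Because the file is only replaced after a successful export, a failed LDAP lookup leaves the previous htpasswd file in place, which is the whole point of doing it this way.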

-- 
Mike Lindsey




[Nagios-users] Eternally pending, stale checks

2011-08-04 Thread Mike Lindsey
I deployed new monitoring today, and despite a few restarts and many 
hours of waiting, 185/220 services are still pending.

It's a 3.2.1 environment (yes, yes, upgrade, yes) with one master and 
multiple pollers.  All this new monitoring is on one polling host.  
Active checks are disabled on the master, passive checks are submitted 
via NSCA.  Freshness threshold is set to 20 minutes for checks with a 5 
minute interval.

The polling host executes the checks, has the right data in the 
status.log, but the master never receives some of the check data.

The data it does receive is not consistently grouped.  Service A on one 
host will submit consistently, but the same service on a different host 
will fail to submit.  Every 20 minutes, the master throws messages 
about the checks being stale and needing to force an immediate check, 
but that never seems to make its way through.

My next step, I suppose, will be enabling debug mode on the master, but 
if history is any indication, that will cause the problem to stop 
happening - in addition to it being a pain to parse through debug logs 
for a 10k service environment.  If anyone has ideas on what else to 
check, I'm all ears.

-- 
Mike Lindsey




[Nagios-users] Attempting to execute bash script stored in Service Meta variable, referenced in command_line

2011-07-07 Thread Mike Lindsey
Uncleaned macro.  Running output (237): 'bash -c 'mailflow_rate.py -H 
xxhostxx -u user -p  -s "1200" -d "db" -t "table" -w 
"$NAGIOS__SERVICEWARN" -c "$NAGIOS__SERVICECRIT'
Not currently in macro.  Running output (244): 'bash -c 
'mailflow_rate.py -H xxhostxx -u user -p  -s "1200" -d "db" -t 
"table" -w "$NAGIOS__SERVICEWARN" -c "$NAGIOS__SERVICECRIT" 2>&1''
Done.  Final output: 'bash -c 'mailflow_rate.py -H xxhostxx -u user -p 
 -s "1200" -d "db" -t "table" -w "$NAGIOS__SERVICEWARN" -c 
"$NAGIOS__SERVICECRIT" 2>&1''

Short Output: Usage: mailflow_rate.py [options]
Long Output:  \nmailflow_rate.py: error: option -w: invalid integer 
value: '$((/path/_slice_threshold.sh WARN))'\n



So, the final command has the right environment macros, and it does pull 
the string I expect out of the first service meta variable, but bash 
doesn't do the command substitution, just hands the unsubstituted string 
off to the poller script.

Any ideas on how I can get this working right?

--
Mike Lindsey




Re: [Nagios-users] Expanding Custom Variables

2011-06-29 Thread Mike Lindsey
On 6/29/11 12:18 PM, Stringham, Steven wrote:
> I am trying to monitor multiple volumes on a NetApp system. The format
> of the command requires a hostname:volumename format. I want to reduce
> my commands/service definitions to a minimum.  My initial thought was to
> have a generic service definition, that gets more specific with a sub
> definition.  When the command is run, it seems like it is not passing
> the custom variable, but rather leaving a single $ behind where the
> variable ought to be.

I'm not sure that custom macros are evaluated at the command level?  
Perhaps set your command_line to pull in the variable from the service:

define service {
 name NA_SnapMirror
 check_command netapp_snapmirror!$_SERVICENAVOLUME$
 use  GenericService_Core
 normal_check_interval 1000
 max_check_attempts 300
 register 0
 contact_groups CoreServers
}

define service {
 use NA_SnapMirror
 _navolume myvolumename
 service_description SnapMirror_groups
 hosts myhostname
}

define command {
 command_name  netapp_snapmirror
 command_line $USER1$/check_naf.py -H $HOSTADDRESS$ -C $USER8$ snapmirror,$HOSTNAME$:$ARG1$,$USER25$
}

...
Alternately, if you have enable_environment_macros=1 in nagios.cfg, you 
could instead put "$NAGIOS__SERVICENAVOLUME" and pass the reference to 
the script.

One of the two should work for you.  If not, then I'd recommend 
restarting in debug mode, debug_level=18 will get you debug information 
about both the configuration load process, and the service check 
execution, so you should be able to figure out the problem - just fire 
it up in a reduced config set, so you only have this in there and don't 
get spammed by normal operations.
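
If you go the environment route, the plugin side could look roughly like the sketch below. It assumes enable_environment_macros=1 and that the variable reaches the script under the upper-cased name NAGIOS__SERVICENAVOLUME; both function names are invented for this example and are not part of check_naf.py.

```python
import os


def volume_from_env(env=None, name="NAGIOS__SERVICENAVOLUME"):
    """Return the custom service variable exported by Nagios, or None."""
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    # An unexpanded macro tends to show up as a bare "$" or "$NAME$".
    if not value or value.startswith("$"):
        return None
    return value


def snapmirror_target(hostname, env=None):
    """Build the hostname:volumename string, or None if the volume is unset."""
    volume = volume_from_env(env)
    if volume is None:
        return None
    return f"{hostname}:{volume}"
```

Returning None instead of a half-built string makes it easy for the plugin to exit UNKNOWN with a clear message when the variable never arrived, which is the symptom described above.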

What version, btw?

--
Mike Lindsey



[Nagios-users] Host acknowledgments not respecting "persistent"

2011-06-16 Thread Mike Lindsey
Host acks are being cleared on restart with 3.2.1.  Checking the 
changelog, I don't see any fixes for this.

The command ending up in the log is:
[1308256491] EXTERNAL COMMAND: ACKNOWLEDGE_HOST_PROBLEM;xxx;2;0;1;Mike Lindsey;testing restarts

It's interesting that the sticky bit is being set to '2'...  But from 
looking at the code, it seems like it just tests for boolean, so that 
should be fine.  Persistent is set to 1, but on a restart the host is no 
longer acked.

Has this been silently fixed in 3.2.2, or 3.2.3?  Is it on the roadmap 
for 3.2.4?

-- 
Mike Lindsey




[Nagios-users] CMDB backend and feeder mechanism?

2011-05-17 Thread Mike Lindsey
I'm interested in hearing what kind of CMDB people who also use Nagios 
tend to use.

How do you track and maintain your hosts, and how do you map that into your 
Nagios configuration?

I'm particularly interested in those with especially complex environments.

We have an internally developed CMDB, as well as an internally developed 
Nagios configuration management tool, that I'm working on getting final 
approval to release as open source.  I'd love a chance to pick some 
brains about what works, and particularly what doesn't work.

-- 
Mike Lindsey




Re: [Nagios-users] acknowledge triggers a script

2011-05-10 Thread Mike Lindsey
On 5/10/11 12:20 PM, dave stern - e-mail.pluribus.unum wrote:
> We have an interesting need. When a particular service goes red on our
> Nagios 3.2.1 server, we'd like to be able to click on "Acknowledge this
> service problem" and have that activate a local script. Anyone have any
> idea how this can be accomplished?
Add a secondary CGI (bash script, perl, python, etc) and link to that using the 
action_url config variable for your hosts/services.  Then you can have a link 
there that submits the ack command as well as executes your secondary command.

If you don't want that intermediary step, you're going to need to update the CGI 
source.
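
The core of such a secondary CGI might look like this sketch. The ACKNOWLEDGE_SVC_PROBLEM field order is the standard external command format; everything else (function names, the hook callable standing in for your local script, and the command file path you pass in, typically something like /usr/local/nagios/var/rw/nagios.cmd on a source install) is an assumption for illustration. Form parsing and authentication are left out.

```python
import time


def _clean(text):
    # Semicolons and newlines would break the command's field layout.
    return text.replace(";", " ").replace("\n", " ")


def ack_service_command(host, service, author, comment, now=None,
                        sticky=2, notify=1, persistent=1):
    """Format one ACKNOWLEDGE_SVC_PROBLEM external command line."""
    now = int(time.time() if now is None else now)
    fields = ["ACKNOWLEDGE_SVC_PROBLEM", _clean(host), _clean(service),
              str(sticky), str(notify), str(persistent),
              _clean(author), _clean(comment)]
    return "[%d] %s\n" % (now, ";".join(fields))


def acknowledge_and_run(command_file, host, service, author, comment, hook):
    """Submit the ack to the Nagios command file, then run the local hook."""
    line = ack_service_command(host, service, author, comment)
    with open(command_file, "w") as handle:
        handle.write(line)
    return hook(host, service)
```

Submitting the ack first and running the hook second means the acknowledgement still lands even if the local script turns out to be slow or fails.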

-- 
Mike Lindsey




Re: [Nagios-users] enable SNMP trap handling in Nagios

2011-05-10 Thread Mike Lindsey

On 5/10/11 9:46 AM, khurram aziz wrote:

> Hi,
>
> i am using Nagios 3.2.3 & want to enable SNMP Trap Handling so that I 
> can check uptime of my servers ( snmp service has already been enabled 
> on the servers).
>
> can sum1 help me with the configuration.

Well, first off you don't need snmp traps to check uptime.

$ /usr/local/nagios/libexec/check_snmp -H localhost -o .1.3.6.1.2.1.1.3.0 -C xx

SNMP OK - Timeticks: (362041566) 41 days, 21:40:15.66 |
$

If all you want to see is an uptime counter, add a service that does 
that.  Replace localhost with $HOSTADDRESS$, and replace xx with 
your snmp community.  Unfortunately, check_snmp doesn't seem to support 
having warning or critical thresholds, so unless snmpd is down, that 
will always return ok.  You can use snmpget to get the raw timeticks:


$ snmpget -Ovt -v2c -c x localhost .1.3.6.1.2.1.1.3.0
36207797
$

If you want a critical alert every time a box has rebooted, write a 
shell script that calls that snmpget command, passing in the host 
address and snmp community via the command line.  Of course, that will 
throw an Unknown while the box is actually down (snmp can't tell if the 
host is down, or if you've passed the wrong snmp community.)
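
The same idea works in Python if you prefer it to shell. This is only a sketch: it runs the snmpget command shown above, treats the raw timeticks as hundredths of a second, and goes critical when uptime is below a threshold. The one-hour default and the function names are my own choices for the example, not anything Nagios ships.

```python
import subprocess

UPTIME_OID = ".1.3.6.1.2.1.1.3.0"  # same OID as in the commands above


def evaluate_uptime(timeticks, min_seconds=3600):
    """Map raw timeticks (1/100 s) to a Nagios state code and message."""
    uptime = timeticks // 100
    if uptime < min_seconds:
        return 2, "CRITICAL - up only %d s, host rebooted recently" % uptime
    return 0, "OK - up %d s" % uptime


def check_reboot(host, community, min_seconds=3600):
    """Run snmpget and classify the result; UNKNOWN if no usable answer."""
    cmd = ["snmpget", "-Ovt", "-v2c", "-c", community, host, UPTIME_OID]
    try:
        out = subprocess.run(cmd, capture_output=True, text=True, timeout=10)
        ticks = int(out.stdout.strip())
    except (OSError, ValueError, subprocess.TimeoutExpired):
        return 3, "UNKNOWN - no usable answer from snmpget"
    return evaluate_uptime(ticks, min_seconds)
```

A wrapper would print the message and exit with the state code, taking the host address and community from the command line. Keeping the threshold logic in its own function makes it easy to check without a live snmpd.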


If what you really want is to know when your box is down, use check_ssh:
$ /usr/local/nagios/libexec/check_ssh localhost
SSH OK - OpenSSH_5.1p1 FreeBSD-20080901 (protocol 2.0)
$

That will throw a critical any time it can't connect, or if it can 
connect but the ssh version string isn't found.


If you still want to use snmp traps, here's a link to some lovely 
documentation:


http://snmptt.sourceforge.net/docs/snmptt.shtml

There's even a section on integrating with Nagios, though I suggest you 
get some coffee and a snack and read the whole page.


--
Mike Lindsey


Re: [Nagios-users] Service failover dependency

2011-04-07 Thread Mike Lindsey
On 4/7/11 2:45 AM, Andrey Mitroshin wrote:
> I'm afraid I did not explain my problem clearly.
>
> So, I've got 2 servers.
> serverA - primary (10.0.0.1)
> serverB - backup (10.1.0.1)
>
> There is some apache vhost named www.site.com configured on both of them.
> My failover is supposed to work as follows.
>
> Usually serverA is up and www.site.com resolves to 10.0.0.1.
> When serverA.com goes down, nagios executes the event handler, updates the A
> record, and www.site.com points to 10.1.0.1 (serverB).
>
> The problem arises when both servers are down. The event handler updates
> the A record of www.site.com, but serverB is down as well.
>
> My goal is to avoid executing the event handler when serverB is down.
> And the question is how to configure such a behaviour in nagios.
What you need here is a smarter event handler.  In the event handler 
script, have it test serverB.  If serverB is up, update the A record; 
otherwise just exit (and maybe update a logfile somewhere).
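
A sketch of that logic in Python, under a few assumptions: the handler is passed the state and state type as arguments (for example via $HOSTSTATE$ and $HOSTSTATETYPE$), "serverB is up" is judged by a TCP connect to port 80, and update_dns is a placeholder for whatever actually changes your A record. None of these names come from Nagios itself.

```python
import socket

PROBLEM_STATES = {"DOWN", "CRITICAL"}


def backup_is_up(address, port=80, timeout=5.0):
    """Return True if a TCP connection to the backup server succeeds."""
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:
        return False


def handle_event(state, state_type, probe=backup_is_up,
                 update_dns=None, backup="10.1.0.1"):
    """Fail over only on a confirmed problem, and only if the backup answers."""
    if state not in PROBLEM_STATES or state_type != "HARD":
        return "ignored"
    if not probe(backup):
        return "skipped: backup down"  # a good place to write a log line
    if update_dns is not None:
        update_dns(backup)  # hypothetical hook that repoints the A record
    return "failed over"
```

Passing the probe and the DNS update in as arguments keeps the decision testable on its own, without touching the network or the zone.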

-- 
Mike Lindsey




Re: [Nagios-users] Check_APC_PDU Command Definition

2011-04-05 Thread Mike Lindsey
Often, when you're getting an error and the only result you see is 
(null), what is happening is that your check script is printing to 
stderr.  It might be that you have perl in your path, but the perl 
script's #! line doesn't declare the full path to perl, or there's an 
access error of some sort.


But it's easy to figure out what's going on.  Simply change your 
command_line to:
command_line $USER1$/check_apc_pdu.pl -H $HOSTADDRESS$ -C public 2>&1


That will redirect standard error to standard out.  Next time Nagios 
runs the script it will capture the full output of the script and you 
should see what the issue is, right in your Nagios UI.
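
To see the effect outside of Nagios, here is a small Python demonstration. It is not how Nagios runs plugins internally, just an illustration: a stand-in "plugin" writes its error to stderr only, and the helper (a made-up name) returns what a reader that only looks at stdout would get, with and without merging the two streams.

```python
import subprocess
import sys

# Stand-in for a failing plugin: it writes only to stderr and exits 2.
FAKE_PLUGIN = [sys.executable, "-c",
               "import sys; sys.stderr.write('perl: not found\\n'); sys.exit(2)"]


def run_plugin(cmd, merge_stderr):
    """Run cmd and return (exit code, text seen on stdout)."""
    result = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        # STDOUT here is the equivalent of appending 2>&1 to the command.
        stderr=subprocess.STDOUT if merge_stderr else subprocess.DEVNULL,
        text=True,
    )
    return result.returncode, result.stdout.strip()
```

Without the merge the captured text is empty even though the exit code says something went wrong; with it, the actual error message comes through.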


On Sun, Mar 27, 2011 at 2:45 PM, Peter Roddan 
<peter.rod...@sbsworldwide.com> wrote:
> If I log onto the nagios server as the nagios user, and run the 
> command from the libexec folder (check_apc_pdu --H  -C 
> public) I get the response :
>
> "OK: All Outlets ok. | load=25"
>
> I have put the following command definition in :
>
> # 'check_apc_pdu' command definition
> define command{
>         command_name    check_apc_pdu
>         command_line    $USER1$/check_apc_pdu.pl -H $HOSTADDRESS$ -C public
>         }
>
> And defined the following service
>
> define service{
>         use                     generic-service  ; Inherit values from a template
>         hostgroup_name          apc              ; The name of the host the service is associated with
>         service_description     check_apc        ; The service description
>         check_command           check_apc_pdu    ; The command used to monitor the service
>         normal_check_interval   5                ; Check the service every 5 minutes under normal conditions
>         retry_check_interval    1                ; Re-check the service every minute until its final/hard state is determined
>         }
>
> However, my APC PDUs report an error for this service, with a status 
> information of "(Null)"
>
> I'd be grateful for anyone who could point me in the right direction 
> of where I'm going wrong.





--
Mike Lindsey


[Nagios-users] Long running notification script

2011-03-17 Thread Mike Lindsey
I have a notification command that typically takes longer to run than 
my notification timeout.  I don't particularly care whether Nagios gets 
a valid return code back, so I set the main script to fork twice, with 
the initial process printing 'OK' and exiting with a return code of 0.  
The child process also exits immediately with a return code of 0, while 
the grandchild hangs around to do some heavy lifting.

I was hoping that the double-fork would keep Nagios from blocking on the 
process, but the debug logs are still showing:
[1300401208.452280] [032.1] [pid=55343] Adding normal contacts for service to notification list.
[1300401239.455867] [032.0] [pid=55343] 1 contacts were notified.  Next possible notification time: Fri Mar 18 03:33:28 2011

I was expecting the '1 contacts were notified' line to appear pretty much 
immediately, not 31 seconds later.

Any ideas to get around this, other than writing out a spool file and 
having a secondary daemon handle the heavy lifting?
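
One thing that may be worth checking (an assumption, not something confirmed in this thread): if Nagios reads the command's output through a pipe, it may keep waiting until every process holding that pipe has closed it, so a grandchild that inherits stdout and stderr could keep Nagios blocked even after the parent has exited. A minimal sketch that detaches the worker from the inherited descriptors; heavy_lifting is a hypothetical stand-in for the real work:

```shell
#!/bin/bash
# Hypothetical stand-in for the slow part of the notification.
heavy_lifting() {
    sleep 3
}

# Start the worker in the background with stdin, stdout and stderr pointed
# at /dev/null, so it holds no descriptor inherited from the caller, then
# report success immediately.
notify_detached() {
    ( heavy_lifting </dev/null >/dev/null 2>&1 & )
    echo "OK"
    return 0
}
```

Calling notify_detached as the last step of the notification script should let the caller see end-of-file on the pipe as soon as 'OK' is printed. Whether that removes the delay depends on how your Nagios version waits for the command.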

-- 
Mike Lindsey




Re: [Nagios-users] ServiceGroup Best Practices Question

2011-03-10 Thread Mike Lindsey

On 3/10/11 12:25 PM, David Harbaugh wrote:


I'm new to Nagios.  Running Nagios 3.2.3.

I want to start using Service Groups, but I'm not sure of the best 
place to put the service group definitions.


What is making me question the location is eventually I will want to 
create a service group that contains services hosted on both Linux and 
Windows machines, so I'm thinking of creating a new config file to 
hold the service groups, then in nagios.cfg use cfg_file= to load it 
in after the Windows and Linux machines are loaded.


Where do you create service groups?

I put all my service groups in 'service_groups.cfg'.  The order in which 
you specify them in nagios.cfg does not matter.


--
Mike Lindsey


Re: [Nagios-users] nagios server redundancy

2011-02-11 Thread Mike Lindsey
On 2/11/11 10:26 AM, Morty wrote:
> I'm looking to implement redundant nagios servers, with the backup
> server in a different location than the prime server.  This is nagios
> 3.2.3, with the default web interface.  I'm synchronizing
> configurations by rsyncing /usr/local/nagios/etc/ between systems.
> I'm doing active/active (i.e. I want the backup server monitoring at
> the same time as the prime server.)  So far so good.
>
> Problem: acknowledgements on the prime are not being synced to the
> backup.
>
> Is there a (clean) way to sync the prime's acknowledgements to the
> backup, as well?  I'm tempted to shut down the backup, rsync the
> prime's var directory to the backup, and then bring the backup back
> online.  But the docs have various warnings about not messing with the
> var files, so figured I'd ask about possible hidden gotchas.
>
> I've read http://nagios.sourceforge.net/docs/3_0/redundancy.html, but
> scenario one doesn't discuss syncing acknowledgements, and scenario 2
> is active/passive.
What I end up doing with my backup master is leave it off, with frequent 
rsyncs of both config and the status files in var.

Both the active master and the backup master are sitting behind a load 
balanced vip, with the nsca and http/https ports managed by the load 
balancer.  There's a cronjob running on the backup master that, if it 
determines an error on the active master, starts up nsca, nagios, and 
apache.  That causes the vip to fail over to the backup master, giving 
automatic recovery with no more than five minutes of downtime (the 
frequency of the cronjob).

The active master does not have apache, nsca, or nagios configured to 
start on boot, instead those are also managed by a cronjob that does a 
check of the backup master.  If the backup master is running 
apache/nagios/nsca, then the active master doesn't start up (and if 
they're already running, say from an intermittent error, they shut down) 
and the rsyncs also don't happen.  This allows me to do automatic 
failover, and manual fail-back, after whatever issue triggered the 
failover has been verified and resolved.
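
The cronjob logic described above could be sketched roughly as follows. This is only an illustration under assumptions: the health-check command and the service start commands are placeholders, and the function names are made up for this example.

```shell
#!/bin/bash
# Run the given health-check command; print "start" when it fails (the peer
# looks down, so local services should be started) and "idle" otherwise.
decide_failover() {
    if "$@" >/dev/null 2>&1; then
        echo "idle"
    else
        echo "start"
    fi
}

# Hypothetical cron entry point: the check command and the service start
# commands are placeholders for whatever your environment uses.
run_failover_check() {
    if [ "$(decide_failover "$@")" = "start" ]; then
        echo "peer check failed, starting nsca, nagios and apache"
        # INSERT SERVICE START COMMANDS HERE
    fi
}
```

The check is passed in as a command so that the same wrapper can be used on both masters with different tests.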

You cannot - to the best of my knowledge - sync acknowledgments to a 
backup server while it's actively running, unless you want to write 
something that checks for new acks and dumps them into the command 
pipe.  So, if you want to maintain acks and downtime, you'll need to 
have your backup disabled for the syncs.
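
For anyone who does want to try the command pipe route, a sketch of formatting an acknowledgement as an external command. The field order follows the ACKNOWLEDGE_SVC_PROBLEM external command as I understand it; the fixed sticky/notify/persistent values (2;0;1) and the command file path in the usage line are assumptions to adjust for your install.

```shell
#!/bin/bash
# Print one ACKNOWLEDGE_SVC_PROBLEM external command line.
# Arguments: timestamp, host name, service description, author, comment.
# sticky=2, notify=0, persistent=1 are illustrative choices, not requirements.
format_svc_ack() {
    local ts="$1" host="$2" svc="$3" author="$4" comment="$5"
    printf '[%s] ACKNOWLEDGE_SVC_PROBLEM;%s;%s;2;0;1;%s;%s\n' \
        "$ts" "$host" "$svc" "$author" "$comment"
}
```

Usage would be along the lines of:
format_svc_ack "`date +%s`" web01 HTTP mike "synced from primary" > /usr/local/nagios/var/rw/nagios.cmd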

-- 
Mike Lindsey




Re: [Nagios-users] Nagios storing data into PostgreSQL?

2011-02-08 Thread Mike Lindsey
Another potential option is setting up a 'postgres' contact with a 
custom notification command.  Pass the data you want to store to the 
notification command, and let that dump the data to the database.


This can potentially be as simple as a bash script that takes the input, 
builds a SQL statement and echoes it to the postgresql client binary.
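
A rough sketch of such a script, under assumptions: the table name, column names and database name are hypothetical, and the quoting only doubles single quotes, so treat it as a starting point rather than hardened code.

```shell
#!/bin/bash
# Double any single quotes so the value is safe inside a SQL string literal.
sql_quote() {
    local q="'"
    printf '%s' "${1//$q/$q$q}"
}

# Build one INSERT statement from host, state and plugin output.
# The table and column names are hypothetical.
build_insert() {
    printf "INSERT INTO notifications (host, state, output) VALUES ('%s', '%s', '%s');\n" \
        "$(sql_quote "$1")" "$(sql_quote "$2")" "$(sql_quote "$3")"
}
```

The notification command would then do something like:
build_insert "$NAGIOS_HOSTNAME" "$NAGIOS_HOSTSTATE" "$NAGIOS_HOSTOUTPUT" | psql -d nagios_events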


On 2/8/11 3:13 PM, Michael Friedrich wrote:

On 07.02.2011 22:49, larry johnson wrote:

Hello, i am newbie to linux/Nagios and need help to clear some doubts.
I wander whether is possible to make Nagios writing notification 
(host up/host down, for example) into PostgreSQL database?


if you write your own NEB broker module, and put that onto libpq or 
similar, the core will be able to, sure. Or you'll have a look at 
Icinga IDOUtils which support Postgresql quite a while now.


I found NDOUtils, but this addon does not suit me because i don't use 
MySQL.


Well there aren't that much alternatives to that. Merlin supports 
MySQL and Oracle (in development on git). I'm not sure if the Centreon 
Broker is already released which *should* support more RDBMS.


But for notifications only, why not using event handlers? then you 
could call scripts putting data into your rdbms the preferred way.

http://nagios.sourceforge.net/docs/3_0/eventhandlers.html

kind regards,
Michael

Also found that this kind of storage is suported under Nagios 1.x, 
but what about 3.x?

I run Nagios 3.2.3 (with 1.4.15 plugins) on openSUSE 11.3.

Regards.

Mike Lindsey


Re: [Nagios-users] Check behavior during the notification event

2011-01-06 Thread Mike Lindsey
On 1/5/11 11:27 PM, Yu Watanabe wrote:
> Thank you for the reply.
>
> I understood that notification events will hang up the normal service check 
> events.
>
> I was bit curious about your comment.
>
>> A lot of people end up writing external notification handlers to take the 
>> load off of Nagios so the scheduled checks can continue whilst the external 
>> app 
> queues and processes the notifications.
>
>   If you could share your knowledge it would be helpful. 
>   Does people create external application that scans the nagios.log without 
> using 
>   any of Event Handlers or Notification Events of Nagios? Or perhaps event 
> broker?
>
What I ended up doing was having notification commands that drop a spool
file containing the Nagios environment macros into a directory. A
second daemon reads in the spool files for the notification event,
collects all the metadata (product dependencies, runbook links, ticket
links, etc), and caches it for the subsequent contacts.

The spool file write is very quick, letting Nagios get back to dealing
with check submission handling, and the secondary daemon takes over what
was a serial, blocking process, doing all the heavy lifting for
notification generation and email in a fairly parallel fashion. A
notification for us will include upwards of 20 contacts (some email lists,
some individuals, some ticketing and tracking systems, and some pagers).
At the height of the "bad times" Nagios would block for 40 or so seconds,
sending out every single notification serially. Just dumping out a spool
file for each contact happens in a small fraction of a second. Pager
contacts are still straight Nagios to postfix, because simple is better
for anything where you're actually waking someone up at 3am.
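
A minimal sketch of the spool-file side, assuming environment macros are enabled so the NAGIOS_* variables are present. The function name and the file naming scheme are made up for this example, and values containing newlines are not handled.

```shell
#!/bin/bash
# Write all NAGIOS_* environment variables to a new file in the spool
# directory. The file is written under a temporary dot-name and renamed
# afterwards, so a reader never sees a half-written file.
# Prints the final file name.
write_spool() {
    local dir="$1" tmp final
    tmp=$(mktemp "$dir/.tmp.XXXXXX") || return 1
    env | grep '^NAGIOS_' > "$tmp"
    final="$dir/$(basename "$tmp" | sed 's/^\.tmp\.//').spool"
    mv "$tmp" "$final" && echo "$final"
}
```

The rename is the design point here: the daemon only picks up *.spool names, and a rename within one directory happens in a single step.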

-- 
Mike Lindsey




Re: [Nagios-users] Missing notifications?

2011-01-03 Thread Mike Lindsey

On 1/3/11 11:53 AM, James Moseley wrote:


I was directed in the irc forum to look at
http://nagios.sourceforge.net/docs/3_0/statetypes.html and I learned
that services or hosts need to have a hard state failure, in order for
notifications to go out. I've been trying to reboot my servers in
order to get my notifications to go out. I'd like to check that I'll
actually get notifications as soon as there is a problem, for example, a
host goes offline, I'd like to know after about 2 minutes if possible!


Then just turn a server off and wait a bit...  ;-)  You could also 
create a fake host, one with an unpingable IP address.
My favorite trick, that doesn't require adding fake config or rebooting 
a server, is updating /etc/hosts on the monitoring host to point a real 
host at a known bad ip address.  Requires root, and that you use dns for 
your hostaddress entries, however.
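
A sketch of the /etc/hosts trick, assuming the host name is not already listed in the file. 192.0.2.1 comes from the range reserved for documentation (RFC 5737) and is used here as the "known bad" address; the function names are made up, and the file path is a parameter so it can be tried on a scratch copy first.

```shell
#!/bin/bash
# Append an override line so the name resolves to an address that will not
# answer. Arguments: hosts file, host name, optional bad IP.
add_bad_host() {
    local hosts_file="$1" name="$2" bad_ip="${3:-192.0.2.1}"
    printf '%s\t%s\n' "$bad_ip" "$name" >> "$hosts_file"
}

# Remove the override again once the notification test is done.
remove_bad_host() {
    local hosts_file="$1" name="$2" bad_ip="${3:-192.0.2.1}"
    local tmp
    tmp=$(mktemp) || return 1
    grep -v "^${bad_ip}[[:space:]]${name}\$" "$hosts_file" > "$tmp"
    cat "$tmp" > "$hosts_file"
    rm -f "$tmp"
}
```

Run as root against /etc/hosts, wait for the hard state and the notification, then remove the line again.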


--
Mike Lindsey


Re: [Nagios-users] NAGIOS_ environment variables in a notification script

2010-12-22 Thread Mike Lindsey
On 12/22/10 6:17 AM, Marc Haber wrote:
> Despite having set enable_environment_macros=1 in my nagios.cfg, the
> notification script only sees NAGIOS_PLUGIN=/path/bin/notify.
>
> What am I doing wrong?
>
> I'm using Nagios 3.0.6 from Debian lenny. Any hints will be appreciated.
enable_environment_macros should override use_large_installation_tweaks, 
which can also disable environment macros.  Perhaps your version 
is not acting as expected?  See if you have use_large_installation_tweaks 
enabled, and if so, try disabling it.

If that isn't it, try setting debug_level=2 (and debug_file, etc).  
Restart and check the debug output to see if it's actually seeing the 
config directive.  Perhaps you have a typo.

Then maybe set debug_level=32 and run a few notification tests (or just 
set it to 34 initially so you get notification and configuration 
debugging)...
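
A quick way to see what the config file actually says before turning on debugging; a sketch only, with made-up function names, and the path to nagios.cfg is whatever your install uses.

```shell
#!/bin/bash
# Print the directives from the given nagios.cfg that affect environment
# macros and debugging, ignoring commented-out lines.
show_macro_settings() {
    grep -E '^[[:space:]]*(enable_environment_macros|use_large_installation_tweaks|debug_level|debug_file)[[:space:]]*=' "$1"
}

# debug_level is a bitmask, so levels are combined by OR-ing them:
# 2 (configuration) | 32 (notifications) = 34.
combined_debug_level() {
    echo $(( 2 | 32 ))
}
```

If a directive shows up twice, or not at all, that is worth sorting out before looking any further.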

Also, consider upgrading.  Nagios 3.2+ is great.

-- 
Mike Lindsey




Re: [Nagios-users] Nagios kept from restarting after reboot by lock file

2010-12-20 Thread Mike Lindsey
On 12/20/10 8:16 AM, eric.b...@barclayscapital.com wrote:
> Alternatively, could you recommend a good system/resource monitoring tool 
> that would be able to let me know if nagios is down and restart it 
> automatically?
>
Add a cronjob on a five (or whatever you're comfortable with) minute 
interval, similar to:
#!/bin/bash

PATH=/bin:/usr/bin:/usr/local/bin
LOCKFILE=/home/nagios/nagios/var/nagios.lock
PID=`cat "${LOCKFILE}" 2>/dev/null`

# kill -0 sends no signal; its exit status says whether the PID can be
# signalled.  An empty or zero PID is treated as a stale lock file.
if [ -z "${PID}" ] || [ "${PID}" = "0" ] || ! kill -0 "${PID}" 2>/dev/null
then
 rm -f "${LOCKFILE}"
 # INSERT RESTART COMMAND HERE
 echo "Killed Lockfile and restarted Nagios" | mail -s "Nagios restart `hostname`" your-em...@here.com
fi

Just be aware that the if block will also trigger if nagios is running 
under a different username than the cronjob, because kill -0 then fails 
with a permission error.  You can check for that by doing some tests in 
the script with ps and grep.

> _
> From:   Berg, Eric: IT (NYK)
> Sent:   Monday, December 20, 2010 11:03 AM
> To: 'nagios-users@lists.sourceforge.net'
> Subject:Nagios kept from restarting after reboot by lock file
>
> Gee, this seems like an annoying newbie problem, but if Nagios crashes or is 
> killed (as on system reboot), it leaves a lock file around that prevents it 
> from starting again until the lock file is manually removed.
>
> I see this on Monday mornings after weekend reboots on a Red Hat Linux box:
>
> nagios: Lockfile '/home/nagios/nagios/var/nagios.lock' looks like its already 
> held by another instance of Nagios (PID 0).  Bailing out...
Sounds like something in the shutdown process is throwing a 0 into the 
pid file, or the startup in the rc script is.

Either way, you should never have a 0 in there: either the rc script is 
putting the wrong data in there, or it's reporting incorrectly.
> Does anyone know if there's a config option or something else that obviates 
> the need to write a wrapper scropt to check to see if Nagios is really 
> running and remove the lock file (look slike Nagios already knows it's not 
> running by virtue of the value of the PID inthis very message!) so that it 
> can cleanly start up again?

-- 
Mike Lindsey




Re: [Nagios-users] CPU monitor for a single Linux user space process ?

2010-12-15 Thread Mike Lindsey
On 12/15/10 5:05 PM, Bruce Edge wrote:
> Rookie question here. Trying to determine nagios suitability for an
> embedded app.
>
> Can I monitor the CPU utilization for a single user space process on a
> Linux box with nagios?
> And, can I define an action if it exceeds a threshold?

Sounds like you need check_snmp_process.pl from here:
http://nagios.manubulon.com/snmp_process.html

I've been using it, it works quite well.  It requires snmpd, but is 
basically the swiss army knife of user-space process monitoring.

To "define an action" you need to set up an event handler.
http://nagios.sourceforge.net/docs/3_0/eventhandlers.html
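
A bare-bones sketch of what the event handler script might look like, assuming the command definition passes $SERVICESTATE$ and $SERVICESTATETYPE$ as the first two arguments. The function name and the action are placeholders.

```shell
#!/bin/bash
# Decide what to do for one event. Arguments are assumed to be
# $SERVICESTATE$ and $SERVICESTATETYPE$ from the command definition.
# Prints "act" only when the service is in a HARD CRITICAL state.
handle_event() {
    local state="$1" state_type="$2"
    case "$state" in
        CRITICAL)
            if [ "$state_type" = "HARD" ]; then
                echo "act"
                # INSERT ACTION HERE (restart or renice the process, etc.)
                return 0
            fi
            ;;
    esac
    echo "ignore"
}
```

Acting only on HARD states keeps the handler from firing on a single bad sample; whether that is the right trade-off depends on how quickly you need to react.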

-- 
Mike Lindsey




Re: [Nagios-users] 3.2.1 and non-incrementing notification count?

2010-08-15 Thread Mike Lindsey
Had bad source.  Recompiled and it's gone away.

On Aug 15, 2010, at 5:37 PM, Mike Lindsey wrote:

> I just migrated to a 3.2.1 instance, from a 3.0.6 instance, with the same 
> configuration.  Now I have some UNKNOWN results that are generating a 
> notification every five minutes.  The notification interval on the 
> services is 720 minutes, and same for the service escalation.
> 
> Every five minutes I'm getting a new email from the services, and the 
> notification counts in the ui still reads '0'.
> 
> Anyone seen this before?
> 
> --
> Mike Lindsey
> 




[Nagios-users] 3.2.1 and non-incrementing notification count?

2010-08-15 Thread Mike Lindsey
I just migrated to a 3.2.1 instance, from a 3.0.6 instance, with the same 
configuration.  Now I have some UNKNOWN results that are generating a 
notification every five minutes.  The notification interval on the 
services is 720 minutes, and same for the service escalation.

Every five minutes I'm getting a new email from the services, and the 
notification counts in the ui still reads '0'.

Anyone seen this before?

--
Mike Lindsey



Re: [Nagios-users] Escalate after X warnings or criticals

2010-06-15 Thread Mike Lindsey
If it hasn't, I'll be adding it myself and will be happy to submit my 
patches back.  I've been needing this functionality for a while, and was 
planning on rolling it in over the next 2-3 months.

Andrew Li wrote:
> Does anyone know if the notification count problem got fixed in 3.2.1?
> 
> I had a read of the ChangeLog but it doesn't mention anything related to
> this problem since 3.0.6.


-- 
Mike Lindsey



Re: [Nagios-users] extra "checkresults" files being left behind

2010-06-09 Thread Mike Lindsey
Mathew Walker wrote:
> I'm running Nagios on a little VPS box checking a few hosts/services 
> (~50 checks).  It's mostly a testing platform for me and checks in on my 
> other test VPS systems.
>  
> However I keep seeing the extra check results data files build up in 
> /usr/local/nagios/var/spool/checkresults like:
> -rw--- 1 nagios nagios249 Jun  7 23:45 checknbu01O
> -rw--- 1 nagios nagios252 Jun  8 02:40 checkHxcsiJ
>
> Googled a bit and didn't come up with much relevant.  Any thoughts?

If I remember correctly, the parent nagios process writes out that file, 
then forks a child.  The child then runs the check, updates that file 
and then creates a file with the same name, plus '.ok' in that 
directory, letting the parent process know the check is completed.

So, take a look at the contents of several of those files.  If you're 
lucky, you'll see that either they are for the same host, or the same 
service check.  If so, there might be something in the way that host or 
service is getting polled that is causing the forked child to die.
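
To spot a pattern across many leftover files, something like the following might help. It assumes the files contain host_name= and service_description= lines, which is how I remember the 3.x check result format; verify against one of your own files first.

```shell
#!/bin/bash
# Count how often each host and service appears in the leftover check
# result files of the given directory, most frequent first.
summarize_checkresults() {
    grep -h -E '^(host_name|service_description)=' "$1"/check* 2>/dev/null \
        | sort | uniq -c | sort -rn
}
```

Run it against /usr/local/nagios/var/spool/checkresults; a host or service with a count well above the rest is the first place to look.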

Also, if you're running a version older than 3.0rc1 (it's generally a 
good idea to include the version of the tool you're using when asking 
for help) then you may want to upgrade; that version fixed a bug that 
might be related:  "Fixed bug with not deleting old check result files 
that contained results for invalid host/service"

-- 
Mike Lindsey



[Nagios-users] Escalations - Warning to Critical, without skipping?

2010-05-19 Thread Mike Lindsey
So, here's my situation.  I've got around 10k checks.  Warnings do not 
notify, because we have historically had issues with Warning 
notifications (from the contact group setting) going out, then a service 
turning critical and the pager escalations (which only include critical) 
skipping directly to "Page everyone, and a couple managers" because we'd 
already had 3 warning-level notifications.

So, now all contacts have warning notifications disabled.  Which leads 
to missed events.

Is there any way to notify on warnings, without incrementing the 
notification count, and affecting escalations?

What I want is:  Warnings notify, and when a service turns Critical, it 
always starts at step 1 of the escalations.
That way, ops and dev can get notifications about service issues, before 
we get to the point where we need to page about it.  And when it does 
get to be paging time, nagios isn't waking up management at 4am.

I'd love to avoid having duplicate service checks, with a "warning" 
check that has warning notifications enabled, and a "critical" check 
with warning notifications disabled.

Ideal would be some manner of having split escalations, where it tracks 
the number of notifications of a specific state, and escalates based on 
that, but it looks like that requires some serious refactoring of the code.

(Running 3.0.6)
-- 
Mike Lindsey



Re: [Nagios-users] Full Throttle Nagios

2010-05-18 Thread Mike Lindsey
Marcel wrote:
> When I have more than, say, 10k checks, I start seen check latency rises 
> and there just isn't anything that could be done, even distributed 
> monitoring have the nagios.cmd write-lock bottleneck.

So, I've just gone through this, and the single greatest bottleneck I 
had to deal with is notifications.  But, I have a lot of people in the 
notification tree, and pull in a lot of meta-data to make ticket 
tracking and issue resolution easier and faster.  Since Nagios needs to 
know the exit status of notification commands, it doesn't fork before 
notifications; it just plods along waiting for the notification command 
to exit.

I switched all our non-pager notification commands to drop a spool file 
in a directory, letting another process read the spool files, generate 
email contents, query ticket databases, pull in documentation or 
extended testing information (full mysql processlist output for dbas, 
etc) and cache it for subsequent notifications for that event.

That showed a HUGE improvement to my master server's performance.
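
The reading side of that spool directory could start out as simple as this sketch. It assumes the notification commands write complete files named *.spool; the function name is made up, and the handler is whatever command builds and sends the notification.

```shell
#!/bin/bash
# Process every *.spool file in the directory once: hand the file to the
# given handler command and delete it if the handler succeeds. Handler
# output goes to stderr; the number of files handled is printed on stdout.
process_spool_once() {
    local dir="$1"; shift
    local f count=0
    for f in "$dir"/*.spool; do
        [ -e "$f" ] || continue        # the glob matched nothing
        if "$@" "$f" >&2; then
            rm -f "$f"
            count=$((count + 1))
        fi
    done
    echo "$count"
}
```

Files whose handler fails are left in place, so they are retried on the next pass rather than silently dropped.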

If notifications aren't your bottleneck, you can move all your temporary 
files to ramdisk.

You can also increase your FIFO pipe size, but that only delays the 
issue and doesn't really solve the problem if you're always running hot.  
It also probably involves recompiling your kernel.

If you're using nsca, you can cache your updates for a second or two, so 
that multiple updates happen in the same socket connection.
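
One way to approximate that batching without patching anything, as a sketch: queue results in a file and send the whole file through a single send_nsca call. The tab-delimited line format and the -H and -c options are standard send_nsca usage; the function names, batch file handling and paths are assumptions.

```shell
#!/bin/bash
# Format one passive service result in the tab-delimited form send_nsca
# reads on stdin: host, service description, return code, plugin output.
format_result() {
    printf '%s\t%s\t%s\t%s\n' "$1" "$2" "$3" "$4"
}

# Append a result to a batch file; a separate timer or loop flushes it.
queue_result() {
    local batch="$1"; shift
    format_result "$@" >> "$batch"
}

# Send everything queued so far in one send_nsca invocation. The batch is
# renamed first so results queued during the send are not lost.
flush_batch() {
    local batch="$1" master="$2" cfg="$3"
    [ -s "$batch" ] || return 0
    mv "$batch" "$batch.sending" || return 1
    send_nsca -H "$master" -c "$cfg" < "$batch.sending" && rm -f "$batch.sending"
}
```

If the send fails, the .sending file is left behind for inspection or a retry.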

Alternately (or additionally) you can have nsca update the checkresults 
directory, directly, skipping the steps where nagios reads the command 
pipe, and then just writes it back out to the checkresults directory.

I can package up a patch (against 2.7.2) of those last couple changes (I 
need to submit them, anyway).  If you're manlier than I might be, you 
could also consider modifying the core nagios to allow submissions from 
distributed nagios servers, directly to a socket, but doing that right 
might require serious threaded c foo, and depending on your OS and 
threading library, you might be locked to a single core.

So, you have options.  They're not all equal, and aren't all easy.  But 
you wouldn't be working with monitoring if you didn't like challenges...  :)

-- 
Mike Lindsey



Re: [Nagios-users] nagios without web interface

2010-05-18 Thread Mike Lindsey
Leonardo Carneiro - Veltrac wrote:
> Hi. I want to compile nagios without the web interface. I think that i 
> should include these parameters in the configure:
> 
> --disable-statusmap
> --disable-statuswrl
> --without-httpd-conf
> 
> Is this right? There is anything else that i should include (or exclude)?

When you run make, just do:

make nagios
make install-base

You could also build everything, and just skip the cgi install, and that 
would probably take less time than getting your answer took.

-- 
Mike Lindsey



Re: [Nagios-users] Customizing notifications

2010-02-03 Thread Mike Lindsey
Chip Burke wrote:
> I have a request to “plain English”-ify my notifications. One item I 
> have been asked for is when the service state changes, to report the 
> duration of the previous service state.
> 
> Example: HTTP is now OK after 00:02:35 of down time.
> 
> Is there an easy way to do this? It seems Nagios doesn’t offer a Last 
> State Duration macro, so I am assuming this is going to be a matter of 
> some sort of custom scripting. Has anyone had experience with this sort 
> of thing?

Likely, your best option will be to set up an event handler script for 
that service.   If you already have event handlers configured, and you 
want this logic to run everywhere, consider setting up a script like 
this for your global event handler.

In the event handler, you will want to touch a file in /tmp based on the 
host, service, and state, whenever there's a hard state change.

Like, /tmp/localhost-load-ok...  You could even simplify if all you care 
about is ok/not ok.
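
The event handler side might look something like this sketch. It assumes the command definition passes $HOSTNAME$, $SERVICEDESC$, $SERVICESTATE$ and $SERVICESTATETYPE$ as arguments after a marker directory; the function name is made up, and service descriptions containing slashes would need extra handling.

```shell
#!/bin/bash
# Touch a marker file when a service enters a new HARD state.
# Arguments: marker directory, host, service, state, state type.
mark_state() {
    local dir="$1" host="$2" svc="$3" state="$4" state_type="$5"
    local suffix="notok"
    [ "$state_type" = "HARD" ] || return 0
    if [ "$state" = "OK" ]; then
        suffix="ok"
    fi
    touch "$dir/$host-$svc-$suffix"
}
```

The modification time of each marker then records when the service last entered that state.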

Then in your notification script, just check for the presence of those 
files, and do your date calculation by pulling the modification date out 
with stat (or script code, if your notification command isn't a chunk of 
bash).


Something like:

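now=`date +%s`
# Assumes the event handler touches the -ok / -notok file when the service
# enters that state, so the previous state's file marks when it began.
# stat -f "%m" is the BSD form; on GNU/Linux use stat -c "%Y".
if [ "${NAGIOS_LASTSERVICESTATE}" == "OK" ]
then
    filetime=`stat -f "%m" /tmp/localhost-load-ok`
else
    filetime=`stat -f "%m" /tmp/localhost-load-notok`
fi
time=$(( now - filetime ))

echo "${NAGIOS_SERVICEDESC} is now ${NAGIOS_SERVICESTATE} after ${time} seconds."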


You might want to flesh it out with some file-exists tests as well.
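
Two small helpers along those lines, as a sketch with made-up function names: one guards against a missing marker file, the other turns the seconds into the HH:MM:SS form from the original request.

```shell
#!/bin/bash
# Format a number of seconds as HH:MM:SS.
fmt_duration() {
    local s="$1"
    printf '%02d:%02d:%02d\n' $((s / 3600)) $((s % 3600 / 60)) $((s % 60))
}

# Print seconds since the marker file was last modified, or fail if the
# file is missing. Tries GNU stat first, then the BSD form.
seconds_since() {
    local f="$1" mtime
    [ -e "$f" ] || return 1
    mtime=$(stat -c '%Y' "$f" 2>/dev/null) || mtime=$(stat -f '%m' "$f") || return 1
    echo $(( $(date +%s) - mtime ))
}
```

With these, a missing file can simply fall back to a notification without the duration.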

Good luck!

-- 
Mike Lindsey



Re: [Nagios-users] Event Handlers

2010-02-03 Thread Mike Lindsey
Marc Powell wrote:
> On Feb 3, 2010, at 8:16 AM, Jeff wrote:
> 
>> I have a service that needs to be monitored every minute.  I need some help 
>> understanding how services go from soft to a hard state
> 
> When a service check results in a non-OK state, services go from a Soft to a 
> Hard state when they reach max_check_attempts. 
> http://nagios.sourceforge.net/docs/3_0/statetypes.html
> 
>> and if an event handler can be run after a service has gone into a hard 
>> state. 
> 
> Only for it's initial Hard problem state or initial Hard recovery state. 
> http://nagios.sourceforge.net/docs/3_0/eventhandlers.html
> 
>> I'm sure everyone has a very dynamic and custom environment to some extent.  
>> I have event handlers that will not run if a lock file is present (cause i 
>> am deploying code or so other scripts do not step on each other).  So I for 
>> this service that I monitor every minute, I have Max Retries set to 3, Check 
>> Interval is 1, and retry interval is 1.  Can someone help shed some light on 
>> how I can get an event handler to run again after a service has gone into a 
>> hard state?
> 
> You can't really... The only real facility nagios has to do this (that I can 
> think of right now) is is_volatile 
> (http://nagios.sourceforge.net/docs/3_0/objectdefinitions.html#service) but 
> that's probably overkill for your needs; particularly the notification 
> implications.

The other possibility for having something run every time the service is 
checked, is to configure your ocsp_command.

Not exactly what it's generally used for, but it'll do in a pinch.

-- 
Mike Lindsey



Re: [Nagios-users] Overloaded master

2010-01-25 Thread Mike Lindsey
A typical first tier notification goes to 20 people.  One of those will 
be a pager, and is very simple.

The rest are fairly complex.

Notifications include a link to existing and recent tickets in our 
ticketing system (this also allows me to not send a ticket opening 
notification if a ticket already exists).  I populate the notification 
with links to cacti graphs and links to wiki documentation for the event, 
and I also fire off a secondary notification handler that adds 
additional information based on the host, service, and state.

The first notification of the cycle does all the heavy lifting and 
takes about 6 seconds.  The other 19 finish relatively quickly.

I've been thinking of building a notification server - so I could have 
separate and discrete notification escalations for different service 
states - which would also let me fire off one notification with just the 
contents of $ENV{NAGIOS_*}.  Perhaps that's my best option?
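
One way the "fire off one notification with just the environment" idea might look is sketched below in Python.  It is not an existing tool: the spool directory, the file naming, and both function names are assumptions.  The notification command only snapshots the NAGIOS_* variables to a spool file and returns; a separate worker (not shown) would pick the files up and do the slow ticket, graph, and wiki lookups.

```python
import json
import os
import tempfile


def snapshot_macros(env):
    """Keep only the NAGIOS_* variables from an environment mapping."""
    return {key: value for key, value in env.items() if key.startswith("NAGIOS_")}


def spool_event(env, spool_dir):
    """Write the macro snapshot to spool_dir and return the final path.

    The file is written under a dot-prefixed temporary name and renamed
    afterwards, so a worker scanning for *.json never sees a partial file.
    """
    payload = snapshot_macros(env)
    fd, tmp_path = tempfile.mkstemp(dir=spool_dir, prefix=".tmp-")
    with os.fdopen(fd, "w") as handle:
        json.dump(payload, handle)
    unique = os.path.basename(tmp_path)[len(".tmp-"):]
    final_path = os.path.join(spool_dir, unique + ".json")
    os.replace(tmp_path, final_path)
    return final_path


if __name__ == "__main__":
    demo_dir = tempfile.mkdtemp()  # stand-in for a real spool directory
    print(spool_event(os.environ, demo_dir))
```

The point of the design is that the command Nagios runs does nothing slower than one small file write, so the time the master spends inside a notification no longer depends on the ticketing system.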

Martin Melin wrote:
> What kind of notifications are you doing and how many are you sending 
> out? Why does a notification cycle take 9 seconds to complete?
> 
> On Sat, Jan 23, 2010 at 12:13 AM, Mike Lindsey <mike-nag...@5dninja.net> wrote:
> 
> What kind of options does one have, if your master nagios server is
> getting overloaded?
> 
> I have half a dozen slaves doing polling, submitting passive check
> results back via send_nsca.  The master does no active polling, just
> event processing, notifications, and web ui.
> 
> Under normal circumstances, it works alright.  But after a restart it
> can take up to half an hour before the master catches up; and if there
> are a lot of events, the act of sending out notifications can cause it
> to fall behind.
> 
> I'm pre-caching my object file, I'm skipping circular dependency checks,
> and I've gotten a notification cycle down to 9 seconds.  I tried
> modifying nagios to fork before notifications, but that failed pretty
> spectacularly; so that 9 seconds is a time where 900 or so passive check
> submissions block until the notifications are done.
> 
> Are there any options for running a dual-master setup, or other ways to
> spread the load across multiple machines?
> 
> Has anyone patched nsca to submit check results into the checkresults
> directory, instead of via the nagios.cmd pipe?  What kind of improvement
> can one expect from that?
> 
> Any other advice?


-- 
Mike Lindsey



[Nagios-users] Overloaded master

2010-01-22 Thread Mike Lindsey
What kind of options does one have, if your master nagios server is 
getting overloaded?

I have half a dozen slaves doing polling, submitting passive check 
results back via send_nsca.  The master does no active polling, just 
event processing, notifications, and web ui.
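
For anyone unfamiliar with the setup, here is a small Python sketch of how a slave might build its send_nsca input.  The helper name and the example host are made up; the tab-separated host, service, return code, output layout is the standard send_nsca input for service checks.  Several results are batched into one payload so they can go to the master over a single send_nsca invocation.

```python
def nsca_service_line(host, service, return_code, output):
    """Format one passive service check result as a send_nsca input line."""
    if return_code not in (0, 1, 2, 3):
        raise ValueError("return code must be 0 (OK), 1, 2, or 3")
    fields = [host, service, str(return_code), output]
    for field in fields:
        # Tabs and newlines are the delimiters, so they cannot appear in a field.
        if "\t" in field or "\n" in field:
            raise ValueError("fields must not contain tabs or newlines")
    return "\t".join(fields) + "\n"


def nsca_batch(results):
    """Join several (host, service, return_code, output) tuples into one payload."""
    return "".join(nsca_service_line(*result) for result in results)


if __name__ == "__main__":
    payload = nsca_batch([
        ("web1", "System Check", 0, "OK load mem ntp swap cfengine disk"),
        ("web1", "pre_queuedepth", 2, "CRITICAL - pre_queuedepth status: 2159 > 500"),
    ])
    print(payload, end="")  # pipe this text into send_nsca
```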

Under normal circumstances, it works alright.  But after a restart it 
can take up to half an hour before the master catches up; and if there 
are a lot of events, the act of sending out notifications can cause it 
to fall behind.

I'm pre-caching my object file, I'm skipping circular dependency checks, 
and I've gotten a notification cycle down to 9 seconds.  I tried 
modifying nagios to fork before notifications, but that failed pretty 
spectacularly; so that 9 seconds is a time where 900 or so passive check 
submissions block until the notifications are done.

Are there any options for running a dual-master setup, or other ways to 
spread the load across multiple machines?

Has anyone patched nsca to submit check results into the checkresults 
directory, instead of via the nagios.cmd pipe?  What kind of improvement 
can one expect from that?

Any other advice?

-- 
Mike Lindsey



Re: [Nagios-users] nagios blocking on notifications?

2010-01-14 Thread Mike Lindsey
Turns out nagios doesn't fork before handling notifications, and also 
waits for the children of any notification commands to exit, so forking 
inside my notification script won't help.

I took the part of the script that was taking 5-6 seconds to complete 
and added a cache mechanism, which cut the 90+ second 
notification cycle to a 6-8 second notification cycle.
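
The cache itself isn't shown here, so the following is only a guess at the general shape, written in Python.  Each notification runs as a separate process, so the sketch keeps the cache on disk with a time-to-live.  The cached_lookup() name, the cache directory, and the 300 second default are all assumptions.

```python
import hashlib
import json
import os
import time


def cached_lookup(key, compute, cache_dir, ttl_seconds=300, now=None):
    """Return compute() for key, reusing a fresh on-disk result if one exists."""
    now = time.time() if now is None else now
    name = hashlib.sha256(key.encode("utf-8")).hexdigest() + ".json"
    path = os.path.join(cache_dir, name)
    try:
        if now - os.path.getmtime(path) < ttl_seconds:
            with open(path) as handle:
                return json.load(handle)
    except (OSError, ValueError):
        pass  # missing or unreadable cache entry: fall through and recompute
    value = compute()
    # Write to a per-process temp file, then rename, so concurrent
    # notification processes never read a half-written entry.
    tmp_path = "%s.%d.tmp" % (path, os.getpid())
    with open(tmp_path, "w") as handle:
        json.dump(value, handle)
    os.replace(tmp_path, path)
    return value
```

Keyed on something like host plus service, the first contact in a notification cycle pays for the slow lookup and the remaining contacts read the stored result, which matches the "first one is slow, the rest are quick" pattern.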

Might be overkill, but I've also wrapped some fork() logic around the 
service_notification() call inside handle_async_service_check_result().

It compiles and runs.  I'll stress test it tonight and see how it does with 
real load tomorrow.

Also, if there's a better way to do this, I'm all ears.

Mike Lindsey wrote:
> I've got a high volume site.  Everything seems to keep up reasonably 
> well, unless there are a good number of state changes.  Once services 
> start changing state, and notifications start getting sent out, nagios 
> falls behind.
> 
> Did some digging in the logs and it looks like while a batch of 
> notifications is being sent out, it's rate limiting to about one per 
> five seconds.  Also, from the first notification for a service to the 
> last notification for that service, nothing else is written to the logs.
> 
> Since a typical notification goes out to 15+ people, that's over a 
> minute with no service check handling.
> 
> Is there something going on under the hood that I'm not aware of (like, 
> is it just not doing the log writing, but still doing the passive 
> service check handling, and there's something else causing my latency?)
> 
> Is that delay configurable?  I don't see anything in the docs for that.
> 
> I've even set my notification script to just call and background a 
> secondary script, to try and see if it wasn't a delay in the 
> notification script, but that seemed not to do anything at all.  Should 
> I be forking the notification script instead?
> 
> Here's a log snippet:
> [1263505735] EXTERNAL COMMAND: 
> PROCESS_SERVICE_CHECK_RESULT;;System Check;0;OK load mem ntp 
> swap cfengine disk|
> [1263505735] EXTERNAL COMMAND: 
> PROCESS_SERVICE_CHECK_RESULT;;System Check;0;OK load mem ntp 
> swap cfengine disk|
> [1263505735] EXTERNAL COMMAND: 
> PROCESS_SERVICE_CHECK_RESULT;;System Check;1;WARNING [swap 
> utilization 25%] [/data/ at 77% (inodes 0%)]|
> [1263505735] PASSIVE SERVICE CHECK: 
> ;check_mtime-redlist.txt;0;OK - redlist.txt 102 seconds old
> [1263505735] PASSIVE SERVICE CHECK: ;pre_queuedepth;2;CRITICAL 
> -  pre_queuedepth status: 2159 > 500
> 
> [1263505735] SERVICE NOTIFICATION: 
> ;;pre_queuedepth;CRITICAL;notify-by-email;CRITICAL - 
>  pre_queuedepth status: 2159 > 500
> [1263505741] SERVICE NOTIFICATION: 
> ;;pre_queuedepth;CRITICAL;notify-by-email;CRITICAL - 
>  pre_queuedepth status: 2159 > 500
> 
> 
> The SERVICE NOTIFICATION entries keep rolling in every 5-6 seconds for 
> the next minute+, then it goes back to its usual happy speed.
> 
> Is this an artifact of the way it logs, or is the whole system choking 
> while it sends email?  I've searched the list archives and not found 
> anything on this.
> 


-- 
Mike Lindsey



[Nagios-users] nagios blocking on notifications?

2010-01-14 Thread Mike Lindsey
I've got a high volume site.  Everything seems to keep up reasonably 
well, unless there are a good number of state changes.  Once services 
start changing state, and notifications start getting sent out, nagios 
falls behind.

Did some digging in the logs and it looks like while a batch of 
notifications is being sent out, it's rate limiting to about one per 
five seconds.  Also, from the first notification for a service to the 
last notification for that service, nothing else is written to the logs.

Since a typical notification goes out to 15+ people, that's over a 
minute with no service check handling.

Is there something going on under the hood that I'm not aware of (like, 
is it just not doing the log writing, but still doing the passive 
service check handling, and there's something else causing my latency?)

Is that delay configurable?  I don't see anything in the docs for that.

I've even set my notification script to just call and background a 
secondary script, to try and see if it wasn't a delay in the 
notification script, but that seemed not to do anything at all.  Should 
I be forking the notification script instead?
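
One thing worth ruling out when backgrounding doesn't help: a backgrounded child that still holds the parent's stdout/stderr open can keep whoever is reading that output waiting until the child exits.  Whether that is what is happening here is an assumption, not something established in this thread.  A Python sketch of launching the secondary script fully detached (spawn_detached() is a made-up helper name):

```python
import subprocess
import sys


def spawn_detached(argv):
    """Start argv without inheriting our stdio and in its own session."""
    return subprocess.Popen(
        argv,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,   # do not hold the caller's output pipe open
        stderr=subprocess.DEVNULL,
        start_new_session=True,      # setsid(): detach from the caller's session (POSIX only)
        close_fds=True,
    )


if __name__ == "__main__":
    # Stand-in for the slow secondary script: a child that sleeps 5 seconds.
    child = spawn_detached([sys.executable, "-c", "import time; time.sleep(5)"])
    print("launcher returning immediately, child pid", child.pid)
```

If the notification command still takes as long as the secondary script after this, then the caller really is waiting on the child processes themselves rather than on their output.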

Here's a log snippet:
[1263505735] EXTERNAL COMMAND: 
PROCESS_SERVICE_CHECK_RESULT;;System Check;0;OK load mem ntp 
swap cfengine disk|
[1263505735] EXTERNAL COMMAND: 
PROCESS_SERVICE_CHECK_RESULT;;System Check;0;OK load mem ntp 
swap cfengine disk|
[1263505735] EXTERNAL COMMAND: 
PROCESS_SERVICE_CHECK_RESULT;;System Check;1;WARNING [swap 
utilization 25%] [/data/ at 77% (inodes 0%)]|
[1263505735] PASSIVE SERVICE CHECK: 
;check_mtime-redlist.txt;0;OK - redlist.txt 102 seconds old
[1263505735] PASSIVE SERVICE CHECK: ;pre_queuedepth;2;CRITICAL 
-  pre_queuedepth status: 2159 > 500

[1263505735] SERVICE NOTIFICATION: 
;;pre_queuedepth;CRITICAL;notify-by-email;CRITICAL - 
 pre_queuedepth status: 2159 > 500
[1263505741] SERVICE NOTIFICATION: 
;;pre_queuedepth;CRITICAL;notify-by-email;CRITICAL - 
 pre_queuedepth status: 2159 > 500


The SERVICE NOTIFICATION entries keep rolling in every 5-6 seconds for 
the next minute+, then it goes back to its usual happy speed.
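
To put numbers on the spacing rather than eyeballing it, something like the Python sketch below could be run over nagios.log.  It relies only on the "[epoch] TYPE: ..." layout visible in the snippet above; the function name is made up, and it expects each log entry on a single line (the entries above are wrapped by the mail client).

```python
import re

# "[1263505735] SERVICE NOTIFICATION: ..." -> ("1263505735", "SERVICE NOTIFICATION")
LINE_RE = re.compile(r"^\[(\d+)\] ([A-Z ]+): ")


def notification_gaps(lines):
    """Return the seconds between consecutive SERVICE NOTIFICATION entries."""
    stamps = []
    for line in lines:
        match = LINE_RE.match(line)
        if match and match.group(2) == "SERVICE NOTIFICATION":
            stamps.append(int(match.group(1)))
    return [later - earlier for earlier, later in zip(stamps, stamps[1:])]


if __name__ == "__main__":
    sample = [
        "[1263505735] PASSIVE SERVICE CHECK: ;pre_queuedepth;2;CRITICAL",
        "[1263505735] SERVICE NOTIFICATION: ;;pre_queuedepth;CRITICAL;notify-by-email;CRITICAL",
        "[1263505741] SERVICE NOTIFICATION: ;;pre_queuedepth;CRITICAL;notify-by-email;CRITICAL",
    ]
    print(notification_gaps(sample))
```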

Is this an artifact of the way it logs, or is the whole system choking 
while it sends email?  I've searched the list archives and not found 
anything on this.

-- 
Mike Lindsey
