Re: [Nagios-users] Passive-only master still pinging
In about a month I'll be getting an official request to have freshness checking turned on, with all the commands being something like:

  echo "Stale check!";exit 2

So, unfortunately, ensuring that the master never ever accidentally runs a check is important, even beyond the current issue of hosts without routes.

On 10/23/12 7:34 PM, booleanena...@gmail.com wrote:
> You can always set the check command for the host to execute the check_dummy
> plugin so that if it doesn't get the result and decides to run an active
> check check_dummy will force it to be in an up state.
> Sent on the Sprint® Now Network from my BlackBerry®
>
> -----Original Message-----
> From: Mike Lindsey
> Date: Tue, 23 Oct 2012 14:00:36
> To: Nagios Users List
> Reply-To: Nagios Users List
> Subject: [Nagios-users] Passive-only master still pinging
>
> I've got a passive-only master that is configured to never execute
> checks. Yet it's still performing ping checks for some hosts at some
> times. This is mostly just annoying, but when it decides to ping hosts
> that it doesn't have a route to, pagers go off.
>
> I've got 30k services in this config, so debug isn't really an easy option.
>
> Seeing this on 3.3.1. Any ideas?
>
> # excerpt from nagios.cfg
> accept_passive_host_checks=1
> cached_host_check_horizon=15
> check_for_orphaned_hosts=0
> check_host_freshness=0
> enable_predictive_host_dependency_checks=0
> execute_host_checks=0
> host_inter_check_delay_method=s
> max_host_check_spread=30
> obsess_over_hosts=0
> passive_host_checks_are_soft=1
> translate_passive_host_checks=0
> use_aggressive_host_checking=0
>

--
Mike Lindsey

--
Everyone hates slow websites. So do we.
Make your web apps faster with AppDynamics
Download AppDynamics Lite for free today:
http://p.sf.net/sfu/appdyn_sfd2d_oct
___
Nagios-users mailing list
Nagios-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting any issue.
::: Messages without supporting info will risk being sent to /dev/null
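For reference, the freshness command described in the reply (print a message, exit 2) can be written as a tiny plugin. This is only a sketch: the function names are made up for illustration, and the one thing taken as given is the standard plugin exit-code convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
import sys

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def stale_check(message="Stale check!"):
    """Return the (output, exit_code) pair a freshness command would produce."""
    return message, CRITICAL


def main():
    """Print the plugin output and return the exit code (hypothetical entry point)."""
    output, code = stale_check()
    print(output)
    return code
```

To use it as an actual check command, a wrapper script would end with sys.exit(main()). Like the shell one-liner, it never touches the network, so it cannot page on a missing route.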
[Nagios-users] Passive-only master still pinging
I've got a passive-only master that is configured to never execute checks. Yet it's still performing ping checks for some hosts at some times. This is mostly just annoying, but when it decides to ping hosts that it doesn't have a route to, pagers go off.

I've got 30k services in this config, so debug isn't really an easy option.

Seeing this on 3.3.1. Any ideas?

# excerpt from nagios.cfg
accept_passive_host_checks=1
cached_host_check_horizon=15
check_for_orphaned_hosts=0
check_host_freshness=0
enable_predictive_host_dependency_checks=0
execute_host_checks=0
host_inter_check_delay_method=s
max_host_check_spread=30
obsess_over_hosts=0
passive_host_checks_are_soft=1
translate_passive_host_checks=0
use_aggressive_host_checking=0

--
Mike Lindsey
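With a config this large, one cheap sanity check is to confirm that the running nagios.cfg really contains the values from the excerpt. The sketch below parses key=value lines and reports any directive that differs. The EXPECTED table is copied from the excerpt; treating it as the full list of settings that matter is an assumption, and the thread does not say that this is the cause of the stray pings.

```python
# Expected values, copied from the nagios.cfg excerpt in the post.
EXPECTED = {
    "execute_host_checks": "0",
    "check_host_freshness": "0",
    "check_for_orphaned_hosts": "0",
    "enable_predictive_host_dependency_checks": "0",
}


def parse_main_config(text):
    """Parse key=value lines, skipping blank lines and # comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def audit(settings, expected=EXPECTED):
    """Return {directive: actual_value} for each directive that differs.

    A missing directive is reported with the value None.
    """
    return {k: settings.get(k) for k, v in expected.items()
            if settings.get(k) != v}
```

Running audit(parse_main_config(open(path).read())) against each included config file would show whether a later file overrides one of these values.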
Re: [Nagios-users] Segmentation Fault on config verification
Looks like I had a hostgroup that listed itself as a hostgroup member. There were 11 other hostgroup members, and 4220-character temp_hostgroup->members and newmembers strings.

In xdata/xodtemplate.c, in xodtemplate_recombobulate_hostgroup_subgroups(), the error was occurring in the while loop at:

"""
strcat(temp_hostgroup->members, newmembers);
"""

Not entirely sure what the root cause of the segmentation fault (fragmented memory?) might be, but updating my configuration to not include self-referential hostgroups has resolved the issue.

On 10/22/12 12:17 PM, Mike Lindsey wrote:
> Seeing this on 3.3.1, and 3.4.1. Tried to reproduce with 4, but can't
> build from the current git repository.
>
> Migrating from obj_file to obj_dir style nagios.cfg, and on validation
> of my Master configuration I get a Segmentation fault, that looks to be
> coming right after Nagios closes nagios.cfg.
>
> The same format of configuration, generated from the same script works
> fine for poller nodes. The main differences in the poller node
> configuration is size, no escalations, and no dependencies.
>
> The end of the truss output:
> mmap(0x0,783,PROT_READ,MAP_PRIVATE,5,0x0)= 34365812736 (0x8005cb000)
> munmap(0x8005cb000,783) = 0 (0x0)
> close(5) = 0 (0x0)
> stat("/usr/local/ironport/akeos/bin/tmp/ops-mon-nagios1.vega/timeperiods/workhours.cfg",{ mode=-rw-r--r-- ,inode=829664,size=389,blksize=4096 }) = 0 (0x0)
> open("/usr/local/ironport/akeos/bin/tmp/ops-mon-nagios1.vega/timeperiods/workhours.cfg",O_RDONLY,00) = 5 (0x5)
> fstat(5,{ mode=-rw-r--r-- ,inode=829664,size=389,blksize=4096 }) = 0 (0x0)
> mmap(0x0,389,PROT_READ,MAP_PRIVATE,5,0x0)= 34365812736 (0x8005cb000)
> munmap(0x8005cb000,389) = 0 (0x0)
> close(5) = 0 (0x0)
> getdirentries(0x4,0x800d27000,0x1000,0x800d15668,0x80aece00,0x7fffe410) = 0 (0x0)
> lseek(4,0x0,SEEK_SET)= 0 (0x0)
> close(4) = 0 (0x0)
> munmap(0x8005c9000,) = 0 (0x0)
> close(3) = 0 (0x0)
> SIGNAL 11 (SIGSEGV)
> process exit, rval = 0
>
> I'm digging into the source, but if anyone has any ideas, I'm ears.
>

--
Mike Lindsey
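Since the trigger was a hostgroup that named itself in its own member list, a generated config can be checked for that before it reaches Nagios. The following is a rough sketch, not the Nagios parser: it assumes plain "define hostgroup { ... }" blocks with the hostgroup_name and hostgroup_members directives, ignores templates and inheritance, and also flags indirect loops (a includes b, b includes a), which the post does not mention.

```python
import re

# Matches the body of each "define hostgroup { ... }" block.
DEFINE_RE = re.compile(r"define\s+hostgroup\s*\{(.*?)\}", re.DOTALL)


def parse_hostgroups(text):
    """Map hostgroup_name -> list of names in its hostgroup_members directive."""
    groups = {}
    for body in DEFINE_RE.findall(text):
        fields = {}
        for line in body.splitlines():
            line = line.split(";", 1)[0].strip()  # drop trailing ';' comments
            if not line or line.startswith("#"):
                continue
            parts = line.split(None, 1)
            fields[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
        name = fields.get("hostgroup_name")
        if name:
            members = fields.get("hostgroup_members", "")
            groups[name] = [m.strip() for m in members.split(",") if m.strip()]
    return groups


def find_cycles(groups):
    """Return sorted names of hostgroups that reach themselves via hostgroup_members."""
    cyclic = set()
    for start in groups:
        stack = list(groups[start])
        seen = set()
        while stack:
            current = stack.pop()
            if current == start:
                cyclic.add(start)
                break
            if current in seen:
                continue
            seen.add(current)
            stack.extend(groups.get(current, []))
    return sorted(cyclic)
```

A config generator could call find_cycles() on its output and refuse to write the files if the result is non-empty.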
[Nagios-users] Segmentation Fault on config verification
Seeing this on 3.3.1 and 3.4.1. Tried to reproduce with 4, but can't build from the current git repository.

Migrating from obj_file to obj_dir style nagios.cfg, and on validation of my Master configuration I get a segmentation fault that looks to be coming right after Nagios closes nagios.cfg.

The same format of configuration, generated from the same script, works fine for poller nodes. The main differences in the poller node configuration are size, no escalations, and no dependencies.

The end of the truss output:

mmap(0x0,783,PROT_READ,MAP_PRIVATE,5,0x0)= 34365812736 (0x8005cb000)
munmap(0x8005cb000,783) = 0 (0x0)
close(5) = 0 (0x0)
stat("/usr/local/ironport/akeos/bin/tmp/ops-mon-nagios1.vega/timeperiods/workhours.cfg",{ mode=-rw-r--r-- ,inode=829664,size=389,blksize=4096 }) = 0 (0x0)
open("/usr/local/ironport/akeos/bin/tmp/ops-mon-nagios1.vega/timeperiods/workhours.cfg",O_RDONLY,00) = 5 (0x5)
fstat(5,{ mode=-rw-r--r-- ,inode=829664,size=389,blksize=4096 }) = 0 (0x0)
mmap(0x0,389,PROT_READ,MAP_PRIVATE,5,0x0)= 34365812736 (0x8005cb000)
munmap(0x8005cb000,389) = 0 (0x0)
close(5) = 0 (0x0)
getdirentries(0x4,0x800d27000,0x1000,0x800d15668,0x80aece00,0x7fffe410) = 0 (0x0)
lseek(4,0x0,SEEK_SET)= 0 (0x0)
close(4) = 0 (0x0)
munmap(0x8005c9000,) = 0 (0x0)
close(3) = 0 (0x0)
SIGNAL 11 (SIGSEGV)
process exit, rval = 0

I'm digging into the source, but if anyone has any ideas, I'm all ears.

--
Mike Lindsey
Re: [Nagios-users] check_http throwing 141 exit on ssl error
On 9/14/12 11:25 AM, Justin T Pryzby wrote:
> This may be unrelated to the question of why it's exiting with a
> nonstandard, out of range exit status, but is port 83 really HTTP over
> SSL? It seems as if the plugin sent an ssl initiation, and the remote
> side closed the connection (perhaps because it wasn't ssl?).
>
> Later, the plugin tried to gracefully end the ssl session, but the
> socket was already closed (ECONNRESET), resulting in EPIPE, which I
> think is expected.

When the remote device isn't in this current state that's causing it to close inbound connections immediately after the socket is opened, yes, that's an https port.

On 9/14/12 11:54 AM, Andreas Ericsson wrote:
> On 09/14/2012 08:09 PM, Mike Lindsey wrote:
>> I'm typically used to seeing this kind of error code for a missing
>> plugin, but I've got a device that is accepting tcp connections and then
>> due to a local misconfiguration, immediately closing them.
>>
>> But rather than a normal critical I'm getting:
>> """
>> (Return code of 141 is out of bounds)
>> """
>>
> SIGPIPE has sig id 13. When a program catches a signal, it returns
> the sigid as a negative number, but the field for the exit status
> is unsigned, so it gets translated to 128 + sigid instead.
>
> As I read it back, I realize that doesn't exactly make supersense
> to anyone not familiar with integer math as computers do it, but
> I can assure you that's the reason.

Yup, makes sense now, and if I'd bothered to hit the man page, I'd have grokked that. As it was I just assumed 141 was SIGPIPE, so almost there but with an invalid (and irrelevant) assumption.

>> When run by hand I have:
>> """
>> root@ops-mon-nagios3 /usr/local/nagios/libexec $ ./check_http -H device.domain.com -w "10" -c "20" -S -p "83" -f follow
>> CRITICAL - Cannot make SSL connection
>> root@ops-mon-nagios3 /usr/local/nagios/libexec $ echo $?
>> 141
>> """
>>
>> write(1, "CRITICAL - Cannot make SSL conne"..., 39) = 39
>> write(3, "\200w\1\3\1\0N\0\0\0 \0\0009\0\0008\0\0005\0\0\26\0\0\23\0\0\n\7\0\300"..., 121) = -1 EPIPE (Broken pipe)
>> --- SIGPIPE (Broken pipe) @ 0 (0) ---
>> +++ killed by SIGPIPE +++
>>
> And there's the SIGPIPE. Case closed.
>

Would it be appropriate for the check (and potentially any other nagios-plugins check that opens a socket) to trap SIGPIPE and return a normal valid critical?

As is, any http or https (or smtp or ldap, etc.) check that's hitting a device behaving in this manner is going to display a non-useful message in the Nagios UI, instead of the actual critical output. This is an error condition on a remote host, covered by what is normally valid working monitoring.

If this should be more cleanly caught by lower level parent dependency monitoring, how? check_tcp returns 'ok' because the port opens.

If this is expected and desired behavior, should the output be updated to not include the misleading 'CRITICAL' prefix?

--
Mike Lindsey
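The arithmetic behind the 141 can be checked directly: a shell reports a process killed by signal N as exit status 128 + N, and SIGPIPE is signal 13 on the platforms discussed here. The helper names below are invented for illustration, and the sketch assumes a POSIX system where Python's signal module knows SIGPIPE.

```python
import signal

# Exit codes Nagios accepts from a plugin; anything else is "out of bounds".
PLUGIN_STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def signal_from_status(status):
    """Return the signal name for a shell-style 'killed by signal' status, else None."""
    if status <= 128:
        return None
    try:
        return signal.Signals(status - 128).name
    except ValueError:
        return None  # 128 + N where N is not a known signal number


def nagios_state(status):
    """Return the plugin state for an exit status, or None if it is out of bounds."""
    return PLUGIN_STATES.get(status)
```

Here signal_from_status(141) names SIGPIPE while nagios_state(141) is None, which is why the UI shows the out-of-bounds message instead of the CRITICAL text the plugin had already printed. For the plugin to exit 2 instead, it would have to ignore or handle SIGPIPE so that the failing write returns EPIPE rather than killing the process.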
[Nagios-users] check_http throwing 141 exit on ssl error
I'm typically used to seeing this kind of error code for a missing plugin, but I've got a device that is accepting tcp connections and then, due to a local misconfiguration, immediately closing them.

But rather than a normal critical I'm getting:

"""
(Return code of 141 is out of bounds)
"""

When run by hand I have:

"""
root@ops-mon-nagios3 /usr/local/nagios/libexec $ ./check_http -H device.domain.com -w "10" -c "20" -S -p "83" -f follow
CRITICAL - Cannot make SSL connection
root@ops-mon-nagios3 /usr/local/nagios/libexec $ echo $?
141
"""

Anyone seen this before? Is this resolved in nagios-plugins > 1.4.15?

Here's some potentially useful, lightly filtered strace output, showing it exiting on a SIGPIPE:

"""
socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 3
connect(3, {sa_family=AF_INET, sin_port=htons(83), sin_addr=inet_addr("68.232.133.59")}, 16) = 0
write(3, "\200w\1\3\1\0N\0\0\0 \0\0009\0\0008\0\0005\0\0\26\0\0\23\0\0\n\7\0\300"..., 121) = -1 ECONNRESET (Connection reset by peer)
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 5), ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2ab9e22dd000
write(1, "CRITICAL - Cannot make SSL conne"..., 39) = 39
write(3, "\200w\1\3\1\0N\0\0\0 \0\0009\0\0008\0\0005\0\0\26\0\0\23\0\0\n\7\0\300"..., 121) = -1 EPIPE (Broken pipe)
--- SIGPIPE (Broken pipe) @ 0 (0) ---
+++ killed by SIGPIPE +++
"""

--
Mike Lindsey
Re: [Nagios-users] configure receiving snmp traps
On 9/5/12 1:00 AM, Marco Borsani wrote:
> I read many docs, but I still have problem to configure nagios 3.x to
> receive the traps. May someone explain the steps to follow to configure
> correctly this issue ? Is it necessary other SW ?

You'll need to ensure that snmptrapd is enabled on your Nagios poller, and the typical route from there to get snmp traps submitted into Nagios is to install SNMPTT.

http://snmptt.sourceforge.net/

I recommend reading the docs for these, but a very basic snmptrapd.conf would be:

## snmptrapd.conf
snmpTrapdAddr udp:localhost,udp:YOUR_IP_HERE,tcp:YOUR_IP_HERE
authCommunity log,execute public
logOption f/var/log/snmptrapd.log
traphandle default /usr/sbin/snmptt -i /usr/local/share/snmp/snmptt.ini
##

And then in the TrapFiles section of snmptt.ini you might have:

##
[TrapFiles]
snmptt_conf_files = <
# All of these are stateless so the handler script needs to set and clear the service.
# The service entry must have 0 retries set and be volatile.
#
# .1.3.6.1.4.1.15497
#
# powerSupplyStatusChange
# Status: .1.3.6.1.4.1.15497.1.1.1.8.1.2
EVENT powerSupplyStatusChange .1.3.6.1.4.1.15497.1.1.2.0.2 "asyncos" Critical
FORMAT $N trap from $r
EXEC /usr/local/nagios/customplugins/submit_trap $r AsyncOS-Trap_Alert $s 0 "$N: $*"
#
#

Your submit_trap script takes that and hands it off to Nagios. You can submit through NSCA, or you can create a result file in the checkresult directory, or you can submit through the external command pipe. I do it through NSCA:

# submit_trap
#!/usr/local/bin/bash
PATH=/bin:/usr/bin:/usr/local/bin:/usr/local/nagios/customplugins:/usr/local/nagios/bin
CONFIG=/usr/local/nagios/etc/send_nsca.cfg
NSCA=`hostname`
HOST=$1
SERVICE=$2
STATUS=$3
STATEFUL=$4
MESSAGE=$5

# Map the SNMPTT severity string to a Nagios return code.
case $STATUS in
  "Critical") CODE=2 ;;
  "Warning") CODE=1 ;;
  "Normal") CODE=0 ;;
  *) CODE=3 ;;
esac

printf "%s\t%s\t%s\t%s\n" "$HOST" "$SERVICE" $CODE "$MESSAGE" | send_nsca -H $NSCA -c $CONFIG

# Compare the numeric code here: STATUS holds the severity string, so it never equals "0".
if [[ "$STATEFUL" == "0" ]] && [[ "$CODE" != "0" ]]
then
  # Clear Nagios via delayed at now that the volatile ticket's gone through.
  echo "/usr/local/nagios/customplugins/clear.sh $HOST \"$SERVICE\" \"$MESSAGE\"" | at now + 15 minutes
fi
#

... and clear.sh for clearing stateless alerts.

#
#!/usr/local/bin/bash
PATH=/bin:/usr/bin:/usr/local/bin:/usr/local/nagios/bin:/usr/local/ironport/nagios/bin
HOST=$1
SVC=$2
OUT=$3

if [[ "$HOST" == "" ]] || [[ "$SVC" == "" ]]
then
  echo "Need host, service, optional message."
  exit 3
fi

# Clear it
printf "%b" "$HOST\t$SVC\t0\tWas:$OUT\n" | send_nsca -H `hostname` -c /usr/local/nagios/etc/send_nsca.cfg
#

If you're using the auto-clear bits, your Nagios user will need to be able to add items to the at queue; you'll need to look at your distribution's documentation on how that's managed.

This is just one way of getting snmp traps working. Unfortunately, none of the ways that I know of are overly straightforward. Even if this doesn't work for you, it should give enough of an insight so that you've got a better idea of what to google for.

Good luck.

--
Mike Lindsey
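The core of submit_trap is small enough to restate: map the SNMPTT severity to a return code, then build the tab-separated line that send_nsca reads on stdin. A sketch of those two steps follows. The function names are made up, and needs_clear() encodes one reading of the script's intent (schedule a clear only for stateless, non-OK results), which is an assumption.

```python
# Severity strings as used in the EVENT line; anything else maps to UNKNOWN (3).
SEVERITY_TO_CODE = {"Critical": 2, "Warning": 1, "Normal": 0}


def severity_code(severity):
    """Translate an SNMPTT severity string into a Nagios return code."""
    return SEVERITY_TO_CODE.get(severity, 3)


def nsca_line(host, service, severity, message):
    """Build the host<TAB>service<TAB>code<TAB>output line fed to send_nsca."""
    return "%s\t%s\t%d\t%s\n" % (host, service, severity_code(severity), message)


def needs_clear(stateful, severity):
    """True when a stateless ("0") non-OK result should be cleared later."""
    return stateful == "0" and severity_code(severity) != 0
```

The resulting string would be piped to send_nsca exactly as the printf in the shell script does; nothing here replaces the snmptrapd or SNMPTT configuration.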
Re: [Nagios-users] Nagios 3.3.1, event brokers, and debug.
On 2/24/12 2:26 AM, Andreas Ericsson wrote:
> I'd have to send you the new Nagios code and get Sven to help me patch
> mod_gearman to avoid using threads, but if you want to give it a shot,
> I'm sure we could have you up and running in notime.

We're going with something else in the short term, probably updating our obsess commands to send data to multiple local servers, and pushing freshness checking off the master, onto those local failover nodes.

We finally reached the point where freshness checking on the master in Nevada, if there was a problem in Europe, could actually make the master fall far enough behind in processing passive checks to cause cascading failures.

Since we have a corporate directive to "Go Linux" we're just taking this as one among *many* reasons to accelerate our migration plan.

--
Mike Lindsey
Re: [Nagios-users] Nagios 3.3.1, event brokers, and debug.
On 2/23/12 10:50 AM, Sven Nierlein wrote:
> On 2/23/12 19:33, Mike Lindsey wrote:
>> Turns out that's the problem. I've rebuilt from source and it loads,
>> now to get our package maintainer to rebuild the package. And to
>> figure out why mod_gearman_worker's children keep segfaulting.
>
> Seems to be freebsd related. A colleague could reproduce that with
> freebsd 8.

Any advice short of rebuilding my entire infrastructure?

--
Mike Lindsey
Re: [Nagios-users] Nagios 3.3.1, event brokers, and debug.
On 2/23/12 2:16 AM, Sven Nierlein wrote:
> Hi Mike,
>
> Please don't hijack other threads.

Apologies. Unintentional thread header jacking.

> Make sure you have eventbroker handling compiled in.
> (--enable-event-broker).
> Also consider using the latest stable 3.2.3 which has been
> successfully tested with Mod-Gearman. I never tried the 3.3.1.

Turns out that's the problem. I've rebuilt from source and it loads, now to get our package maintainer to rebuild the package. And to figure out why mod_gearman_worker's children keep segfaulting.

It *looks* like gearman works fine with 3.3.1. At the very least I see jobs going into the queue.

--
Mike Lindsey
[Nagios-users] Nagios 3.3.1, event brokers, and debug.
I'm trying to test out mod_gearman, but I don't see any message about the event broker loading in the main logfile, and enabling debug logging just results in a blank debug log file.

From nagios.cfg:

debug_file=/usr/local/nagios/var/nagios.debug
debug_level=66
debug_verbosity=2
max_debug_file_size=1000
event_broker_options=-1
broker_module=/usr/local/nagios/lib/mod_gearman.o config=/usr/local/nagios/etc/gearman.cfg

From nagios.log:

[1329961651] Successfully shutdown... (PID=79938)
[1329961654] Nagios 3.3.1 starting... (PID=81413)
[1329961654] Local time is Wed Feb 22 17:47:34 PST 2012
[1329961654] LOG VERSION: 2.0
[1329961655] Finished daemonizing... (New PID=81414)

nagios.debug is empty.

Any advice?

--
Mike Lindsey
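When checking whether a module announced itself at startup, it helps to read nagios.log with the epoch timestamps decoded. The sketch below parses "[epoch] message" lines and reports whether any message mentions a broker. The search string "broker" is an assumption; the thread does not quote the exact wording of the module-load message.

```python
import re
from datetime import datetime, timezone

# A nagios.log line is "[<epoch seconds>] <message>".
LINE_RE = re.compile(r"^\[(\d+)\]\s+(.*)$")


def parse_log_line(line):
    """Return (UTC datetime, message) for a log line, or None if it doesn't match."""
    match = LINE_RE.match(line.strip())
    if not match:
        return None
    when = datetime.fromtimestamp(int(match.group(1)), tz=timezone.utc)
    return when, match.group(2)


def mentions_broker(lines, needle="broker"):
    """True if any parsed message contains the needle (case-insensitive)."""
    for line in lines:
        parsed = parse_log_line(line)
        if parsed and needle in parsed[1].lower():
            return True
    return False
```

Applied to the five log lines in the post, mentions_broker() returns False, which matches the report that nothing about the event broker was logged at startup.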
[Nagios-users] DNX dead?
Is DNX officially a dead project?

Last post on the developer's list is from May of last year - and got no response. Last thread is from two months before that, the last release is from two years ago, and the documentation is even older.

--
Mike Lindsey
Re: [Nagios-users] Dynamically add/remove hosts on Nagios
On 2/9/12 5:32 PM, Felipe Cecagno wrote:
> The problem is that I want to add and remove instances dynamically, I
> don't want to manually modify hosts.cfg on the central each time I change
> my infrastructure. So my idea was that when a new instance gets up, it
> will send to Nagios something like (always using NSCA):
> "localhost Server UP 0 "

I believe XI has a feature that does some automatic adding of hosts/services from passive checks. To get it working in Core you need to go about it in a different way.

You could potentially have an event handler mechanism that handles it. Say you have an "add-host" service on your master Nagios host. When a new host is added to a cluster you trigger a passive result for that check; that check's event handler kicks off and adds the configuration (using some generous templating) for that host to your master config, pre-caches your object cache and restarts. You could also have a "del-host" service that does the reverse if you're feeling brave.

For our environment, I took yet another route. Our CMDB provides an API where I can ask for all our "production updates www" hosts. So to get auto-updating cluster level monitoring for my environment I have a host entry (trimmed for brevity, and such) like:

define host {
    host_name       cluster-updates-www
    alias           cluster-updates-www
    address         cluster-updates-www
    hostgroups      All,cluster,updates,updates-www,www
    check_command   cluster_ping
    _ENVIRONMENT    prod
    _PRODUCT        updates
    _PURPOSE        www
}

It doesn't matter for this whether that hostname is in DNS, as nothing actually queries it. It doesn't need an ip address either, because nothing uses it.

Check command in this instance is:

define command {
    command_name    cluster_ping
    command_line    $USER5$/cluster_check.py --product $ARG1$ --purpose $ARG2$ --script '$USER1$/check_icmp' --args '-H %HOST% -c 1800.0,100% -n 2 -t 2'
}

So cluster_check.py asks the CMDB for a list of hosts when it first runs, then caches that list for an hour. It pings all the hosts in parallel, sums up the stats and does the right thing.

Unfortunately this kind of process requires some external query-able source of truth. Our CMDB (internally developed, not currently releasable) provides a JSON dump on an http port. If you've got anything you can query or parse, even an rsync'd dns zone file, you should be able to cobble something together that works similarly via bash, perl, or whatever works for you.

Auto-updating cloud/cluster monitoring, and no configuration updates or restart needed. The same script and methodology works for services.

It doesn't help you, however, if you absolutely must have unique host and service entries.

I've attached my script if you want to rip it apart and use it for your environment. You'll have to replace any of the bits that mention 'asdb' with code that queries and parses your CMDB, or whatever. It's a good bit of effort, but potentially worth it. Good luck.

(Or, there's XI)

--
Mike Lindsey

Attachment: cluster_check.py.gz (GNU Zip compressed data)
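The attached script is not reproduced in the archive, so the following is only a sketch of the approach described: fetch the host list from some source of truth, cache it for an hour, run one check per host in parallel, and collapse the results into a single plugin result. Every name here is hypothetical, and the aggregation rule (all failing is CRITICAL, some failing is WARNING) is an assumption, since the post only says the script "does the right thing".

```python
import time
from concurrent.futures import ThreadPoolExecutor


class CachedHostList:
    """Wrap a fetch() callable and reuse its result for ttl seconds."""

    def __init__(self, fetch, ttl=3600, clock=time.time):
        self._fetch = fetch
        self._ttl = ttl
        self._clock = clock
        self._hosts = None
        self._fetched_at = None

    def get(self):
        now = self._clock()
        if self._hosts is None or now - self._fetched_at >= self._ttl:
            self._hosts = list(self._fetch())  # e.g. a CMDB query
            self._fetched_at = now
        return self._hosts


def check_cluster(hosts, check_one, max_workers=16):
    """Run check_one(host) -> exit code in parallel; return (code, summary)."""
    if not hosts:
        return 3, "UNKNOWN - no hosts returned"
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        codes = list(pool.map(check_one, hosts))
    failed = [h for h, c in zip(hosts, codes) if c != 0]
    if not failed:
        return 0, "OK - %d/%d hosts up" % (len(hosts), len(hosts))
    code = 2 if len(failed) == len(hosts) else 1
    label = "CRITICAL" if code == 2 else "WARNING"
    return code, "%s - %d/%d hosts failing: %s" % (
        label, len(failed), len(hosts), ",".join(failed))
```

In a real plugin, fetch would query the CMDB (or parse a zone file) and check_one would run check_icmp against one host and return its exit status. Because each plugin run is a separate process, the cache would also need to be persisted to disk between runs, which this in-memory sketch does not do.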
Re: [Nagios-users] nsca old server with new nsca client
On 2/7/12 12:10 PM, Albert Shih wrote:
> Hi all,
> Is there any way I can use a new (2.9.1) client nsca (send_nsca) with old
> server (2.7.x) ?

2.9.1 shouldn't include any backwards incompatible code. That said, the normal cross-version issues have been with newer server and older client, so I'm not sure "old server" has been sufficiently tested with "new client"...

Is there a particular reason why you can't upgrade your server side?

--
Mike Lindsey
[Nagios-users] Issue with distributed Host checks
I'm seeing oddities with my host checks. These are all on 3.2.1, and I do not have Host dependencies for the hosts in question.

A worker node will detect a host as being down and send back a soft passive result. In many cases, the master will then immediately perform an active host check which is NOT logged. That host check will result in a hard state change, even though host checks are set for 2 retries at 1 minute intervals.

Anyone know what's going on, or do I need to go read the source?

Here's the relevant entries from the master node's nagios.cfg:

$ grep host nagios.cfg
accept_passive_host_checks=1
cached_host_check_horizon=15
check_for_orphaned_hosts=0
check_host_freshness=0
enable_predictive_host_dependency_checks=1
execute_host_checks=0
global_host_event_handler=event_handler
high_host_flap_threshold=20.0
host_check_timeout=30
host_freshness_check_interval=60
host_inter_check_delay_method=s
host_perfdata_file=/usr/local/nagios/var/host-perfdata.dat
host_perfdata_file_mode=a
log_host_retries=1
low_host_flap_threshold=5.0
max_host_check_spread=30
obsess_over_hosts=0
passive_host_checks_are_soft=1
retained_contact_host_attribute_mask=0
retained_host_attribute_mask=0
retained_process_host_attribute_mask=0
translate_passive_host_checks=0
use_aggressive_host_checking=0

And here's an example host object:

define host {
    host_name
    address
    hostgroups                    All,cres,cres-dbss,cres-prod-dbss,cres-prod-dbss.soma,dbss,linux2,soma
    check_command                 check-host-alive
    max_check_attempts            2
    check_interval                3
    retry_interval                1
    active_checks_enabled         1
    passive_checks_enabled        1
    check_period                  24x7
    obsess_over_host              1
    check_freshness               1
    flap_detection_enabled        1
    process_perf_data             0
    retain_status_information     1
    retain_nonstatus_information  0
    contact_groups                sysops
    notifications_enabled         1
    notification_interval         60
    notification_period           24x7
    notification_options          d,u,r,f
    notes_url                     https:///cacti/graph_view.php?action=preview&host_id=0&graph_template_id=0&filter=
    action_url                    /nagios/cgi-bin/extui.py?host=.com
    _ENVIRONMENT                  prod
    _HARDWARE                     R710
    _LOCATION                     soma
    _OS                           Linux 2.6.18-164.el5 #1 SMP Tue Aug 18 15:51:48 EDT 2009 x86_64
    _PORTFOLIO                    Encryption
    _PRODUCT                      cres
    _PURPOSE                      dbss
    _RACK                         07--11
    _SERIAL                       536QNM1
    _SOURCE                       ASDB/Servers
    _SOURCE_URL                   https:///servers/admin/servers/server/3363/
    __SNMP_COMMUNITY
}

--
Mike Lindsey
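For comparison with what the master is actually doing, here is a deliberately simplified model of the retry behaviour the post expects: with max_check_attempts 2, the first DOWN result is SOFT and only the second consecutive DOWN becomes HARD. This is a sketch of the documented soft/hard idea, not Nagios's real host-check code. It leaves out UNREACHABLE, flapping, and soft recoveries, and the tuple layout is made up.

```python
def next_state(prev, result_up, max_check_attempts=2):
    """Advance a (status, state_type, attempt) tuple by one check result.

    Simplified: any UP result resets to a HARD UP at attempt 1.
    """
    status, state_type, attempt = prev
    if result_up:
        return ("UP", "HARD", 1)
    if status == "UP":
        attempt = 1              # first failure after being up
    elif state_type == "SOFT":
        attempt += 1             # a retry while still in a soft problem state
    # Already HARD DOWN: the attempt counter stays where it is.
    state_type = "HARD" if attempt >= max_check_attempts else "SOFT"
    return ("DOWN", state_type, attempt)
```

Under this model a single passive DOWN from a worker leaves the host at ("DOWN", "SOFT", 1). The behaviour reported above looks as if a second, unlogged result arrived immediately and supplied the retry that pushed the host to HARD.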
[Nagios-users] Opinions on load balancing and failover mechanisms
There are a lot of options: DNX, Merlin, and mod_gearman, to name a few.

I could read the docs (and have read a good portion of some of them) and could implement test environments (and will eventually need to), but first I want opinions from people who've done this at large scale.

I need to improve on our load distribution and failover mechanisms. Right now worker node outages are handled through freshness checking, and master node outages are handled through a load balanced vip and some fancy cron jobs that kick up a cold spare.

What are the better options for local load distribution and geographic master failover?

Which options will better handle thousands of servers across a dozen colos, in half a dozen countries, when the goal is that no single host (or colo!) going offline can be allowed to have an effect on any other subset of the infrastructure?

Which options should I avoid?

Currently running Nagios Core 3.2.1 with NSCA 2.9 on mostly FreeBSD systems. Soon that should be Core 3.3, with XI on top, plus whatever load distribution mechanism wins the dog fight.

--
Mike Lindsey
Re: [Nagios-users] how to avoid host dependency with check_cluster
On 1/13/12 2:44 AM, Morty wrote:
> I've gotten check_cluster working to monitor a service cluster per the
> docs. It works at a basic level. Thanks!
>
> Problem: service definitions seem to require an associated host or
> hostgroup. I don't want to tie the check_cluster to an individual
> host, because if that host goes down, a different host in the cluster
> could still be up. But I also don't want it tied to every host in the
> cluster because then I could get duplicate notifications.
>
> What am I missing?

The easy solution: add a host entry called "WWW-Cluster1", or whatever you want it to be called. Set the IP address to 127.0.0.1, and your host up/down check to some always-alive check ("echo OK" will do fine). Then attach your cluster services to that.

You could even have your host availability check be a check_cluster command that pings all your hosts.

-- Mike Lindsey
Re: [Nagios-users] How to route data from multiple nagios core nodes to a nagiosxi node?
On 11/8/11 6:06 PM, Benjamin wrote:
> I have about ten nagios core machines that I currently monitor
> collectively using MNTOS. Is there a way to feed the data from my
> nagios core machines to a nagiosXI machine that makes it possible to
> use the nagiosXI features like the visualization/dashboards/reporting
> for the services/hosts being monitored by the nagios core nodes? I
> basically want to replace MNTOS w/ a nagiosXI machine, so that I can
> utilize its features as a dashboard and reporting node for all the
> data I receive at each nagios core server.
>
> Basically, I want to set up a hub and spoke w/ a nagiosXI machine at
> the hub and all my nagios core boxes as spokes.
> I looked into DNX but this looks like it distributes the checks in a
> different way. Any ideas?

I'm not as familiar with XI as I'd like, but based on the demos and the docs, I think NSCA will do just fine. I believe the configuration would be exactly the same as with a normal Nagios Core install.

-- Mike Lindsey
Re: [Nagios-users] nagios 2.9 doesn't send emails anymore.
On 10/27/11 3:15 AM, Mario Garcia Ortiz wrote:
> Hello
> we have this strange issue on a nagios server running version 2.9.
>
> all of a sudden, we stopped receiving notifications
> the web interface shows that the notifications are sent to all the
> contacts but nothing is actually sent. there's nothing on the syslog
> of the server; if we send manually a mail via the command line that is
> sent but nothing that is sent by nagios process itself.
>
> what could be the problem here.
> thank you

Many things could be the problem here; only some of them would be "Nagios." If you were running 3, I'd say turn on debug logging. Since you're not, this gets hard (or you could just upgrade to 3, see if the problem disappears, and if it doesn't, turn on that debug log).

What does your notification command config look like? What happens if you add:

"""
>/tmp/notif_log.out 2>&1
"""

to the end of it? That should trap the command's stdout and stderr, and save them in that file.

-- Mike Lindsey
Re: [Nagios-users] Question about Nagios Features
> nalysis - Be able to provide analysis on traffic
> flow as based on NetFlow, SFlow and/or JFlow
> 29. Netflow, SFlow, JFlow Support - Be able to support
> devices/elements running IP Flows and display statistics as based on
> information
> 30. Netflow Reporting - Be able to provide Flow reports
> 31. Application Traffic Information - Be able to provide statistics as
> based on application traffic information
> 32. Demographics - Be able to provide information on top users and top
> applications
> 33. VoIP QoS Measurement - Be able to provide statistics as based on
> VoIP QoS measured values
> 34. Alerts, Graphs and Reporting - Be able to provide alerts, graphs
> and reports as based on statistics gathered
> 35. VoIP Infrastructure Monitoring - Be able to monitor VoIP network
> infrastructure and provide information on network health for support
> of VoIP service

-- Mike Lindsey
Re: [Nagios-users] escalations question
On 10/26/11 10:27 AM, Paul M. Dubuc wrote:
> Michael Barrett wrote:
>> Is there any way to get that sort of setup working btw?
> You might re-think why you want to do this. If there has been a problem at
> the warning level for 2 or more notification intervals without it being
> acknowledged (which stops notifications) or fixed, maybe your secondary
> contact should be notified anyway when the critical threshold is exceeded.

When you have multiple levels of management in your escalation trees, this particular kind of behavior is to be avoided at all cost. :)

> If you really want it to work the way you describe then the best solution I
> can think of is to have 2 separate services with different contacts. One that
> issues only warnings and the other only critical problems. But then you've
> doubled the number of checks you are doing for the same problem.

There's a split-tier notification patch that seems to handle this pretty well. Standard escalation configuration stanzas work the same, but a few new ones are added that allow discrete escalations based on notification number AND type.

Barring that, you'll have to handle it as some monkey-patched afterthought in your notification scripts.

You can search the forums (or Google) for the patch. If that fails, I can probably find it later. I believe it was compatible with Nagios 3.2.

-- Mike Lindsey
Re: [Nagios-users] Notifications for Services that are "Up"
On 9/2/11 7:59 AM, Michael Loiselle wrote:
> I am currently running Nagios 3.3.1 on Ubuntu 10.04 and everything is
> working great. I am monitoring 30 Windows servers with NSClient++, and
> that is working as well. Is it possible to receive a notification for a
> specific service that is in the "up" position? In other words, I would
> like to get a notification every 4 hours, confirming that a service is
> actually running. If it stops on Friday evening, I do not want to wait
> until Monday morning to find out it is not running. I would personally
> rather receive a message every four hours. Notifications are currently
> set up and working flawlessly, so I just need to know what I need to
> change in the config files to get this to work. Any help is appreciated.

Not entirely sure what you're trying to solve here?

Generally it is good if you can get to a place where you trust your monitoring system. It should be telling you if something has broken, but if something continues to run well, the monitoring system should shut up and not bother you. If it stops on Friday evening, you should get an email immediately (or as soon as Nagios notices it...)

If by "it stops" you mean "Nagios stops" then that's a separate problem - I have a secondary system that ONLY monitors my primary Nagios infrastructure. If the primary system fails, the secondary system emails me - well, pages me, my secondary, all of ProdOps, etc.

TL;DR: No easy way to have an OK service automatically email. Nagios makes noise when things are broken, not when they're working.

-- Mike Lindsey
[Nagios-users] extraneous data in status file, for custom macro
I'm seeing odd data in my status file, for custom macros:

_ENVIRONMENT=0;prod
_PORTFOLIO=0;Internal
_PRODUCT=0;monitoring
_PURPOSE=0;extnagios
_SOURCE=0;ASDB/Rolemaps

Where are the 0; bits coming from, and what do they signify?

-- Mike Lindsey
Re: [Nagios-users] Having users view only their hosts/services
On 8/17/11 10:55 AM, Edwin Zoeller wrote:
> I know that this question has been posted many times before but what I
> am looking for is where I went wrong and if someone has an easy method.
> I have set up various users to view their information in Nagios and all
> is good. But for some reason, when I set up users now, it gets them to
> log in but displays the error message "not authorized to view...". I
> have no clue what I have done wrong. Any help would be great.

Your users can only see services and hosts for which they are contacts. This means that your login names must be the same as your contact names.

http://nagios.sourceforge.net/docs/3_0/cgiauth.html

-- Mike Lindsey
Re: [Nagios-users] Nagios authentication thru LDAP.
On 8/10/11 9:23 AM, Robert J Molerio wrote:
> Can anyone indicate how this can be done?
> We would like users to log on to Nagios via LDAP.
> I think we need to configure the Apache server within Nagios to be
> able to do this but we're not sure.

Depending on your version of Apache, this ranges from a pain in the rear to nigh impossible. It's doable, but I've often found it easier and more stable to have a cronjob that exports the LDAP users to an htpasswd file. That requires fewer changes to your Apache installation, and doesn't lock your users out of your Nagios install if LDAP fails.

-- Mike Lindsey
[Nagios-users] Eternally pending, stale checks
I deployed new monitoring today, and despite a few restarts and many hours of waiting, 185/220 services are still pending.

It's a 3.2.1 environment (yes, yes, upgrade, yes) with one master and multiple pollers. All this new monitoring is on one polling host. Active checks are disabled on the master; passive checks are submitted via NSCA. Freshness threshold is set to 20 minutes for checks with a 5 minute interval.

The polling host executes the checks and has the right data in the status.log, but the master never receives some of the check data. The data it does receive is not consistently grouped. Service A on one host will submit consistently, but the same service on a different host will fail to submit.

Every 20 minutes the master throws messages about the checks being stale and needing to force an immediate check, but that never seems to make its way through.

My next step, I suppose, will be enabling debug mode on the master, but if history is any indication, that will cause the problem to stop happening - in addition to it being a pain to parse through debug logs for a 10k service environment.

If anyone has ideas on what else to check, I'm all ears.

-- Mike Lindsey
[Nagios-users] Attempting to execute bash script stored in Service Meta variable, referenced in command_line
Uncleaned macro. Running output (237): 'bash -c 'mailflow_rate.py -H xxhostxx -u user -p -s "1200" -d "db" -t "table" -w "$NAGIOS__SERVICEWARN" -c "$NAGIOS__SERVICECRIT'
Not currently in macro. Running output (244): 'bash -c 'mailflow_rate.py -H xxhostxx -u user -p -s "1200" -d "db" -t "table" -w "$NAGIOS__SERVICEWARN" -c "$NAGIOS__SERVICECRIT" 2>&1''
Done. Final output: 'bash -c 'mailflow_rate.py -H xxhostxx -u user -p -s "1200" -d "db" -t "table" -w "$NAGIOS__SERVICEWARN" -c "$NAGIOS__SERVICECRIT" 2>&1''

Short Output: Usage: mailflow_rate.py [options]
Long Output: \nmailflow_rate.py: error: option -w: invalid integer value: '$((/path/_slice_threshold.sh WARN))'\n

So, the final command has the right environment macros, and it does pull the string I expect out of the first service meta variable, but bash doesn't do the command substitution; it just hands the unsubstituted string off to the poller script.

Any ideas on how I can get this working right?

-- Mike Lindsey
Re: [Nagios-users] Expanding Custom Variables
On 6/29/11 12:18 PM, Stringham, Steven wrote:
> I am trying to monitor multiple volumes on a NetApp system. The format of the command requires a
> hostname:volumename format. I want to reduce my commands/service definitions to a minimum. My initial
> thought was to have a generic service definition, that gets more specific with a sub definition. When the
> command is run, it seems like it is not passing the custom variable, but rather leaving a single $ behind where
> the variable ought to be.

I'm not sure that custom macros are evaluated at the command level. Perhaps set your command_line to pull in the variable from the service:

define service {
    name                   NA_SnapMirror
    check_command          netapp_snapmirror!$_SERVICENAVOLUME$
    use                    GenericService_Core
    normal_check_interval  1000
    max_check_attempts     300
    register               0
    contact_groups         CoreServers
}

define service {
    use                  NA_SnapMirror
    _navolume            myvolumename
    service_description  SnapMirror_groups
    host_name            myhostname
}

define command {
    command_name  netapp_snapmirror
    command_line  $USER1$/check_naf.py -H $HOSTADDRESS$ -C $USER8$ snapmirror,$HOSTNAME$:$ARG1$,$USER25$
}
...

Alternately, if you have enable_environment_macros=1 in nagios.cfg, you could instead put "$NAGIOS__SERVICENAVOLUME" and pass the reference to the script.

One of the two should work for you. If not, then I'd recommend restarting in debug mode. debug_level=18 will get you debug information about both the configuration load process and the service check execution, so you should be able to figure out the problem - just fire it up with a reduced config set, so you only have this in there and don't get spammed by normal operations.

What version, btw?

-- Mike Lindsey
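The environment-macro route mentioned in the reply can be sketched on the plugin side as follows. This is only an illustration: it assumes enable_environment_macros=1 and that Nagios exports the custom service variable _navolume as NAGIOS__SERVICENAVOLUME (custom variable names are uppercased). The function name snapmirror_target is hypothetical and is not part of check_naf.py.

```python
import os


def snapmirror_target(hostname, environ=None):
    """Build the hostname:volumename string the NetApp check expects.

    The volume comes from the environment variable Nagios is assumed to
    export for the custom service variable _navolume.
    """
    env = os.environ if environ is None else environ
    volume = env.get("NAGIOS__SERVICENAVOLUME", "").strip()
    if not volume:
        # Missing variable usually means environment macros are disabled
        # or the service does not define _navolume.
        raise ValueError("NAGIOS__SERVICENAVOLUME is not set")
    return f"{hostname}:{volume}"
```

A plugin would call snapmirror_target(hostname) and report UNKNOWN (exit code 3) if the ValueError is raised, which makes a missing macro visible in the UI instead of silently passing a stray "$".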
[Nagios-users] Host acknowledgments not respecting "persistent"
Host acks are being cleared on restart with 3.2.1. Checking the changelog, I don't see any fixes for this.

The command ending up in the log is:

[1308256491] EXTERNAL COMMAND: ACKNOWLEDGE_HOST_PROBLEM;xxx;2;0;1;Mike Lindsey;testing restarts

It's interesting that the sticky bit is being set to '2'... But from looking at the code, it seems like it just tests for boolean, so that should be fine. Persistent is set to 1, but on a restart the host is no longer acked.

Has this been silently fixed in 3.2.2, or 3.2.3? Is it on the roadmap for 3.2.4?

-- Mike Lindsey
[Nagios-users] CMDB backend and feeder mechanism?
I'm interested in hearing what kind of CMDB people who also use Nagios tend to use. How do you track and maintain your hosts, and how do you map that into your Nagios configuration? I'm particularly interested in those with particularly complex environments.

We have an internally developed CMDB, as well as an internally developed Nagios configuration management tool, that I'm working on getting final approval to release as open source. I'd love a chance to pick some brains about what works, and particularly what doesn't work.

-- Mike Lindsey
Re: [Nagios-users] acknowledge triggers a script
On 5/10/11 12:20 PM, dave stern - e-mail.pluribus.unum wrote:
> We have an interesting need. When a particular service goes red on our
> Nagios 3.2.1 server, we'd like to be able to click on "Acknowledge this
> service problem" and have that activate a local script. Anyone have any
> idea how this can be accomplished?

Add a secondary CGI (bash script, perl, python, etc) and link to that using the ACTION_URL config variable for your hosts/service. Then you can have a link there that submits the ack command as well as executes your secondary command.

If you don't want that intermediary step you're going to need to update the cgi source.

-- Mike Lindsey
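A rough sketch of the "submit the ack command" half of such a secondary CGI, in Python. It formats an ACKNOWLEDGE_SVC_PROBLEM external command and writes it to the Nagios command pipe. The helper names are made up for illustration, and the pipe path is an assumption: use whatever command_file is set to in your nagios.cfg.

```python
import time


def ack_service_command(host, service, author, comment,
                        sticky=2, notify=1, persistent=1, now=None):
    """Format an ACKNOWLEDGE_SVC_PROBLEM external command line."""
    timestamp = int(time.time() if now is None else now)
    fields = [host, service, str(sticky), str(notify), str(persistent),
              author, comment]
    for index, value in enumerate(fields):
        # Newlines would start a second command; semicolons would shift
        # the fields (only the trailing comment may contain them).
        if "\n" in value or (";" in value and index < len(fields) - 1):
            raise ValueError(f"illegal character in field: {value!r}")
    return f"[{timestamp}] ACKNOWLEDGE_SVC_PROBLEM;{';'.join(fields)}\n"


def submit(command_line, pipe_path="/usr/local/nagios/var/rw/nagios.cmd"):
    """Write one command to the command pipe (path is an assumption)."""
    with open(pipe_path, "w") as pipe:
        pipe.write(command_line)
```

The CGI would call submit(ack_service_command(...)) and then run the local script, so one click does both.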
Re: [Nagios-users] enable SNMP trap handling in Nagios
On 5/10/11 9:46 AM, khurram aziz wrote:
> Hi,
> I am using Nagios 3.2.3 and want to enable SNMP trap handling so that I
> can check uptime of my servers (the snmp service has already been
> enabled on the servers). Can someone help me with the configuration?

Well, first off you don't need snmp traps to check uptime.

$ /usr/local/nagios/libexec/check_snmp -H localhost -o .1.3.6.1.2.1.1.3.0 -C xx
SNMP OK - Timeticks: (362041566) 41 days, 21:40:15.66 |
$

If all you want to see is an uptime counter, add a service that does that. Replace localhost with $HOSTADDRESS$, and replace xx with your snmp community.

Unfortunately, check_snmp doesn't seem to support having warning or critical thresholds, so unless snmpd is down, that will always return OK. You can use snmpget to get the raw timeticks:

$ snmpget -Ovt -v2c -c x localhost .1.3.6.1.2.1.1.3.0
36207797
$

If you want a critical alert every time a box has rebooted, write a shell script that calls that snmpget command, passing in the host address and snmp community via the command line.

Of course, that will throw an Unknown while the box is actually down (snmp can't tell if the host is down, or if you've passed the wrong snmp community).

If what you really want is to know when your box is down, use check_ssh:

$ /usr/local/nagios/libexec/check_ssh localhost
SSH OK - OpenSSH_5.1p1 FreeBSD-20080901 (protocol 2.0)
$

That will throw a critical any time it can't connect, or if it can connect but the ssh version string isn't found.

If you still want to use snmp traps, here's a link to some lovely documentation:

http://snmptt.sourceforge.net/docs/snmptt.shtml

There's even a section on integrating with Nagios, though I suggest you get some coffee and a snack and read the whole page.

-- Mike Lindsey
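The reboot alert described in the reply could look roughly like this. It is written in Python rather than shell purely for illustration, and covers only the threshold logic: it takes the raw timeticks string that "snmpget -Ovt" prints (hundredths of a second) and maps it to a Nagios status. The function names and the 15-minute default threshold are assumptions, not anything from the thread.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def parse_timeticks(raw):
    """Convert raw sysUpTime timeticks (1/100 s) to seconds."""
    return int(raw.strip()) / 100.0


def reboot_status(raw_timeticks, min_uptime_seconds=900):
    """Return (exit_code, message) for a 'recently rebooted' check."""
    try:
        uptime = parse_timeticks(raw_timeticks)
    except ValueError:
        # snmpget printed something other than a number, e.g. a
        # timeout message or "No Such Object".
        return UNKNOWN, "UNKNOWN - could not parse sysUpTime"
    if uptime < min_uptime_seconds:
        return CRITICAL, f"CRITICAL - host rebooted {uptime:.0f}s ago"
    return OK, f"OK - up {uptime / 86400:.1f} days"
```

A wrapper would run the snmpget command, feed its stdout to reboot_status(), print the message, and exit with the returned code.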
Re: [Nagios-users] Service failover dependency
On 4/7/11 2:45 AM, Andrey Mitroshin wrote:
> I'm afraid I did not explain my problem clearly.
>
> So, I've got 2 servers.
> serverA - primary (10.0.0.1)
> serverB - backup (10.1.0.1)
>
> There is some apache vhost named www.site.com configured on both of them.
> My failover is supposed to work as follows.
>
> Usually serverA is up and www.site.com resolves to 10.0.0.1.
> When serverA goes down, nagios executes the event handler, updates the A
> record and www.site.com points to 10.1.0.1 (serverB).
>
> The problem arises when both servers are down. So, the event handler
> updates the A record of www.site.com, but serverB is down as well.
>
> My goal is to avoid executing the event handler when serverB is down.
> And the question is how to configure such a behaviour in nagios.

What you need here is a smarter event handler.

In the event handler script, have it test serverB. If serverB is up, update the A record; otherwise just exit (and maybe update a logfile somewhere).

-- Mike Lindsey
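The decision logic of such a "smarter" event handler can be sketched like this. It assumes the handler is invoked with $SERVICESTATE$ and $SERVICESTATETYPE$ as arguments; check_backup and update_dns are placeholders for whatever you use to probe serverB and to change the A record, and are not real tools.

```python
def handle_event(state, state_type, check_backup, update_dns, log=print):
    """Fail over to the backup only if the backup is actually up.

    state / state_type: values of $SERVICESTATE$ / $SERVICESTATETYPE$.
    check_backup: callable returning True when serverB answers.
    update_dns:   callable that repoints the A record at serverB.
    """
    if state != "CRITICAL" or state_type != "HARD":
        return "ignored"   # only act on a confirmed failure of serverA
    if not check_backup():
        log("serverB is down as well; leaving the A record alone")
        return "skipped"
    update_dns()
    log("A record switched to serverB")
    return "updated"
```

Keeping the probe and the DNS update as injected callables means the branch logic can be exercised without touching DNS.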
Re: [Nagios-users] Check_APC_PDU Command Definition
Often, when you're getting an error and the only result you see is (null), what is happening is that your check script is printing to stderr. It might be that you have perl in your path, but the perl script's #! line doesn't declare the full path to perl, or there's an access error of some sort. But it's easy to figure out what's going on. Simply change your command_line to:

command_line $USER1$/check_apc_pdu.pl -H $HOSTADDRESS$ -C public 2>&1

That will redirect standard error to standard out. Next time Nagios runs the script it will capture the full output of the script, and you should see the issue right in your Nagios UI.

On Sun, Mar 27, 2011 at 2:45 PM, Peter Roddan <peter.rod...@sbsworldwide.com> wrote:
> If I log onto the nagios server as the nagios user, and run the command
> from the libexec folder (check_apc_pdu --H -C public) I get the response:
>
> "OK: All Outlets ok. | load=25"
>
> I have put the following command definition in:
>
> # 'check_apc_pdu' command definition
> define command{
>     command_name check_apc_pdu
>     command_line $USER1$/check_apc_pdu.pl -H $HOSTADDRESS$ -C public
> }
>
> And defined the following service:
>
> define service{
>     use generic-service ; Inherit values from a template
>     hostgroup_name apc ; The name of the host the service is associated with
>     service_description check_apc ; The service description
>     check_command check_apc_pdu ; The command used to monitor the service
>     normal_check_interval 5 ; Check the service every 5 minutes under normal conditions
>     retry_check_interval 1 ; Re-check the service every minute until its final/hard state is determined
> }
>
> However, my APC PDUs report an error for this service, with a status
> information of "(Null)"
>
> I'd be grateful for anyone who could point me in the right direction of
> where I'm going wrong.

-- Mike Lindsey
[Nagios-users] Long running notification script
I have a notification command that will typically take longer to run than my notification timeout. I don't particularly care if Nagios gets a valid return code back, so I set the main script to fork twice, with the initial process printing 'OK' and exiting with a return code of 0. The child process also exits immediately with a return code of 0, while the grandchild hangs around to do some heavy lifting.

I was hoping that the double-fork would keep Nagios from blocking on the process, but the debug logs are still showing:

[1300401208.452280] [032.1] [pid=55343] Adding normal contacts for service to notification list.
[1300401239.455867] [032.0] [pid=55343] 1 contacts were notified. Next possible notification time: Fri Mar 18 03:33:28 2011

when I'm expecting the '1 contacts were notified' to happen pretty much immediately.

Any ideas to get around this, other than writing out a spool file and having a secondary daemon handle the heavy lifting?

-- Mike Lindsey
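One possibility worth ruling out, offered as a guess rather than a confirmed diagnosis: if Nagios reads the command's output through a pipe until end-of-file, a grandchild that still holds the inherited stdout/stderr open will keep that pipe alive no matter how many times the script forks. The sketch below is not the double-fork from the post; it swaps in a different technique, starting the worker in its own session with all three standard streams pointed at /dev/null. The function name spawn_detached is hypothetical.

```python
import subprocess


def spawn_detached(argv):
    """Start a long-running worker that shares nothing with the caller.

    The worker gets its own session and no inherited stdin/stdout/stderr,
    so nothing it does can hold the parent's output pipe open.
    """
    return subprocess.Popen(
        argv,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,  # setsid(): detach from the caller's session
        close_fds=True,          # drop any other inherited descriptors
    )
```

The notification script would call spawn_detached([...heavy worker...]), print 'OK', and exit 0 without waiting on the returned process.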
Re: [Nagios-users] ServiceGroup Best Practices Question
On 3/10/11 12:25 PM, David Harbaugh wrote:
> I'm new to Nagios. Running Nagios 3.2.3. I want to start using Service
> Groups, but I'm not sure of the best place to put the service group
> definitions. What is making me question the location is eventually I
> will want to create a service group that contains services hosted on
> both Linux and Windows machines, so I'm thinking of creating a new
> config file to hold the service groups, then in nagios.cfg use
> cfg_file= to load it in after the Windows and Linux machines are
> loaded. Where do you create service groups?

I put all my service groups in 'service_groups.cfg'. The order in which you specify them in nagios.cfg does not matter.

-- Mike Lindsey
Re: [Nagios-users] nagios server redundancy
On 2/11/11 10:26 AM, Morty wrote: > I'm looking to implement redundant nagios servers, with the backup > server in a different location than the prime server. This is nagios > 3.2.3, with the default web interface. I'm synchronizing > configurations by rsyncing /usr/local/nagios/etc/ between systems. > I'm doing active/active (i.e. I want the backup server monitoring at > the same time as the prime server.) So far so good. > > Problem: acknowledgements on the prime are not being synced to the > backup. > > Is there a (clean) way to sync the prime's acknowledgements to the > backup, as well? I'm tempted to shut down the backup, rsync the > prime's var directory to the backup, and then bring the backup back > online. But the docs have various warnings about not messing with the > var files, so figured I'd ask about possible hidden gotchas. > > I've read http://nagios.sourceforge.net/docs/3_0/redundancy.html, but > scenario one doesn't discuss syncing acknowledgements, and scenario 2 > is active/passive. What I end up doing with my backup master is leave it off, with frequent rsyncs of both config and the status files in var. Both the active master and the backup master are sitting behind a load balanced vip, with the nsca and http/https ports managed by the load balancer. There's a cronjob running on the backup master that, if it determines an error on the active master, starts up nsca, nagios, and apache. That causes the vip to fail over to the backup master, giving automatic recover with no more than five minutes of downtime (the frequency of the cronjob). The active master does not have apache, nsca, or nagios configured to start on boot, instead those are also managed by a cronjob that does a check of the backup master. If the backup master is running apache/nagios/nsca, then the active master doesn't start up (and if they're already running, say from an intermittent error, they shut down) and the rsyncs also don't happen. 
This allows me to do automatic failover, and manual fail-back, after
whatever issue triggered the failover has been verified and resolved.

You cannot - to the best of my knowledge - sync acknowledgments to a
backup server while it's actively running, unless you want to write
something that checks for new acks and dumps them into the command
pipe. So, if you want to maintain acks and downtime, you'll need to
have your backup disabled for the syncs.

--
Mike Lindsey

___
Nagios-users mailing list
Nagios-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting any issue.
::: Messages without supporting info will risk being sent to /dev/null
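For anyone who does want to try the "dump them into the command pipe" route, here is a minimal sketch of the formatting half only. It assumes the standard ACKNOWLEDGE_SVC_PROBLEM external command; the function name format_svc_ack, the host and service names, and the pipe path in the usage comment are made up for illustration, and finding new acks on the primary is left out entirely.

```bash
#!/bin/sh
# Sketch only: print one ACKNOWLEDGE_SVC_PROBLEM external command line.
# Field order after the command name:
#   host;service;sticky;notify;persistent;author;comment
# The sticky/notify/persistent values are passed in by the caller because
# their meaning should be checked against the docs for your Nagios version.
format_svc_ack() {
    # $1=unix time $2=host $3=service $4=sticky $5=notify $6=persistent
    # $7=author $8=comment
    printf '[%s] ACKNOWLEDGE_SVC_PROBLEM;%s;%s;%s;%s;%s;%s;%s\n' \
        "$1" "$2" "$3" "$4" "$5" "$6" "$7" "$8"
}

# Hypothetical usage on the backup server (the pipe path depends on the install):
#   format_svc_ack "$(date +%s)" web01 HTTP 2 0 1 mike "synced from primary" \
#       > /usr/local/nagios/var/rw/nagios.cmd
```

Keeping the formatting in its own function makes it possible to check the generated line by eye before anything is written to a live command pipe.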
Re: [Nagios-users] Nagios storing data into PostgreSQL?
Another potential option is setting up a 'postgres' contact, with a
custom notification command. Pass the data you want to store to the
notification command, and let that dump the data to the database.

This can potentially be as simple as a bash script that takes the
input, builds a sql statement and echoes it to the postgresql client
binary.

On 2/8/11 3:13 PM, Michael Friedrich wrote:
> On 07.02.2011 22:49, larry johnson wrote:
>> Hello, i am newbie to linux/Nagios and need help to clear some
>> doubts. I wonder whether it is possible to make Nagios write
>> notifications (host up/host down, for example) into a PostgreSQL
>> database?
>
> if you write your own NEB broker module, and put that onto libpq or
> similar, the core will be able to, sure. Or you'll have a look at
> Icinga IDOUtils which has supported Postgresql for quite a while now.
>
>> I found NDOUtils, but this addon does not suit me because i don't
>> use MySQL.
>
> Well there aren't that many alternatives to that. Merlin supports
> MySQL and Oracle (in development on git). I'm not sure if the Centreon
> Broker is already released which *should* support more RDBMS.
>
> But for notifications only, why not use event handlers? then you
> could call scripts putting data into your rdbms the preferred way.
>
> http://nagios.sourceforge.net/docs/3_0/eventhandlers.html
>
> kind regards,
> Michael
>
>> Also found that this kind of storage is supported under Nagios 1.x,
>> but what about 3.x? I run Nagios 3.2.3 (with 1.4.15 plugins) on
>> openSUSE 11.3. Regards.

--
Mike Lindsey
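A rough sketch of the "bash script that builds a sql statement" idea from this thread. The table name nagios_notifications, its columns, and the database name in the usage comment are all hypothetical; the quoting helper only doubles single quotes, so treat it as an illustration rather than a hardened escaping routine.

```bash
#!/bin/sh
# Sketch only: build an INSERT statement from notification data.

sql_quote() {
    # Double any single quotes so the value fits inside a SQL string literal.
    printf '%s' "$1" | sed "s/'/''/g"
}

build_insert() {
    # $1=host $2=state $3=plugin output
    # nagios_notifications is a made-up table name for this example.
    printf "INSERT INTO nagios_notifications (host, state, output) VALUES ('%s', '%s', '%s');\n" \
        "$(sql_quote "$1")" "$(sql_quote "$2")" "$(sql_quote "$3")"
}

# Hypothetical use inside a host notification command:
#   build_insert "$NAGIOS_HOSTNAME" "$NAGIOS_HOSTSTATE" "$NAGIOS_HOSTOUTPUT" \
#       | psql -d monitoring
```

Printing the statement and piping it to psql keeps the script easy to dry-run: leave off the pipe and the SQL goes to the terminal instead of the database.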
Re: [Nagios-users] Check behavior during the notification event
On 1/5/11 11:27 PM, Yu Watanabe wrote:
> Thank you for the reply.
>
> I understood that notification events will hang up the normal service
> check events.
>
> I was a bit curious about your comment.
>
>> A lot of people end up writing external notification handlers to take
>> the load off of Nagios so the scheduled checks can continue whilst
>> the external app queues and processes the notifications.
>
> If you could share your knowledge it would be helpful.
> Do people create external applications that scan the nagios.log
> without using any of the Event Handlers or Notification Events of
> Nagios? Or perhaps the event broker?

What I ended up doing was having notification commands that drop a
spool file, containing the Nagios environment macros, into a directory.
A second daemon reads in the spool files for the notification event,
collects all the metadata (product dependencies, runbook links, ticket
links, etc.), and caches it for the subsequent contacts.

The spool file write is very quick, letting Nagios get back to dealing
with check submission handling, and the secondary daemon takes what was
a serial, blocking process and does all the heavy lifting for
notification generation and email in a fairly parallel fashion.

A notification for us will include upwards of 20 contacts (some email
lists, some individuals, some ticketing and tracking systems, and some
pagers). At the height of the "bad times" Nagios would block for 40 or
so seconds, sending out every single notification serially. Just
dumping out a spool file for each contact happens in a small fraction
of a second.

Pager contacts are still straight Nagios to postfix, because simple is
better for anything where you're actually waking someone up at 3am.

--
Mike Lindsey
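A minimal sketch of the spool-file half of that setup, assuming environment macros are enabled so the notification command sees NAGIOS_* variables. The function name write_spool, the spool directory and the file naming are invented for the example, and the daemon that reads the spool is not shown.

```bash
#!/bin/sh
# Sketch only: dump the NAGIOS_* environment into a spool file and return
# quickly, so Nagios is not held up by slow notification work.

write_spool() {
    # $1 = spool directory (hypothetical location chosen by the admin)
    spool_dir=$1
    # Write to a hidden temp name first so a reader never sees a partial file.
    tmp=$(mktemp "${spool_dir}/.tmp.XXXXXX") || return 1
    # Keep only the macros exported for this notification. Values containing
    # newlines would be truncated by this line-based filter.
    env | grep '^NAGIOS_' > "$tmp" || true
    # A rename within one filesystem is atomic.
    mv "$tmp" "${spool_dir}/notify.$$.$(date +%s)"
}

# Hypothetical notification command body:
#   write_spool /var/spool/nagios-notify
```

The write-then-rename step is the design choice that matters here: the reading daemon can pick up any notify.* file it finds without worrying about whether Nagios has finished writing it.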
Re: [Nagios-users] Missing notifications?
On 1/3/11 11:53 AM, James Moseley wrote:
>> I was directed in the irc forum to look at
>> http://nagios.sourceforge.net/docs/3_0/statetypes.html and I learned
>> that services or hosts need to have a hard state failure, in order
>> for notifications to go out. I've been trying to reboot my servers in
>> order to get my notifications to go out. I'd like to check that I'll
>> actually get notifications as soon as there is a problem, for
>> example, a host goes offline, I'd like to know after about 2 minutes
>> if possible!
>
> Then just turn a server off and wait a bit... ;-)
> You could also create a fake host, one with an unpingable IP address.

My favorite trick, that doesn't require adding fake config or rebooting
a server, is updating /etc/hosts on the monitoring host to point a real
host at a known bad IP address. It requires root, and that you use DNS
names for your host address entries, however.

--
Mike Lindsey
Re: [Nagios-users] NAGIOS_ environment variables in a notification script
On 12/22/10 6:17 AM, Marc Haber wrote:
> Despite having set enable_environment_macros=1 in my nagios.cfg, the
> notification script only sees NAGIOS_PLUGIN=/path/bin/notify.
>
> What am I doing wrong?
>
> I'm using Nagios 3.0.6 from Debian lenny. Any hints will be
> appreciated.

enable_environment_macros should override
use_large_installation_tweaks, which is what can also disable
environment macros. Perhaps your version is not acting as expected?
See if you have u_l_i_t enabled, and if so, try disabling it.

If that isn't it, try setting debug_level=2 (and debug_file, etc).
Restart and check the debug output to see if it's actually seeing the
config directive. Perhaps you have a typo.

Then maybe set debug_level=32 and run a few notification tests (or just
set it to 34 initially so you get notification and configuration
debugging)...

Also, consider upgrading. Nagios 3.2+ is great.

--
Mike Lindsey
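A small helper sketch for the "perhaps you have a typo" step: it just lists the directives discussed in this message as they actually appear in a config file. The function name show_macro_settings is made up, and the path in the usage comment is only a guess at a Debian layout.

```bash
#!/bin/sh
# Sketch only: print the lines of a nagios.cfg that set the directives
# mentioned in this thread, so typos or duplicate settings stand out.

show_macro_settings() {
    # $1 = path to nagios.cfg
    grep -E '^[[:space:]]*(enable_environment_macros|use_large_installation_tweaks|debug_level|debug_file)[[:space:]]*=' "$1"
}

# Hypothetical usage:
#   show_macro_settings /etc/nagios3/nagios.cfg
```

A misspelled directive will simply be missing from the output, which is usually quicker to spot than reading through the debug log.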
Re: [Nagios-users] Nagios kept from restarting after reboot by lock file
On 12/20/10 8:16 AM, eric.b...@barclayscapital.com wrote:
> Alternatively, could you recommend a good system/resource monitoring
> tool that would be able to let me know if nagios is down and restart
> it automatically?

Add a cronjob on a five (or whatever you're comfortable with) minute
interval, similar to:

    #!/bin/bash
    PATH=/bin:/usr/bin:/usr/local/bin
    LOCKFILE=/home/nagios/nagios/var/nagios.lock
    PID=`cat "${LOCKFILE}"`
    # kill -0 sends no signal; it only tests whether the PID can be signalled.
    if ! kill -0 "${PID}" 2>/dev/null
    then
        rm "${LOCKFILE}"
        # INSERT RESTART COMMAND HERE
        echo "Killed Lockfile and restarted Nagios" | mail -s "Nagios restart `hostname`" your-em...@here.com
    fi

Just be aware that it'll also trigger that if block, if nagios is
running under a different username. You can check for that by doing
some tests in the script with ps and grep.

> _
> From: Berg, Eric: IT (NYK)
> Sent: Monday, December 20, 2010 11:03 AM
> To: 'nagios-users@lists.sourceforge.net'
> Subject: Nagios kept from restarting after reboot by lock file
>
> Gee, this seems like an annoying newbie problem, but if Nagios crashes
> or is killed (as on system reboot), it leaves a lock file around that
> prevents it from starting again until the lock file is manually
> removed.
>
> I see this on Monday mornings after weekend reboots on a Red Hat Linux
> box:
>
> nagios: Lockfile '/home/nagios/nagios/var/nagios.lock' looks like its
> already held by another instance of Nagios (PID 0). Bailing out...

Sounds like something in the shutdown process is throwing a 0 into the
pid file, or the startup in the rc script is. Either way, you should
never have a 0 in there; either the rc script is putting the wrong data
in there, or it's reporting incorrectly.

> Does anyone know if there's a config option or something else that
> obviates the need to write a wrapper script to check to see if Nagios
> is really running and remove the lock file (looks like Nagios already
> knows it's not running by virtue of the value of the PID in this very
> message!) so that it can cleanly start up again?

--
Mike Lindsey
Re: [Nagios-users] CPU monitor for a single Linux user space process ?
On 12/15/10 5:05 PM, Bruce Edge wrote:
> Rookie question here. Trying to determine nagios suitability for an
> embedded app.
>
> Can I monitor the CPU utilization for a single user space process on a
> Linux box with nagios?
> And, can I define an action if it exceeds a threshold?

Sounds like you need check_snmp_process.pl from here:
http://nagios.manubulon.com/snmp_process.html

I've been using it, it works quite well. It requires snmpd, but is
basically the swiss army knife of user-space process monitoring.

To "define an action" you need to set up an event handler.
http://nagios.sourceforge.net/docs/3_0/eventhandlers.html

--
Mike Lindsey
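As a rough idea of what the event handler side can look like, here is a sketch of the decision logic only. It assumes the handler is invoked with the $SERVICESTATE$ and $SERVICESTATETYPE$ macros as its first two arguments; the function name handler_action and the restart script in the usage comment are hypothetical.

```bash
#!/bin/sh
# Sketch only: decide whether an event handler should do anything.
# Acts only once the service is in a HARD CRITICAL state, i.e. after the
# threshold has been exceeded for max_check_attempts checks in a row.

handler_action() {
    # $1 = service state, $2 = state type (SOFT or HARD)
    case "$1" in
        CRITICAL)
            case "$2" in
                HARD) echo act ;;
                *)    echo skip ;;
            esac
            ;;
        *)
            echo skip
            ;;
    esac
}

# Hypothetical use at the bottom of the handler script:
#   [ "$(handler_action "$1" "$2")" = act ] && /usr/local/bin/restart-my-process
```

Waiting for the HARD state is a deliberate choice: a single CPU spike that clears on the next check never triggers the action.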
Re: [Nagios-users] 3.2.1 and non-incrementing notification count?
Had bad source. Recompiled and it's gone away.

On Aug 15, 2010, at 5:37 PM, Mike Lindsey wrote:
> I just migrated to a 3.2.1 instance, from a 3.0.6 instance, with the
> same configuration. Now I have some UNKNOWN results that are
> generating a notification every five minutes. The notification
> interval on the services is 720 minutes, and same for the service
> escalation.
>
> Every five minutes I'm getting a new email from the services, and the
> notification counts in the ui still reads '0'.
>
> Anyone seen this before?
>
> --
> Mike Lindsey
[Nagios-users] 3.2.1 and non-incrementing notification count?
I just migrated to a 3.2.1 instance, from a 3.0.6 instance, with the
same configuration. Now I have some UNKNOWN results that are generating
a notification every five minutes. The notification interval on the
services is 720 minutes, and same for the service escalation.

Every five minutes I'm getting a new email from the services, and the
notification count in the UI still reads '0'.

Anyone seen this before?

--
Mike Lindsey
Re: [Nagios-users] Escalate after X warnings or criticals
If it hasn't, I'll be adding it myself and will be happy to submit my
patches back. I've been needing this functionality for a while, and was
planning on rolling it in over the next 2-3 months.

Andrew Li wrote:
> Does anyone know if the notification count problem got fixed in 3.2.1?
>
> I had a read of the ChangeLog but it doesn't mention anything related
> to this problem since 3.0.6.

--
Mike Lindsey
Re: [Nagios-users] extra "checkresults" files being left behind
Mathew Walker wrote:
> I'm running Nagios on a little VPS box checking a few hosts/services
> (~50 checks). It's mostly a testing platform for me and checks in on
> my other test VPS systems.
>
> However I keep seeing the extra check results data files build up in
> /usr/local/nagios/var/spool/checkresults like:
> -rw--- 1 nagios nagios 249 Jun 7 23:45 checknbu01O
> -rw--- 1 nagios nagios 252 Jun 8 02:40 checkHxcsiJ
>
> Googled a bit and didn't come up with much relevant. Any thoughts?

If I remember correctly, the parent nagios process writes out that
file, then forks a child. The child then runs the check, updates that
file and then creates a file with the same name, plus '.ok', in that
directory, letting the parent process know the check is completed.

So, take a look at the contents of several of those files. If you're
lucky, you'll see that either they are for the same host, or the same
service check. If so, there might be something in the way that host or
service is getting polled that is causing the forked child to die.

Also, if you're running a version older than 3.0rc1 (it's generally a
good thing to include the version of the tool you're using when asking
for help) then you may want to upgrade; that version fixed a bug that
might be related:

"Fixed bug with not deleting old check result files that contained
results for invalid host/service"

--
Mike Lindsey
[Nagios-users] Escalations - Warning to Critical, without skipping?
So, here's my situation. I've got around 10k checks. Warnings do not
notify, because we have historically had issues with Warning
notifications (from the contact group setting) going out, then a
service turning critical and the pager escalations (which only include
critical) skipping directly to "Page everyone, and a couple managers"
because we'd already had 3 warning-level notifications.

So, now all contacts have warning notifications disabled. Which leads
to missed events.

Is there any way to notify on warnings, without incrementing the
notification count, and affecting escalations?

What I want is: Warnings notify, and when a service turns Critical, it
always starts at step 1 of the escalations. That way, ops and dev can
get notifications about service issues, before we get to the point
where we need to page about it. And when it does get to be paging time,
nagios isn't waking up management at 4am.

I'd love to avoid having duplicate service checks, with a "warning"
check that has warning notifications enabled, and a "critical" check
with warning notifications disabled.

Ideal would be some manner of having split escalations, where it tracks
the number of notifications of a specific state, and escalates based on
that, but it looks like that requires some serious refactoring of the
code.

(Running 3.0.6)

--
Mike Lindsey
Re: [Nagios-users] Full Throttle Nagios
Marcel wrote:
> When I have more than, say, 10k checks, I start seeing check latency
> rise and there just isn't anything that can be done; even distributed
> monitoring has the nagios.cmd write-lock bottleneck.

So, I've just gone through this, and the single greatest bottleneck I
had to deal with is notifications. But, I have a lot of people in the
notification tree, and pull in a lot of metadata to make ticket
tracking and issue resolution easier and faster.

Since Nagios needs to know the exit status of notification commands, it
doesn't fork before notifications. It just plods along waiting for the
notification command to exit.

I switched all our non-pager notification commands to drop a spool file
in a directory, letting another process read the spool files, generate
email contents, query ticket databases, pull in documentation or
extended testing information (full mysql processlist output, for dbas,
etc.) and cache it for subsequent notifications for that event. That
showed a HUGE improvement to my master server's performance.

If notifications aren't your bottleneck, you can move all your
temporary files to ramdisk.

You can also increase your FIFO pipe size, but that only delays the
issue and doesn't really solve the problem if you're always running
hot. It also probably involves recompiling your kernel.

If you're using nsca, you can cache your updates for a second or two,
so that multiple updates happen in the same socket connection.
Alternately (or additionally) you can have nsca update the checkresults
directory directly, skipping the steps where nagios reads the command
pipe and then just writes it back out to the checkresults directory. I
can package up a patch (against 2.7.2) of those last couple changes (I
need to submit them, anyway).

If you're manlier than I might be, you could also consider modifying
the core nagios to allow submissions from distributed nagios servers,
directly to a socket, but doing that right might require serious
threaded C foo, and depending on your OS and threading library, you
might be locked to a single core.

So, you have options. They're not all equal, and aren't all easy. But
you wouldn't be working with monitoring if you didn't like
challenges... :)

--
Mike Lindsey
Re: [Nagios-users] nagios without web interface
Leonardo Carneiro - Veltrac wrote:
> Hi. I want to compile nagios without the web interface. I think that i
> should include these parameters in the configure:
>
> --disable-statusmap
> --disable-statuswrl
> --without-httpd-conf
>
> Is this right? There is anything else that i should include (or
> exclude)?

When you run make, just do:

    make nagios
    make install-base

You could also build everything, and just skip the cgi install, and
that would probably take less time than getting your answer took.

--
Mike Lindsey
Re: [Nagios-users] Customizing notifications
Chip Burke wrote:
> I have a request to "plain English"-ify my notifications. One item I
> have been asked for is when the service state changes, to report the
> duration of the previous service state.
>
> Example: HTTP is now OK after 00:02:35 of down time.
>
> Is there an easy way to do this? It seems Nagios doesn't offer a Last
> State Duration macro, so I am assuming this is going to be a matter of
> some sort of custom scripting. Has anyone had experience with this
> sort of thing?

Likely, your best option will be to set up an event handler script for
that service. If you already have event handlers configured, and you
want this logic to run everywhere, consider setting up a script like
this for your global event handler.

In the event handler, you will want to touch a file in /tmp based on
the host, service, and state, whenever there's a hard state change.
Like, /tmp/localhost-load-ok... You could even simplify if all you care
about is ok/not ok.

Then in your notification script, just check for the presence of those
files, and do your date calculation by pulling the modification date
out with stat (or script code, if your notification command isn't a
chunk of bash).

Something like:

    now=`date +%s`
    # stat -f "%m" is the BSD form; GNU stat uses: stat -c %Y
    if [ "${NAGIOS_LASTSERVICESTATE}" = "OK" ]
    then
        filetime=`stat -f "%m" /tmp/localhost-load-notok`
    else
        filetime=`stat -f "%m" /tmp/localhost-load-ok`
    fi
    time=`echo ${now} - ${filetime} | bc`
    echo "${NAGIOS_SERVICEDESC} is now ${NAGIOS_SERVICESTATE} after ${time} seconds."

You might want to flesh it out with some file-exists tests as well.

Good luck!

--
Mike Lindsey
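Since the original request showed the duration as 00:02:35 rather than a raw number of seconds, one possible finishing touch is a small formatter. This is only a sketch; the function name format_duration is made up and it expects a non-negative whole number of seconds.

```bash
#!/bin/sh
# Sketch only: turn a number of seconds into HH:MM:SS.

format_duration() {
    # $1 = elapsed seconds (non-negative integer)
    secs=$1
    printf '%02d:%02d:%02d\n' \
        $((secs / 3600)) $((secs % 3600 / 60)) $((secs % 60))
}

# Hypothetical use in the notification script, with ${time} in seconds:
#   echo "HTTP is now OK after $(format_duration "${time}") of down time."
```

Hours are not capped at 24, so an outage of more than a day shows up as, say, 30:00:00 instead of wrapping around.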
Re: [Nagios-users] Event Handlers
Marc Powell wrote:
> On Feb 3, 2010, at 8:16 AM, Jeff wrote:
>
>> I have a service that needs to be monitored every minute. I need some
>> help understanding how services go from soft to a hard state
>
> When a service check results in a non-OK state, services go from a
> Soft to a Hard state when they reach max_check_attempts.
> http://nagios.sourceforge.net/docs/3_0/statetypes.html
>
>> and if an event handler can be run after a service has gone into a
>> hard state.
>
> Only for its initial Hard problem state or initial Hard recovery
> state. http://nagios.sourceforge.net/docs/3_0/eventhandlers.html
>
>> I'm sure everyone has a very dynamic and custom environment to some
>> extent. I have event handlers that will not run if a lock file is
>> present (cause i am deploying code or so other scripts do not step on
>> each other). So for this service that I monitor every minute, I have
>> Max Retries set to 3, Check Interval is 1, and retry interval is 1.
>> Can someone help shed some light on how I can get an event handler to
>> run again after a service has gone into a hard state?
>
> You can't really... The only real facility nagios has to do this (that
> I can think of right now) is is_volatile
> (http://nagios.sourceforge.net/docs/3_0/objectdefinitions.html#service)
> but that's probably overkill for your needs; particularly the
> notification implications.

The other possibility for having something run every time the service
is checked, is to configure your ocsp_command. Not exactly what it's
generally used for, but it'll do in a pinch.

--
Mike Lindsey
Re: [Nagios-users] Overloaded master
A typical first tier notification goes to 20 people. One of those will
be a pager, and is very simple. The rest are fairly complex.

Notifications include a link to existing and recent tickets in our
ticketing system (this also allows me to not send a ticket opening
notification if a ticket already exists). I populate the notification
with links to cacti graphs and links to wiki documentation for the
event, as well as fire off a secondary notification handler that adds
in additional information based on the host, service, and state.

The first notification of the cycle does all the heavy lifting and
takes about 6 seconds. The other 19 finish relatively quickly.

I've been thinking of building a notification server - so I could have
separate and discrete notification escalations for different service
states - which would also let me fire off one notification with just
the contents of $ENV{NAGIOS_*}.. Perhaps that's my best option?

Martin Melin wrote:
> What kind of notifications are you doing and how many are you sending
> out? Why does a notification cycle take 9 seconds to complete?
>
> On Sat, Jan 23, 2010 at 12:13 AM, Mike Lindsey
> <mike-nag...@5dninja.net> wrote:
>
>> What kind of options does one have, if your master nagios server is
>> getting overloaded?
>>
>> I have half a dozen slaves doing polling, submitting passive check
>> results back via send_nsca. The master does no active polling, just
>> event processing, notifications, and web ui.
>>
>> Under normal circumstances, it works alright. But after a restart it
>> can take up to half an hour before the master catches up; and if
>> there are a lot of events, the act of sending out notifications can
>> cause it to fall behind.
>>
>> I'm pre-caching my object file, I'm skipping circular dependency
>> checks, and I've gotten a notification cycle down to 9 seconds. I
>> tried modifying nagios to fork before notifications, but that failed
>> pretty spectacularly; so that 9 seconds is a time where 900 or so
>> passive check submissions block until the notifications are done.
>>
>> Are there any options for running a dual-master setup, or other ways
>> to spread the load across multiple machines?
>>
>> Has anyone patched nsca to submit check results into the checkresults
>> directory, instead of via the nagios.cmd pipe? What kind of
>> improvement can one expect from that?
>>
>> Any other advice?

--
Mike Lindsey
[Nagios-users] Overloaded master
What kind of options does one have, if your master nagios server is
getting overloaded?

I have half a dozen slaves doing polling, submitting passive check
results back via send_nsca. The master does no active polling, just
event processing, notifications, and web ui.

Under normal circumstances, it works alright. But after a restart it
can take up to half an hour before the master catches up; and if there
are a lot of events, the act of sending out notifications can cause it
to fall behind.

I'm pre-caching my object file, I'm skipping circular dependency
checks, and I've gotten a notification cycle down to 9 seconds. I tried
modifying nagios to fork before notifications, but that failed pretty
spectacularly; so that 9 seconds is a time where 900 or so passive
check submissions block until the notifications are done.

Are there any options for running a dual-master setup, or other ways to
spread the load across multiple machines?

Has anyone patched nsca to submit check results into the checkresults
directory, instead of via the nagios.cmd pipe? What kind of improvement
can one expect from that?

Any other advice?

--
Mike Lindsey
Re: [Nagios-users] nagios blocking on notifications?
Turns out nagios doesn't fork before handling notifications, and also waits for the children of any notification commands to exit, so forking inside my notification script won't help.

I took the part of the script that was taking 5-6 seconds to complete and added in a cache mechanism, which changed the 90+ second notification cycle to a 6-8 second notification cycle.

Might be overkill, but I've also wrapped some fork() logic around the service_notification() call inside handle_async_service_check_result(). Compiles and runs; I'll stress test it tonight and see how it does with real load tomorrow.

Also, if there's a better way to do this, I'm all ears.

Mike Lindsey wrote:
> I've got a high volume site. Everything seems to keep up reasonably
> well, unless there are a good number of state changes. Once services
> start changing state, and notifications start getting sent out, nagios
> falls behind.
>
> Did some digging in the logs and it looks like while a batch of
> notifications is being sent out, it's rate limiting to about one per
> five seconds. Also, from the first notification for a service to the
> last notification for that service, nothing else is written to the logs.
>
> Since a typical notification goes out to 15+ people, that's over a
> minute with no service check handling.
>
> Is there something going on under the hood that I'm not aware of (like,
> is it just not doing the log writing, but still doing the passive
> service check handling, and there's something else causing my latency?)
>
> Is that delay configurable? I don't see anything in the docs for that.
>
> I've even set my notification script to just call and background a
> secondary script, to try and see if it wasn't a delay in the
> notification script, but that seemed not to do anything at all. Should
> I be forking the notification script instead?
>
> Here's a log snippet:
> [1263505735] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;;System Check;0;OK load mem ntp swap cfengine disk|
> [1263505735] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;;System Check;0;OK load mem ntp swap cfengine disk|
> [1263505735] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;;System Check;1;WARNING [swap utilization 25%] [/data/ at 77% (inodes 0%)]|
> [1263505735] PASSIVE SERVICE CHECK: ;check_mtime-redlist.txt;0;OK - redlist.txt 102 seconds old
> [1263505735] PASSIVE SERVICE CHECK: ;pre_queuedepth;2;CRITICAL - pre_queuedepth status: 2159 > 500
> [1263505735] SERVICE NOTIFICATION: ;;pre_queuedepth;CRITICAL;notify-by-email;CRITICAL - pre_queuedepth status: 2159 500
> [1263505741] SERVICE NOTIFICATION: ;;pre_queuedepth;CRITICAL;notify-by-email;CRITICAL - pre_queuedepth status: 2159 500
>
> The SERVICE NOTIFICATION entries keep rolling in every 5-6 seconds for
> the next minute+, then it goes back to its usual happy speed.
>
> Is this an artifact of the way it logs, or is the whole system choking
> while it sends email? I've searched the list archives and not found
> anything on this.

--
Mike Lindsey
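The post above does not say what the cache held or how it was built, so the following is only a sketch of one way such a cache could look. Since every notification runs as a separate process, an in-memory cache would not survive between runs; this version keeps the slow step's result in a small JSON file and reuses it while it is fresh. The function name `cached_call`, the cache path, and the idea that the slow step returns JSON-serialisable data are all assumptions.

```python
import json
import os
import tempfile
import time


def cached_call(cache_path, max_age, slow_func):
    """Return slow_func()'s result, reusing a cached copy newer than max_age seconds."""
    try:
        if time.time() - os.path.getmtime(cache_path) < max_age:
            with open(cache_path) as fh:
                return json.load(fh)
    except (OSError, ValueError):
        pass  # missing, unreadable, or corrupt cache: fall through and rebuild

    result = slow_func()

    # Write to a temp file in the same directory, then rename it into place,
    # so a concurrent notification never reads a half-written cache file.
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(cache_path) or ".")
    with os.fdopen(fd, "w") as fh:
        json.dump(result, fh)
    os.replace(tmp_path, cache_path)
    return result
```

In a notification script this might be called as `cached_call("/tmp/notify-cache.json", 300, slow_lookup)`, where `slow_lookup` stands in for whatever takes the 5-6 seconds. With 15+ contacts per notification, only the first run in each window pays the slow cost.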
[Nagios-users] nagios blocking on notifications?
I've got a high volume site. Everything seems to keep up reasonably well, unless there are a good number of state changes. Once services start changing state, and notifications start getting sent out, nagios falls behind.

Did some digging in the logs and it looks like while a batch of notifications is being sent out, it's rate limiting to about one per five seconds. Also, from the first notification for a service to the last notification for that service, nothing else is written to the logs.

Since a typical notification goes out to 15+ people, that's over a minute with no service check handling.

Is there something going on under the hood that I'm not aware of (like, is it just not doing the log writing, but still doing the passive service check handling, and there's something else causing my latency?)

Is that delay configurable? I don't see anything in the docs for that.

I've even set my notification script to just call and background a secondary script, to try and see if it wasn't a delay in the notification script, but that seemed not to do anything at all. Should I be forking the notification script instead?

Here's a log snippet:
[1263505735] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;;System Check;0;OK load mem ntp swap cfengine disk|
[1263505735] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;;System Check;0;OK load mem ntp swap cfengine disk|
[1263505735] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;;System Check;1;WARNING [swap utilization 25%] [/data/ at 77% (inodes 0%)]|
[1263505735] PASSIVE SERVICE CHECK: ;check_mtime-redlist.txt;0;OK - redlist.txt 102 seconds old
[1263505735] PASSIVE SERVICE CHECK: ;pre_queuedepth;2;CRITICAL - pre_queuedepth status: 2159 > 500
[1263505735] SERVICE NOTIFICATION: ;;pre_queuedepth;CRITICAL;notify-by-email;CRITICAL - pre_queuedepth status: 2159 500
[1263505741] SERVICE NOTIFICATION: ;;pre_queuedepth;CRITICAL;notify-by-email;CRITICAL - pre_queuedepth status: 2159 500

The SERVICE NOTIFICATION entries keep rolling in every 5-6 seconds for the next minute+, then it goes back to its usual happy speed.

Is this an artifact of the way it logs, or is the whole system choking while it sends email? I've searched the list archives and not found anything on this.

--
Mike Lindsey
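To put numbers on the gaps described above, here is a small Python sketch that pulls the epoch timestamps out of nagios.log lines and reports the seconds between consecutive SERVICE NOTIFICATION entries. The sample lines are copied from the snippet above with the elided fields left as they are; the function name and regular expression are my own, and this makes no claim about how nagios itself times anything.

```python
import re

# "[<epoch>] <ENTRY TYPE>: <rest>" as seen in the nagios.log snippet above.
LOG_LINE = re.compile(r"^\[(\d+)\] ([A-Z ]+): ")


def notification_gaps(lines):
    """Return the seconds between consecutive SERVICE NOTIFICATION entries."""
    stamps = []
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group(2) == "SERVICE NOTIFICATION":
            stamps.append(int(match.group(1)))
    return [later - earlier for earlier, later in zip(stamps, stamps[1:])]


sample = [
    "[1263505735] PASSIVE SERVICE CHECK: ;pre_queuedepth;2;CRITICAL - pre_queuedepth status: 2159 > 500",
    "[1263505735] SERVICE NOTIFICATION: ;;pre_queuedepth;CRITICAL;notify-by-email;CRITICAL - pre_queuedepth status: 2159 500",
    "[1263505741] SERVICE NOTIFICATION: ;;pre_queuedepth;CRITICAL;notify-by-email;CRITICAL - pre_queuedepth status: 2159 500",
]

print(notification_gaps(sample))  # [6]
```

At roughly 5-6 seconds per entry, a notification that goes out to 15 contacts accounts for about 75-90 seconds, which matches the "over a minute" stall described above.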