Hello, thanks a lot for providing this script, but I have a problem:
A Slave-Node-Check returns "CRITICAL", Slave might be down. The check results of the machines are still passed to the master normally, although the check returns critical. When I manually restart opsview-slave, the tunnel is re-established and the check returns "OK".

But when I manually execute the retrieve_opsview_slave script (while the Slave-Node-Check returns critical), it returns "ok = nsca = 0" and adds the timestamp to the temporary txt file. The timestamp seems to be okay, because the small script isn't restarting the opsview-slave process, although the node check returns critical.

Any ideas? Thanks a lot for your help!

Greets
Mario

From: Andrew Hall <[email protected]>
To: opsview-users <[email protected]>
Date: 22.05.2009 18:17
Subject: Re: [opsview-users] periodic autossh issues with reverse tunnel dropping
Sent by: [email protected]

On 2009-05-22 10:25, Ton Voon wrote:
> Can you model this nagios plugin on check_opsview_slave_cluster? Maybe
> call it check_opsview_slave_communication?
> For extra bonus points, you could create this service automatically on
> all slaves when the reverse_ssh flag is set.

OK, time to fess up - I couldn't write a perl script if my life depended on it. I can just about read the work of others, but the Llama book continues to gather dust on my bookshelf.

However, I have written an extraordinarily simple bash script which, after a fair bit of testing, seems to function just fine. Unfortunately it resides outside of Opsview, but if anyone wishes to take the logic and run with it then please feel free.

Anyhow, if you're interested, here's what we now do to fix this issue...

1. Edit the retrieve_opsview_info script on the slave and add this line just under use strict;

system ("/bin/date +%s > /usr/local/nagios/tmp/slave-check-time.txt");

(See, I can add bash into perl just fine!)

2. Add the following script which cron runs every minute:

#!/bin/bash
NOW=$(/bin/date +%s)
SLAVECHECK=$(/bin/cat /usr/local/nagios/tmp/slave-check-time.txt)
DIFFERENCE=$(($NOW-$SLAVECHECK))
[ $DIFFERENCE -gt 330 ] && /etc/init.d/opsview-slave restart && /bin/echo "Slave connection down! Re-starting opsview-slave service on `hostname`..." | /bin/mail -s "opsview slave restart: `hostname`" [email protected] && /bin/date +%s > /usr/local/nagios/tmp/slave-check-time.txt && echo $NOW $SLAVECHECK $DIFFERENCE

A frighteningly simple one-liner that I'm sure could be ripped to shreds / re-written much better but hey - it works.

So essentially...

1. The check from the master runs as normal and the timestamp on the slave updates.
2. The script checks the time and, as less than 330 seconds have passed, it does nothing.
3. The listening port on the master dies and the timestamp on the slave no longer updates.
4. The script checks the time and more than 330 seconds have passed so it...
   a. re-starts the opsview-slave process, thus re-establishing the tunnel.
   b. informs the ops team that this has happened.
   c. updates the timestamp so it doesn't attempt another re-start before the master can update the timestamp itself (the script runs every minute whereas the master check runs every 5 minutes).
   d. echoes the variables to stdout so we can check the user's mailbox for the values if we wish (more for troubleshooting than anything else).

And that's about it. We've been testing it today by replicating the situation and it seems to work just fine. I hope this is of interest / help to anyone else who may experience this issue.

_______________________________________________
Opsview-users mailing list
[email protected]
http://lists.opsview.org/listinfo/opsview-users
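For anyone troubleshooting the cron script above, the staleness test can be separated from the restart action so that a manual run only prints what it sees. The following is a minimal sketch, not part of the original post: the function name is_stale, the STAMP_FILE and THRESHOLD variables, and the choice to treat a missing or unreadable timestamp as stale are all assumptions made for illustration. The file path and the 330-second limit are taken from the script quoted above.

```shell
#!/bin/bash
# Sketch only: reports whether the slave timestamp is stale, restarts nothing.
# STAMP_FILE and THRESHOLD default to the values used in the original script;
# both variable names and the is_stale helper are hypothetical.
STAMP_FILE="${STAMP_FILE:-/usr/local/nagios/tmp/slave-check-time.txt}"
THRESHOLD="${THRESHOLD:-330}"

# is_stale NOW LAST LIMIT
# Succeeds (exit 0) when LAST is not a positive integer, or when it is
# more than LIMIT seconds older than NOW.
is_stale() {
    local now=$1 last=$2 limit=$3
    # Assumption: a missing, empty or corrupt timestamp counts as stale.
    [[ $last =~ ^[1-9][0-9]*$ ]] || return 0
    (( now - last > limit ))
}

now=$(date +%s)
last=$(cat "$STAMP_FILE" 2>/dev/null)

if is_stale "$now" "$last" "$THRESHOLD"; then
    echo "stale: now=$now last=${last:-none} limit=$THRESHOLD"
    # The restart, mail and timestamp reset from the original would go here.
else
    echo "fresh: now=$now last=$last age=$(( now - last ))"
fi
```

Run by hand while the node check is critical, this shows the age of the timestamp without touching opsview-slave. Because the timestamp is written whenever retrieve_opsview_info runs, it is worth reading the value before executing that script manually.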
