Hello,

Thanks a lot for providing this script, but I have run into a problem:

The Slave-Node check returns "CRITICAL" (slave might be down). The check 
results from the machines are still passed to the master normally, even 
though the check returns critical. When I manually restart opsview-slave, 
the tunnel is re-established and the check returns "OK". 

But when I manually execute the retrieve_opsview_slave script (while the 
Slave-Node check returns critical), it returns "ok = nsca = 0" and writes 
the timestamp to the temp .txt file. The timestamp seems to be okay, 
because the small script does not restart the opsview-slave process, even 
though the node check returns critical. 

Any ideas?

Thanks a lot for help!

Greets
Mario 




From:
Andrew Hall <[email protected]>
To:
opsview-users <[email protected]>
Date:
22.05.2009 18:17
Subject:
Re: [opsview-users] periodic autossh issues with reverse tunnel dropping
Sent by:
[email protected]



On 2009-05-22 10:25, Ton Voon wrote:

> Can you model this nagios plugin on check_opsview_slave_cluster? Maybe
> call it check_opsview_slave_communication?

> For extra bonus points, you could create this service automatically on
> all slaves when the reverse_ssh flag is set.

OK, time to fess up - I couldn't write a perl script if my life
depended on it. I can just about read the work of others, but the
Llama book continues to gather dust on my bookshelf.

However, I have written an extraordinarily simple bash script which
after a fair bit of testing seems to function just fine. Unfortunately
it resides outside of Opsview but if anyone wishes to take the logic
and run with it then please feel free.

Anyhow, if you're interested here's what we now do to fix this issue...

1. Edit the retrieve_opsview_info script on the slave and add this
line just under use strict;

system ("/bin/date +%s > /usr/local/nagios/tmp/slave-check-time.txt");

(See, I can add bash into perl just fine !)

2. Add the following script which cron runs every minute:

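#!/bin/bash
# Seconds since the master last ran retrieve_opsview_info on this slave.
NOW=$(/bin/date +%s)
SLAVECHECK=$(/bin/cat /usr/local/nagios/tmp/slave-check-time.txt)
DIFFERENCE=$((NOW - SLAVECHECK))
# Still one logical command: every line ends in && or | so the shell keeps
# reading, and mail line-wrapping can no longer split it mid-argument.
[ "$DIFFERENCE" -gt 330 ] && /etc/init.d/opsview-slave restart &&
/bin/echo "Slave connection down! Re-starting opsview-slave service on $(hostname)..." |
/bin/mail -s "opsview slave restart: $(hostname)" [email protected] &&
/bin/date +%s > /usr/local/nagios/tmp/slave-check-time.txt &&
echo "$NOW $SLAVECHECK $DIFFERENCE"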

A frighteningly simple one liner that I'm sure could be ripped to
shreds / re-written much better but hey - it works.
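
One case the script does not cover is a missing or empty timestamp file, where the arithmetic expansion fails with a syntax error instead of triggering a restart. The following is only a sketch of a guard. The helper name stamp_age is hypothetical (it is not part of Opsview), and printing "unknown" for an unreadable stamp is an assumption about what a caller would want.

```shell
#!/bin/bash
# Hypothetical helper, not part of Opsview: print the age in seconds of the
# epoch timestamp stored in a file, relative to a given "now".
# Assumption: a missing, empty or non-numeric stamp prints "unknown" so the
# caller can decide what to do instead of hitting an arithmetic error.
stamp_age() {
    local file=$1 now=$2 stamp
    stamp=$(cat "$file" 2>/dev/null)
    case $stamp in
        ''|*[!0-9]*) echo "unknown" ;;
        # 10# forces base 10 so a stamp with a leading zero is not read as octal.
        *)           echo $((now - 10#$stamp)) ;;
    esac
}

# Usage with a throwaway file rather than the real stamp under /usr/local/nagios/tmp.
tmp=$(mktemp)
echo 1000000 > "$tmp"
stamp_age "$tmp" 1000331    # prints 331
: > "$tmp"                  # truncate the file to simulate an empty stamp
stamp_age "$tmp" 1000331    # prints unknown
rm -f "$tmp"
```

Whether "unknown" should lead to a restart or just an alert is a policy choice the original script leaves open.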

So essentially...

1. The check from the master runs as normal and the timestamp on the
slave updates.

2. The script checks the time and, as less than 330 seconds have passed,
it does nothing.

3. The listening port on the master dies and the timestamp on the
slave no longer updates.

4. The script checks the time and more than 330 seconds have passed, so
it...

a. re-starts the opsview-slave process thus re-establishing the tunnel.

b. informs the ops team that this has happened.

c. updates the timestamp so it doesn't attempt another re-start before
the master can update the timestamp itself (the script runs every
minute whereas the master check runs every 5 minutes).

d. echoes the variables to stdout so we can check the user's mailbox
for the values if we wish (more for troubleshooting than anything
else).
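
The sequence above, and in particular why step c matters, can be illustrated with a small simulation. This is only a sketch under stated assumptions: time is simulated in seconds and no real service is touched, master checks land on multiples of 300 seconds, cron fires every 60 seconds, and a restart brings the tunnel back immediately. The function name simulate and its arguments are made up for this example.

```shell
#!/bin/bash
# Simulated timeline; nothing here restarts a real service.
# Arguments: time the tunnel dies, end of simulation, and whether the cron
# job resets the stamp after a restart (1 = step c applied, 0 = skipped).
simulate() {
    local dies_at=$1 end=$2 reset=$3 threshold=330
    local stamp=0 up=1 died=0 restarts=0 t
    for ((t = 0; t <= end; t += 60)); do
        # The tunnel drops once, at the first tick at or after dies_at.
        if ((died == 0 && t >= dies_at)); then
            up=0
            died=1
        fi
        # Master check every 300 s; it only refreshes the stamp while the tunnel is up.
        if ((up == 1 && t % 300 == 0)); then
            stamp=$t
        fi
        # Cron job every 60 s: restart when the stamp is older than the threshold.
        if ((t - stamp > threshold)); then
            restarts=$((restarts + 1))
            up=1    # assumption: the restart re-establishes the tunnel at once
            if ((reset == 1)); then
                stamp=$t
            fi
        fi
    done
    echo "restarts=$restarts"
}

simulate 360 1200 1    # with the stamp reset: prints restarts=1
simulate 360 1200 0    # without it: prints restarts=4
```

Without the reset, the cron job keeps restarting the service every minute until the next master check refreshes the stamp, which is the behaviour step c is there to prevent.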

And that's about it.

We've been testing it today by replicating the situation and it seems
to work just fine.

I hope this is of interest / help to anyone else who may experience this 
issue.
_______________________________________________
Opsview-users mailing list
[email protected]
http://lists.opsview.org/listinfo/opsview-users

