I am trying to chase down a couple of periodic issues with running the
reverse SSH master/slave setup of Opsview.

I'm having some time drift issues running Ubuntu under VMware ESX, but
running ntp with a good set of time servers has taken care of the issue
for the most part.

 

The main issue I still haven't figured out is that some sites periodically
drop the reverse tunnel and autossh doesn't seem to be able to
reestablish it.

Several times now I have received a slave down notification from the
master server while checks from the remote server were still able to
pass through.

Connections from the master to the slave, however, get a connection
refused error, so it looks like the reverse tunnel can't be reestablished.
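
One way to confirm this symptom from the master side is a plain TCP
connect to the forwarded port. This is only a minimal sketch and not
part of Opsview; the function name is made up for illustration, and the
host and port are left as parameters because I don't have the tunnel's
listening port in front of me.

```python
import socket


def tunnel_accepts_connections(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds.

    host and port are placeholders for wherever the reverse tunnel
    is expected to listen on the master.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False
```

A False result here only says that nothing is accepting connections on
that port; it does not say why the tunnel went away.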

On the slave server you will see autossh trying multiple times to keep
ssh alive but the connection exits with a 255 status.

 

Here is a log extract of the autossh error condition. At the end is an
opsview-slave restart, which fixes the problem.

 

May  4 18:05:47 nms-site-A-s01 autossh[5935]: starting ssh (count 256)
May  4 18:05:47 nms-site-A-s01 autossh[5935]: ssh child pid is 24945
May  4 18:13:34 nms-site-A-s01 autossh[5935]: ssh exited with error status 255; restarting ssh
May  4 18:13:34 nms-site-A-s01 autossh[5935]: starting ssh (count 257)
May  4 18:13:34 nms-site-A-s01 autossh[5935]: ssh child pid is 26979
May  4 18:15:14 nms-site-A-s01 autossh[5935]: ssh exited with error status 255; restarting ssh
May  4 18:15:14 nms-site-A-s01 autossh[5935]: starting ssh (count 258)
May  4 18:15:14 nms-site-A-s01 autossh[5935]: ssh child pid is 27597
May  4 21:50:35 nms-site-A-s01 autossh[5935]: received signal to exit (15)
May  4 21:50:40 nms-site-A-s01 autossh[26862]: port set to 0, monitoring disabled
May  4 21:50:40 nms-site-A-s01 autossh[26863]: starting ssh (count 1)
May  4 21:50:40 nms-site-A-s01 autossh[26863]: ssh child pid is 26864

 

I would like to fix or work around this problem.

Issues are:

- When the condition occurs, the master knows about the problem but
  can't send a restart command to the slave.

- The slave might not know anything is wrong, so it takes no action to
  fix it.

- If the slave could detect the error condition, maybe an event handler
  could restart the opsview-slave service?
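
To make the last idea concrete, here is a rough sketch of how the slave
could detect the condition from its own syslog. It assumes each log
entry is on a single line, as in the extract above, and that repeated
"status 255" exits within a short window are a good enough signal. The
function names, the threshold of 2, and the 15-minute window are all
made up for illustration, and the actual restart of opsview-slave is
left out.

```python
import re
from datetime import datetime, timedelta

# Matches syslog lines such as:
#   May  4 18:13:34 nms-site-A-s01 autossh[5935]: ssh exited with error status 255; restarting ssh
FAILURE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+autossh\[\d+\]:\s+"
    r"ssh exited with error status 255"
)


def failure_times(lines, year):
    """Yield a datetime for every autossh 'status 255' line."""
    for line in lines:
        m = FAILURE_RE.match(line)
        if m:
            # Syslog timestamps carry no year, so the caller supplies one.
            ts = " ".join(m.group("ts").split())
            yield datetime.strptime(f"{year} {ts}", "%Y %b %d %H:%M:%S")


def should_restart(lines, year, threshold=2, window=timedelta(minutes=15)):
    """Return True if at least `threshold` failures fall within `window`.

    threshold and window are illustrative defaults, not tuned values.
    """
    times = sorted(failure_times(lines, year))
    for i in range(len(times) - threshold + 1):
        if times[i + threshold - 1] - times[i] <= window:
            return True
    return False
```

Something like this could be run from cron or wrapped as a local check
on the slave, with the event handler doing the restart when it returns
True. Since autossh also logs "status 255" for drops it recovers from,
the threshold would need tuning per site.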

 

Anyway, any suggestions would be appreciated. I would like the monitoring
system to be somewhat self-healing unless true downtime is occurring.

This is a case where 24/7 engineers get paged in the middle of the
night for something that is not true downtime of a site.

 

James Whittington

VC3, Inc.

 

 

_______________________________________________
Opsview-users mailing list
[email protected]
http://lists.opsview.org/listinfo/opsview-users
