
I am running SA v4.1.1609 on a server running Windows Server 2003.  I
have the software running as a service, performing three dozen checks or
so, a mix of pings, Windows services, etc.  Each check outputs to the
default web page, which is then rendered using a template.  Checks are
done every three minutes M-F 9-5, and every 10 minutes otherwise.

All checks and alerts are working well; however, every few weeks the
service crashes and I get an "Event ID 7034 The Servers Alive service
terminated unexpectedly.  It has done this 1 time(s)" message in the
system event log.  There is no other information available.  There are
no other unusual events recorded in the OS logs at or near the time of
the SA service failure.

My problems are: 

1) When this happens, the service cannot be reliably re-started.  It can
be "started" via the services MMC snap-in, but the checks do not run and
the web page does not get updated.  Further attempts to manage the
service via the MMC are met with a "service did not respond in a timely
blah blah blah."  If I terminate the re-started service using task
manager, and then run the SA application software (instead of using the
service), the check cycle will run once and update the web page, but
after that, checks are not run and the web page is not updated.  Also,
attempts to exit the application are ignored, and task manager shows it
as Not Responding.  Only a server re-boot will fix the problem.

Can anyone suggest why this might be happening, places to look for clues
as to the cause, or ways to recover the SA service short of a re-boot?

2) I do not have a reliable way of being alerted when the SA service has
crashed on this server.  I use another instance of SA on another
server to check to see if the SA service is running on the first server,
but it fails to alert me when the SA service crashes under these
circumstances.  (Yes Dirk, I have the check and alert configured
properly.)  The way I discover the failure is I visit the web page and
the last updated time is way old.  Has anyone ever tried to compare the
last updated time on the HTML page to the current server time, and then
alert if the difference is greater than the check cycle frequency?  Are
there other methods of monitoring the health of SA?
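
To make the page-timestamp idea concrete, here is a rough Python sketch of
what I have in mind.  It is only an illustration under assumptions: the
"Last updated: YYYY-MM-DD HH:MM:SS" text, the regular expression, the
20-minute threshold (twice the longest 10-minute cycle), and the function
names are all made up for the example and are not anything SA provides.
The pattern would have to be adjusted to whatever the real template emits.

```python
import re
import urllib.request
from datetime import datetime, timedelta

# Assumption: the rendered page contains text such as
# "Last updated: 2005-03-14 09:27:00". Adjust both values below
# to match the timestamp the real template actually writes.
STAMP_PATTERN = re.compile(
    r"Last updated:\s*(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})"
)
STAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def page_age(html, now):
    """Return the age of the page's timestamp, or None if none is found."""
    match = STAMP_PATTERN.search(html)
    if match is None:
        return None
    return now - datetime.strptime(match.group(1), STAMP_FORMAT)


def is_stale(html, now, max_age=timedelta(minutes=20)):
    """True if the timestamp is missing or older than max_age."""
    age = page_age(html, now)
    return age is None or age > max_age


def check_url(url, max_age=timedelta(minutes=20)):
    """Fetch the status page and report whether it looks stale.

    The URL is supplied by the caller; nothing here is specific to SA.
    """
    with urllib.request.urlopen(url, timeout=30) as response:
        html = response.read().decode("utf-8", errors="replace")
    return is_stale(html, datetime.now(), max_age)
```

Run from a scheduled task on the second server, something like
check_url("http://first-server/status.html") (a hypothetical address)
returning True would be the cue to send an alert.  The comparison only
makes sense if both servers keep reasonably synchronised clocks, since the
page carries the first server's local time.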

Thanks for reading this longish post and for any suggestions (related to
SA administration and troubleshooting) readers can provide.


