Hello, I am running SA v4.1.1609 on a server running Windows Server 2003. I have the software running as a service, performing about three dozen checks: a mix of pings, Windows services, etc. Each check outputs to the default web page, which is then rendered using a template. Checks run every 3 minutes M-F 9-5, and every 10 minutes otherwise.
All checks and alerts are working well. However, every few weeks the service crashes and I get an "Event ID 7034 The Servers Alive service terminated unexpectedly. It has done this 1 time(s)" message in the system event log. There is no other information available, and there are no other unusual events recorded in the OS logs at or near the time of the SA service failure. My problems are:

1) When this happens, the service cannot be reliably restarted. It can be "started" via the Services MMC snap-in, but the checks do not run and the web page does not get updated. Further attempts to manage the service via the MMC are met with a "service did not respond in a timely blah blah blah." If I terminate the restarted service using Task Manager and then run the SA application (instead of using the service), the check cycle runs once and updates the web page, but after that no checks are run and the web page is not updated. Also, attempts to exit the application are ignored, and Task Manager shows it as Not Responding. Only a server reboot will fix the problem. Can anyone suggest why this might be happening, places to look for clues as to the cause, or ways to recover the SA service short of a reboot?

2) I do not have a reliable way of being alerted when the SA service has crashed on this server. I use another instance of SA on another server to check whether the SA service is running on the first server, but it fails to alert me when the SA service crashes under these circumstances. (Yes Dirk, I have the check and alert configured properly.) The way I discover the failure is by visiting the web page and seeing that the last updated time is way old. Has anyone ever tried to compare the last updated time on the HTML page to the current server time, and then alert if the difference is greater than the check cycle frequency? Are there other methods of monitoring the health of SA?
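To make the idea in 2) concrete, here is a minimal sketch in Python of the comparison I have in mind. It is not anything SA provides. The "Last updated:" marker, the timestamp format, and the function names are all assumptions made up for illustration; the real pattern would have to match whatever the template actually writes.

```python
import re
from datetime import datetime, timedelta

# Hypothetical marker and timestamp format (assumptions, not SA's actual output).
# Adjust both to match what the page template really writes.
STAMP_RE = re.compile(r"Last updated:\s*(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")
STAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_last_updated(html):
    """Return the timestamp found in the page, or None if the marker is missing."""
    match = STAMP_RE.search(html)
    if match is None:
        return None
    return datetime.strptime(match.group(1), STAMP_FORMAT)


def is_stale(html, now, max_age):
    """True if the page has no timestamp or its timestamp is older than max_age."""
    stamp = parse_last_updated(html)
    if stamp is None:
        # A page without a readable timestamp is treated as a failure too.
        return True
    return now - stamp > max_age


if __name__ == "__main__":
    # Sample page text and clock values, invented for the example.
    sample = "<p>Last updated: 2024-01-15 10:00:00</p>"
    now = datetime(2024, 1, 15, 10, 25, 0)
    # 25 minutes old against a 20-minute limit (twice the 10-minute off-hours cycle).
    print(is_stale(sample, now, timedelta(minutes=20)))
```

A check like this would presumably need to run from a scheduler on a different machine than the one hosting SA, with a limit somewhat larger than the longest check cycle so a single slow cycle does not raise an alert. Since the page timestamp comes from the monitored server's clock, the two machines' clocks would also need to be reasonably in sync.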
Thanks for reading this longish post and for any suggestions (related to SA administration and troubleshooting) readers can provide.

David