On Monday 03 November 2003 5:19 pm, Steve Lane wrote:
> Hi all:
>
> We maintain a number of web-based applications that use postgres as
> the back end....
>
> The client responded that surely this problem of monitoring a
> database-backed web app was a known, solved problem, and wanted to
> know what other people did to solve the problem.

Do your best to anticipate holes, and then plug the rest as you find 
them. Here's a cautionary tale (different OS/DB/server but the same 
idea - this one involves NT/IIS/ColdFusion/MSSQL).

Back in the day at a dot-com, the colo would "monitor the servers". By 
monitor they meant "ping". Of course, IIS had a habit of playing dead 
while pings kept working fine.

So the colo added a port 80 monitor to the servers. This would alarm 
if a connection to port 80 was refused. It turned out, however, that 
IIS could die leaving a Dr. Watson message on the screen. It would 
continue accepting connections on port 80 but do nothing with them - 
at least until the Dr. Watson warning was clicked, at which point the 
alarm would finally go off. Useless.
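
In today's terms, that check amounted to nothing more than this (a 
minimal Python sketch for illustration, not what the colo actually 
ran):

    import socket

    def port_open(host, port=80, timeout=5):
        # True if something accepts the TCP connection. This is
        # exactly the check that fooled us: IIS kept accepting
        # connections even while Dr. Watson had it frozen.
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False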

So we switched to regex testing. The monitoring system would fetch a 
special page - something like /test.html - and make sure the correct 
text, "IIS running", was returned.
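
Again in Python terms (a sketch - the real monitor was the colo's 
tool, and the URL and pattern here are just examples):

    import re
    import urllib.request

    def page_ok(url, pattern, timeout=10):
        # Fetch the test page and require the expected text.
        # A refused connection, a timeout, or wrong content all
        # count as failures - unlike the bare port check above.
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                body = resp.read().decode("utf-8", errors="replace")
        except OSError:
            return False
        return re.search(pattern, body) is not None

    # e.g. page_ok("http://www.example.com/test.html", r"IIS running")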

But it turned out that ColdFusion could die on its own. So we changed 
the test page to /test.cf (or whatever ColdFusion used as an 
extension - I don't care to remember). That page concatenated a 
couple of strings and returned the result. Cool, we were much better 
at trapping events.

But what about the database? We changed the ColdFusion page to run a 
very simple query - something like "select 0" (see a thread from a 
couple of months back on "what's the fastest query", which had to do 
with PG server monitoring). If it got the correct result, the page 
would return something like "db running".
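
The server-side test page looked roughly like this - translated into 
Python with psycopg2 for illustration (ours was ColdFusion against 
MSSQL, and the host/db/user below are made up):

    import psycopg2

    def db_check():
        # Run the cheapest possible query; any exception or
        # wrong result means the database is in trouble.
        try:
            conn = psycopg2.connect(host="dbhost", dbname="app",
                                    user="monitor")
            try:
                cur = conn.cursor()
                cur.execute("select 0")
                ok = cur.fetchone() == (0,)
            finally:
                conn.close()
        except psycopg2.Error:
            return "db error"
        return "db running" if ok else "db error"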

We were happy with this arrangement until we discovered that _parts_ 
of ColdFusion could die while the rest ran fine?!? The tests passed 
most of the time, but when CF "half-died", one page of the site that 
pulled data from another web site would stop working.

So we switched everything to a Java-based app server and were able to 
handle twice the load with 1/7th the machines, and crashing became a 
thing of the past - but I digress.

We used the same basic tests on the new server: a static page served 
by the front end, a simple page served by the app server, and one 
that checked the database server. The colo monitors watched the 
database test page, and the others allowed for some quick-n-dirty 
remote diagnosis ("hmm, the front end and app server are running but 
the db isn't responding to the app server" could be determined in 30 
seconds from any browser).

In addition, we automatically checked the pages from our office, and I 
checked from a server at home. The checks ran once per minute, and 3 
consecutive failures would trigger a page. I'm sure there are many 
things that could have fooled us, but they were rare enough that we 
never saw them - the monitoring worked like a charm.
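
The whole loop is simple enough to sketch (hypothetical Python; 
check() would be a page test like the one above, and alert() whatever 
sends the page):

    import time

    def watch(urls, check, alert, interval=60, threshold=3):
        # Poll every test page once per interval; page someone
        # after `threshold` consecutive failures of the same URL,
        # and reset the count as soon as the page recovers.
        fails = {u: 0 for u in urls}
        while True:
            for u in urls:
                if check(u):
                    fails[u] = 0
                else:
                    fails[u] += 1
                    if fails[u] == threshold:
                        alert(u)
            time.sleep(interval)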

Don't forget that you also need to make sure the monitoring itself is 
happening. It's easy to lose track of a well-written monitoring app 
when there are no failures, and only discover that someone turned the 
monitor program off when a real failure happens. We figured that the 
combination of our monitoring and the colo's offered enough 
redundancy.
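
If you want to automate that part too - we relied on redundancy, so 
this is a suggestion rather than what we ran - a cheap trick is a 
heartbeat the monitor refreshes on every pass, checked for staleness 
from another box (the path below is hypothetical):

    import os
    import time

    HEARTBEAT = "/var/run/webmon.heartbeat"  # hypothetical path

    def beat():
        # Called by the monitor on every polling pass.
        with open(HEARTBEAT, "w") as f:
            f.write(str(time.time()))

    def monitor_alive(max_age=300):
        # Run from cron on a different machine: if the heartbeat
        # file goes stale, the monitor itself has died.
        try:
            return time.time() - os.path.getmtime(HEARTBEAT) < max_age
        except OSError:
            return False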

Obviously it's best if at least some of the monitoring comes from 
off-site, and never trust a machine to monitor itself.

BTW, some of these server test pages can also be used by a load 
balancer to fail a server out of a cluster, so they are handy for 
more than just monitoring.

Oh, to answer your other question - the problem has been "solved". You 
can pay for very expensive monitoring from a variety of third 
parties.

Cheers,
Steve

